What Is an Inference Runner?

An inference runner is a component of a large-scale software system that hosts machine learning (ML) algorithms (or “models”), delivers data to them, and returns their outputs. It enables the inference phase of the ML lifecycle. Because an ML model is only the code that represents the algorithm, it needs surrounding application infrastructure (the large-scale software system mentioned above) to do useful work. An inference runner makes it easier to put a trained ML model into a host system that provides that infrastructure. The host system could be as simple as a clustered web application that receives data via a REST interface or as comprehensive as a stream processing framework that feeds real-time data directly to the model.

How Does an Inference Runner Work?

The core functionality of an inference runner includes an application programming interface (API) that lets the ML model integrate with the host system. This is how the host system communicates with the model and ensures data is sent in a way that the model will understand. In some cases, extra infrastructure like a virtual machine (VM) is required to run the ML code. The inference runner is responsible for maintaining that infrastructure, which often includes provisions for high availability and security.
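The API role described above can be sketched in a few lines of Python. This is a minimal illustration rather than any product's interface: the `InferenceRunner` class, its `infer` method, and the `toy_model` function are hypothetical names chosen for this example. The runner's job here is to turn a host-side record into the ordered feature vector the model expects, and to reject records the model could not understand.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass
class InferenceRunner:
    """Hypothetical adapter between a host system and a trained model."""

    predict_fn: Callable[[Sequence[float]], float]  # the model's entry point
    feature_names: Sequence[str]                    # input order the model expects

    def infer(self, record: Mapping[str, float]) -> float:
        # Reject records that lack a feature the model needs.
        missing = [name for name in self.feature_names if name not in record]
        if missing:
            raise ValueError(f"record is missing features: {missing}")
        # Translate the host's record into the model's ordered input vector.
        features = [float(record[name]) for name in self.feature_names]
        return self.predict_fn(features)


def toy_model(features):
    # Stand-in for a trained model: a fixed linear scorer (assumed weights).
    weights = [0.5, 2.0]
    return sum(w * x for w, x in zip(weights, features))


runner = InferenceRunner(predict_fn=toy_model, feature_names=["amount", "risk"])
score = runner.infer({"amount": 10, "risk": 1.5})
print(score)  # 8.0
```

The host only ever calls `infer` with its own record format; everything model-specific stays behind that one method.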

Why Is an Inference Runner Necessary?

An inference runner is necessary when the host system is written in a different programming language from the ML model. For example, if your host is written in Java and your model is in Python, integrating the two is harder. An inference runner provides the interface so that the host can treat the model almost as if it were natively embedded into the host. The ML model runs in a separate process from the host system, and the inference runner manages the interprocess communication, typically through remote procedure calls (RPC). Developers are, therefore, not required to build the communications layer themselves.
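To make the separate-process arrangement concrete, here is a small sketch in which a host starts the model in its own process and exchanges requests and responses with it. It uses line-delimited JSON over the child's standard input and output as a simplified stand-in for a real RPC framework, and both sides are Python only to keep the example self-contained. `ModelProcess`, `WORKER_SOURCE`, and the summing "model" are hypothetical names for illustration.

```python
import json
import subprocess
import sys

# Source of the model process. It reads one JSON request per line on stdin
# and writes one JSON response per line on stdout.
WORKER_SOURCE = r'''
import json, sys

def predict(features):
    # Stand-in for a trained model: the sum of the inputs.
    return sum(features)

for line in sys.stdin:
    request = json.loads(line)
    response = {"id": request["id"], "result": predict(request["features"])}
    sys.stdout.write(json.dumps(response) + "\n")
    sys.stdout.flush()
'''


class ModelProcess:
    """Hypothetical host-side handle to a model running in its own process."""

    def __init__(self):
        self._proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        self._next_id = 0

    def predict(self, features):
        # Send one request and block until the matching response arrives.
        self._next_id += 1
        request = {"id": self._next_id, "features": features}
        self._proc.stdin.write(json.dumps(request) + "\n")
        self._proc.stdin.flush()
        response = json.loads(self._proc.stdout.readline())
        if response["id"] != self._next_id:
            raise RuntimeError("response does not match request")
        return response["result"]

    def close(self):
        self._proc.stdin.close()  # EOF ends the worker's read loop
        self._proc.wait(timeout=5)


model = ModelProcess()
try:
    result = model.predict([1, 2, 3])
finally:
    model.close()
print(result)  # 6
```

The caller sees only `predict`; starting the process, framing messages, and shutting down are the runner's concern, which is the work the paragraph above says developers no longer have to do themselves.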

An inference runner for the Python programming language is one of the most useful types of inference runners because Python is the most popular language among the data scientists who build and train models. At the same time, many data management platforms that can host ML models are written in Java, creating a language mismatch. In this situation, the inference runner acts as the management layer for running the Python model in a host system based on Java. Since Python is an interpreted language, its code needs to run inside a Python VM. Therefore, an inference runner on a Java host must orchestrate the deployment of Python VMs to let the Python ML models run. Hazelcast Jet is an example of a stream processing engine that supports Python ML models. It automatically sets up and runs the Python VMs and uses gRPC to send data to the Python code, so DevOps and data engineers do not have to maintain the Python VMs independently.
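On the Python side, the code a data scientist hands over can be very small. The sketch below assumes the host passes the Python code a batch of strings and expects a list of the same length back, which is the convention Hazelcast Jet's Python integration is generally described as using. The function name `transform_list`, the JSON event layout, and the threshold "model" are assumptions for illustration and should be checked against the Jet documentation before use.

```python
import json


def score(event):
    # Stand-in for a trained model: flag large transactions (assumed rule).
    return 1.0 if event["amount"] > 1000 else 0.0


def transform_list(input_list):
    """Score a batch of JSON-encoded events; return one JSON string per input."""
    results = []
    for item in input_list:
        event = json.loads(item)
        results.append(json.dumps({"id": event["id"], "score": score(event)}))
    return results


# Local check, with no host involved: the runner would call transform_list
# with batches taken from the stream.
batch = ['{"id": 1, "amount": 50}', '{"id": 2, "amount": 5000}']
scored = transform_list(batch)
print(scored)
```

Working on batches rather than single items lets the runner amortize the cost of each cross-process call over many records.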

Related Topics

Machine Learning Inference

Relevant Resources

Video

Spotlight on Stream Processing and Machine Learning

David Brimley, Financial Services Industry Consultant, Hazelcast, speaks to FinextraTV about what financial services firms are doing with machine learning and what firms should consider as they progress through their machine learning journey. He explains how streaming data fits in financial services, how firms can ease into streaming without going through a complete re-architecture of their systems, and how financial services technologists need to keep an eye on developments in in-memory computing, cloud, and containerization.

Webinar | Video | 60 minutes

Tech Talk: Machine Learning at Scale Using Distributed Stream Processing

In this talk, Marko will show one approach which allows you to write a low-latency, auto-parallelized and distributed stream processing pipeline in Java that seamlessly integrates with a data scientist's work taken in almost unchanged form from their Python development environment. The talk includes a live demo using the command line and going through some Python and Java code snippets.

Webinar | Video | 60 minutes

Key Considerations for Optimal Machine Learning Deployments

Machine learning (ML) is being used almost everywhere, but ubiquity has not brought simplicity. Considering only the operationalization aspect of ML, deploying models into production, especially in real-time environments, can be inefficient and time-consuming. Common approaches may not perform and scale to the levels needed. These challenges are especially acute for businesses that have not properly planned out their data science initiatives.