Machine Learning Inference at Scale
Machine learning projects can be split into two phases:

- Training
- Inference

During the training phase, data science teams obtain, analyze, and understand the available data and generalize it into a mathematical model. The model uses the features of the sample data to reason about data it has never seen. Although the model can be completely custom code, it is usually based on proven machine learning algorithms such as Naïve Bayes, K-Means, Linear Regression, Deep Learning, Random Forests or Decision Trees. The act of building the model from the sample (training) data is referred to as training.
The machine learning inference phase refers to using the model to predict an unknown property of the input data. This requires deploying the model into a production environment and operating it.
Operating Machine Learning
The most straightforward way to deploy a model is to wrap it in a REST web service and let other applications invoke the inference service remotely. Many machine learning frameworks provide such a service out of the box to support simple deployments that don’t deal with much data.
What Hazelcast Jet adds to this story is a simple way to deploy the model so that it is automatically parallelized and scaled out across a cluster of machines.

Jet uses its parallel, distributed and resilient execution engine to turn the model into a high-performance inference service. To use all available CPU cores, Jet spins up multiple parallel instances of the model and spreads the inference requests among them. The Jet cluster is elastic; to scale with the workload, add or remove cluster members on the fly with no downtime.
Another trick is using a pipelined design instead of a request-reply pattern. It allows Jet to batch inference requests together and reduce the fixed overhead of serving each request individually, which significantly improves the overall throughput of the model. The pipelined design requires a change in the client’s workflow: instead of calling the inference service directly, the client sends its inference request to an inbox. The inbox may be implemented using a message broker such as a JMS topic, a Kafka topic or a Hazelcast distributed topic. Jet watches the inbox and groups multiple requests together to use the model service efficiently. It uses smart batching, where the batch size adapts to the data volume, to keep latency low at all times. The inference results are published to an outbox for callers to pick them up.
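Putting these pieces together, the skeleton of such a pipeline could look roughly like the Java sketch below. It assumes Jet 4.x APIs, a hypothetical `MyModel` class wrapping the trained model, and Hazelcast maps named `inbox` and `outbox` (with the event journal enabled on `inbox`) standing in for the message broker; the smart batching itself happens inside Jet's execution engine, not in user code.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Util;
import com.hazelcast.jet.pipeline.JournalInitialPosition;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.ServiceFactories;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class InferencePipeline {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        // "inbox" is an IMap with the event journal enabled, standing in for the broker.
        p.readFrom(Sources.<String, String>mapJournal("inbox",
                        JournalInitialPosition.START_FROM_CURRENT))
         .withoutTimestamps()
         // One model instance per member, shared by all parallel workers on that member.
         .mapUsingService(
                 ServiceFactories.sharedService(ctx -> MyModel.load()),
                 (model, request) -> Util.entry(request.getKey(), model.score(request.getValue())))
         // Results land in the "outbox" map for callers to pick up.
         .writeTo(Sinks.map("outbox"));

        JetInstance jet = Jet.bootstrappedInstance();
        jet.newJob(p).join();
    }
}
```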
Models Supported
Python models
Python is the lingua franca of the data science world. There is a wide ecosystem of libraries and tools for building and training models in Python: TensorFlow, Keras, Theano, Scikit-learn and PyTorch, to name a few. Jet can host any Python model.
Upon model deployment, Jet’s JVM runtime launches Python processes and establishes bi-directional gRPC communication channels to stream inference requests through them. The model thus runs natively in a Python process that is completely managed by Jet. Jet can be tuned to spin up multiple Python processes on each machine to make use of multicore processors.
Jet makes sure the Python code is distributed to all machines that participate in the cluster. If you add another machine to a Jet cluster, Jet creates a directory on it and deploys the Python code there. Moreover, Jet can install all the required Python libraries to prepare the runtime for your Python model.
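To make this concrete, here is a hedged sketch using Jet's `hazelcast-jet-python` module (Jet 4.x). The base directory, handler module name and test source are placeholders; the handler module is expected to expose a `transform_list(items)` function that turns a batch of input strings into output strings.

```java
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.test.TestSources;
import com.hazelcast.jet.python.PythonServiceConfig;

import static com.hazelcast.jet.python.PythonTransforms.mapUsingPython;

public class PythonInferencePipeline {
    public static Pipeline build() {
        Pipeline p = Pipeline.create();
        p.readFrom(TestSources.itemStream(100))      // stand-in for the real inbox source
         .withoutTimestamps()
         .map(Object::toString)
         // Jet starts Python workers on each member and streams batches of items
         // to transform_list(items) in handler.py over gRPC.
         .apply(mapUsingPython(new PythonServiceConfig()
                 .setBaseDir("/path/to/model_dir")   // placeholder dir with handler.py, requirements.txt
                 .setHandlerModule("handler")))
         .setLocalParallelism(2)                     // Python processes per member
         .writeTo(Sinks.logger());
        return p;
    }
}
```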
Documentation: https://docs.hazelcast.org/docs/jet/latest/manual/#map-using-python
Code sample: https://github.com/hazelcast/hazelcast-jet/tree/master/examples/python
Java models
Java models are used for high-performance inference execution. Popular Java model libraries include JPMML, TensorFlow for Java, MXNet, the XGBoost JVM Package and H2O.
As with Python, the model is packaged as a Jet Job resource. The Job usually includes the model inference code (the ML library) and a serialized model. Jet runs Java models in-process with the cluster members, so there is no need to start extra processes and there is no communication overhead (serialization, deserialization, networking). This makes Java models the best-performing option. The inference job can be configured to use one model instance per JVM or multiple model instances.
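As a rough sketch (assuming Jet 4.x APIs), running a Java model in-process amounts to wrapping it in a `ServiceFactory` and mapping over it. `MyModel` with its `loadFrom`/`predict` methods is a placeholder for whichever library you use (JPMML Evaluator, TensorFlow for Java, and so on).

```java
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.ServiceFactories;
import com.hazelcast.jet.pipeline.ServiceFactory;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class JavaModelPipeline {
    public static Pipeline build() {
        // One model instance per member, shared by all parallel workers;
        // ServiceFactories.nonSharedService(...) gives one instance per worker instead.
        ServiceFactory<?, MyModel> modelService =
                ServiceFactories.sharedService(ctx -> MyModel.loadFrom("model.bin"));

        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String>list("requests"))
         .mapUsingService(modelService, (model, request) -> model.predict(request))
         .writeTo(Sinks.list("predictions"));
        return p;
    }
}
```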
Documentation: https://docs.hazelcast.org/docs/jet/latest/manual/#machine-learning-model-prediction
Code samples:
Remote services
We started this article by saying that exposing a model as an RPC service is simple but requires extra effort to scale. Jet supports this pattern, too: a Jet Job can invoke a remote inference service. The model isn’t managed by Jet in this case, so the operational and performance advantages are gone. Jet still provides the convenience of smart batching, inbox/outbox connectors and many pipeline operators. Smart batching works only if the RPC service can operate on batches of input items.
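A hedged sketch of this setup, using Java's built-in `HttpClient` as a stand-in for the real RPC client and a placeholder endpoint URL:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.ServiceFactories;
import com.hazelcast.jet.pipeline.ServiceFactory;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class RemoteModelPipeline {
    public static Pipeline build() {
        // A non-blocking HTTP client per member; the model service itself runs
        // outside the Jet cluster and is not managed by Jet.
        ServiceFactory<?, HttpClient> httpService =
                ServiceFactories.sharedService(ctx -> HttpClient.newHttpClient());

        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String>list("requests"))
         // Async calls keep many requests in flight instead of blocking on each round trip.
         .mapUsingServiceAsync(httpService, (client, request) -> {
             HttpRequest httpRequest = HttpRequest.newBuilder()
                     .uri(URI.create("http://model-service:8080/predict")) // placeholder endpoint
                     .POST(HttpRequest.BodyPublishers.ofString(request))
                     .build();
             return client.sendAsync(httpRequest, HttpResponse.BodyHandlers.ofString())
                          .thenApply(HttpResponse::body);
         })
         .writeTo(Sinks.list("predictions"));
        return p;
    }
}
```

If the remote service accepts whole batches, recent Jet releases also offer `mapUsingServiceAsyncBatched`, which hands the service a list of items at a time and maps directly onto the smart batching described above.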
Benefits of this setup
- Isolating the model service and the data pipeline
- Sharing the model among many Jet pipelines
Code samples:
Execution Mode Overview
| Execution Mode | Java Model | Python Model | Remote Model |
|---|---|---|---|
| Model managed by Jet | Yes | Yes | No |
| Model shared between Jobs | No | No | Yes |
| Jet ↔ Model Communication | Shared memory | gRPC (processes colocated) | Usually gRPC or REST (processes on different machines) |
| Throughput (single node) | 1M / sec | 50k / sec | Depends |
| Requirements | Model runs in the JVM | Python runtime installed on all cluster machines | Model service started prior to the Jet Job; model available as an RPC service |
Framework Integration Overview
| Framework | Execution Mode | Code Sample |
|---|---|---|
| H2O | Java | Code Sample |
| TensorFlow for Java | Java | Code Sample |
| Custom Java Model | Java | Code Sample |
| PMML | Java | N/A, use JPMML Evaluator as a Custom Java Model |
| MXNet | Java | N/A, use the MXNet Java Inference API as a Custom Java Model |
| XGBoost | Java | N/A, use the XGBoost JVM Package as a Custom Java Model |
| Keras, Theano, Scikit-learn or PyTorch | Python | N/A, use the Custom Python Model |
| Custom Python Model | Python | Code Sample |
| Remote gRPC service | Remote | Code Sample |
| Remote TensorFlow service | Remote | Code Sample |