By Vladimir Schreiner

Vladimir is a product manager with an engineering background and deep expertise in stream processing and real-time data pipelines. Ten years of building internal software platforms and development infrastructure have made him passionate about new technologies and finding ways to simplify data processing. Vladimir co-authored two white papers on the topic: Understanding Stream Processing: Fast Processing of Infinite and Big Data, and A Reference Guide to Stream Processing. His tutorial video on stream processing and real-time data pipelines discusses the building blocks of a stream processing pipeline and demonstrates how developers can write a full-blown streaming pipeline in less than a hundred lines of Java code for a variety of applications. Vladimir is also a lecturer with the Czechitas Foundation, whose mission is to inspire women and girls to explore the world of information technology. Czechitas Foundation teaches coding in various programming languages, software testing, and data analysis.

Mar 10, 2020

Machine Learning Inference at Scale

Machine learning projects can be split into two phases:

Training
Inference

During the training phase, data science teams obtain, analyze, and understand the available data and generalize it into a mathematical model. The model uses the features of the sample data to reason about data it has never seen. Although a model can be completely custom code, it is usually based on proven machine learning algorithms such as Naïve Bayes, K-Means, Linear Regression, Deep Learning, Random Forests, or Decision Trees. Building the model from the sample (training) data is referred to as training.

The machine learning inference phase refers to using the model to predict an unknown property of the input data. This requires deploying the model into a production environment and operating it.
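
To make the two phases concrete, here is a minimal sketch in plain Python. It assumes a toy one-feature linear regression and made-up sample data; the names `train` and `infer` are illustrative only and do not come from any particular library.

```python
from statistics import mean


def train(xs, ys):
    """Training: fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


def infer(model, x):
    """Inference: apply the trained model to an input it has never seen."""
    slope, intercept = model
    return slope * x + intercept


# Made-up sample (training) data that follows y = 2x + 1 exactly.
model = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = infer(model, 10.0)  # 10.0 was not part of the training data
```

Training runs once over the sample data, while `infer` is the part that has to be deployed and operated in production.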

Operating Machine Learning

The most straightforward way to deploy the model is to wrap it in a REST web service and let other applications remotely invoke the inference service. Many machine learning frameworks provide such a service out-of-the-box to support simple deployments that don’t deal with much data.
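
As a rough illustration of this request-reply style, the sketch below wraps a model in a small HTTP service using only the Python standard library. The `predict` function, its weights, the JSON layout, and the port are all assumptions made for the example, not part of any framework mentioned here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed weighted sum."""
    weights = [0.5, -1.0, 2.0]
    return sum(w * f for w, f in zip(weights, features))


def handle_request(body: bytes) -> bytes:
    """Decode a JSON request, run the model, encode a JSON response."""
    features = json.loads(body)["features"]
    return json.dumps({"prediction": predict(features)}).encode("utf-8")


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port: int = 8080) -> None:
    """Block forever, answering one inference request per HTTP call."""
    HTTPServer(("localhost", port), InferenceHandler).serve_forever()
```

Calling `serve()` starts the service. Every prediction costs one full HTTP round trip, which is the fixed per-request overhead that becomes noticeable as the data volume grows.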

What Hazelcast Jet adds to this story is a simple way to deploy the model so that it is automatically parallelized and scaled out across a cluster of machines.

Jet uses its parallel, distributed and resilient execution engine to turn the model into a high-performance inference service. To use all available CPU cores, Jet spins up multiple parallel instances of the model and spreads the inference requests among them. The Jet cluster is elastic; to scale with the workload, add or remove cluster members on the fly with no downtime.
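
The idea of spreading requests over several model instances can be sketched in a few lines of Python. This is only an illustration of the dispatch pattern, not Jet's implementation: `SquareModel` and `run_parallel` are hypothetical names, and the sketch uses threads within one process for brevity.

```python
import os
import queue
from concurrent.futures import ThreadPoolExecutor


class SquareModel:
    """Hypothetical stand-in for a loaded model; 'predicts' the square."""

    def predict(self, x):
        return x * x


def run_parallel(requests, parallelism=None):
    """Spread requests over `parallelism` independent model instances."""
    parallelism = parallelism or os.cpu_count() or 1
    # A pool of model instances: a worker borrows one, uses it, returns it,
    # so no instance ever serves two requests at the same time.
    instances = queue.Queue()
    for _ in range(parallelism):
        instances.put(SquareModel())

    def serve(request):
        model = instances.get()
        try:
            return model.predict(request)
        finally:
            instances.put(model)

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() returns results in the order of the requests.
        return list(pool.map(serve, requests))


parallel_results = run_parallel(range(6), parallelism=3)
```

In CPython, threads mostly illustrate the dispatch logic; CPU-bound models need separate processes to use several cores.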

Another trick is using a pipelined design instead of a request-reply pattern. It allows Jet to batch inference requests together and reduce the fixed overhead of serving each request individually, which significantly improves the overall throughput of the model. The pipelined design requires a change in the client’s workflow. Instead of calling the inference service directly, the client sends its inference request to an inbox. The inbox may be implemented using a message broker such as a JMS topic, Kafka, or a Hazelcast distributed topic. Jet watches the inbox and groups multiple requests together to use the model service efficiently. It uses smart batching, where the batch size changes with the data volume to keep latency low. The inference results are published to an outbox for callers to pick up.
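
One way to picture smart batching is the sketch below. It uses in-memory queues as a stand-in for the inbox and outbox, and assumes a simple policy: wait for the first request, then take whatever else is already waiting, up to a limit. The function names and the policy are illustrative assumptions, not Jet's actual algorithm.

```python
import queue


def model_predict_batch(batch):
    """Hypothetical batch-capable model: doubles every input in one call."""
    return [2 * x for x in batch]


def take_batch(inbox, max_batch_size):
    """Wait for one request, then grab whatever else is already queued."""
    batch = [inbox.get()]  # blocks until at least one request arrives
    while len(batch) < max_batch_size:
        try:
            batch.append(inbox.get_nowait())
        except queue.Empty:
            break  # nothing more waiting: do not delay the batch
    return batch


def serve_once(inbox, outbox, max_batch_size=4):
    """Run one inference round and return the size of the batch used."""
    batch = take_batch(inbox, max_batch_size)
    for result in model_predict_batch(batch):
        outbox.put(result)
    return len(batch)


inbox, outbox = queue.Queue(), queue.Queue()
for request in [1, 2, 3, 4, 5, 6]:
    inbox.put(request)

batch_sizes = [serve_once(inbox, outbox), serve_once(inbox, outbox)]
batched_results = [outbox.get() for _ in range(6)]
```

Under light load the batch shrinks to a single request, so nothing waits for a batch to fill up. Under heavy load the batch grows and the per-call overhead is shared among more requests.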

Models Supported

Python models

Python is the lingua franca of the data science world. There is a wide ecosystem of libraries and tools to build and train models in Python: TensorFlow, Keras, Theano, Scikit-learn or PyTorch to name a few. Jet can host any Python model.

Upon model deployment, Jet’s JVM runtime launches Python processes and establishes bi-directional gRPC communication channels to stream inference requests through them. The model therefore runs natively in a Python process that is completely managed by Jet. Jet can be tuned to spin up multiple Python processes on each machine to make use of multicore processors.

Jet makes sure that the Python code is distributed to all machines that participate in the cluster. If you add another machine to a Jet cluster, Jet creates a directory on it and deploys the Python code there. Moreover, Jet can install all required Python libraries to prepare the runtime for your Python model.
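
On the Python side, the handler is an ordinary function. The sketch below follows the contract described in the Jet manual as I understand it: a function, named `transform_list` by default, that receives a list of strings and returns a list of strings with exactly one result per input. The `predict` function is a hypothetical placeholder for a real model, and the string encoding of the numbers is an assumption of this example.

```python
def predict(value: float) -> float:
    """Hypothetical model; a real handler would call a trained model here."""
    return 3.0 * value + 1.0


def transform_list(input_list):
    """Map a batch of string-encoded requests to string-encoded results.

    The output has exactly one item per input item, in the same order.
    """
    return [str(predict(float(item))) for item in input_list]
```

Because the function receives a whole list, a vectorized library call can process the batch at once instead of looping over single items.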

Documentation: https://docs.hazelcast.org/docs/jet/latest/manual/#map-using-python

Code sample: https://github.com/hazelcast/hazelcast-jet/tree/master/examples/python

Java models

Java models are used for high-performance inference execution. Popular Java model libraries include JPMML, TensorFlow for Java, MXNet, XGBoost JVM Package, and H2O.

As with Python, the model is packaged as a Jet Job resource. The Job usually includes the model inference code (the ML library) and a serialized model. Jet runs Java models in-process with the cluster members, so there is no need to start extra processes and there is no communication overhead (serialization, deserialization, networking). This makes the Java model the best-performing option. The inference job can be configured to use one model instance per JVM or multiple model instances.

Documentation: https://docs.hazelcast.org/docs/jet/latest/manual/#machine-learning-model-prediction

Code samples:

Remote services

We started this article by saying that using a model as an RPC service is simple but requires extra effort when scaling. Jet supports this pattern, too. The Jet Job can invoke a remote inference service. The model isn’t managed by Jet in this case, so the operational and performance advantages are gone. Jet still provides the convenience of smart batching, inbox/outbox connectors and many pipeline operators. Smart batching works only if the RPC service can operate on batches of input items.
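
The batching caveat can be illustrated with a small sketch. Both service classes are hypothetical in-memory stand-ins for a remote RPC service, with a counter in place of real network calls; `infer_all` is an illustrative helper, not a Jet API.

```python
class PerItemService:
    """Hypothetical remote service that accepts one item per call."""

    def __init__(self):
        self.calls = 0

    def predict(self, item):
        self.calls += 1
        return item + 1


class BatchService(PerItemService):
    """Hypothetical remote service that also accepts a whole batch per call."""

    def predict_batch(self, items):
        self.calls += 1
        return [item + 1 for item in items]


def infer_all(service, items):
    """Use one call per batch when the service allows it, else one per item."""
    if hasattr(service, "predict_batch"):
        return service.predict_batch(items)
    return [service.predict(item) for item in items]


per_item, batched = PerItemService(), BatchService()
per_item_results = infer_all(per_item, [1, 2, 3])  # three remote calls
batch_results = infer_all(batched, [1, 2, 3])      # one remote call
```

Both services return the same predictions, but only the batch-capable one lets the pipeline amortize the cost of a remote call across the whole batch.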

Benefits of this setup

Isolating the model service and the data pipeline
Sharing the model among many Jet pipelines

Code samples:

Execution Mode Overview

Execution Mode	Java Model	Python Model	Remote Model
Model managed by Jet	Yes	Yes	No
Model shared between Jobs	No	No	Yes
Jet ↔ Model Communication	Shared memory	gRPC (processes colocated)	Usually gRPC or REST (processes on different machines)
Throughput (single node)	1M / sec	50k / sec	Depends
Requirements	Model runs in JVM	Python runtime installed on all cluster machines	Model service started prior to the Jet Job; model available as an RPC service

Framework Integration Overview

Framework	Execution Mode	Code Sample
H2O	Java	Code Sample
TensorFlow for Java	Java	Code Sample
Custom Java Model	Java	Code Sample
PMML	Java	N/A, use JPMML Evaluator as a Custom Java Model
MXNet	Java	N/A, use MXNet Java Inference API as a Custom Java Model
XGBoost	Java	N/A, use XGBoost JVM Package as a Custom Java Model
Keras, Theano, Scikit-learn or PyTorch	Python	N/A, use the Custom Python Model
Custom Python Model	Python	Code Sample
Remote gRPC service	Remote	Code Sample
Remote TensorFlow service	Remote	Code Sample
