Tech Talk: Machine Learning at Scale Using Distributed Stream Processing
The capabilities of machine learning are by now well understood, and there are excellent tools for doing data science and building models that answer nontrivial questions about your data. These tools are used mostly from Python.
The key remaining challenge is making the trained prediction model usable in real time, while the user is interacting with your software. Getting answers from an ML model (this is called inference) is CPU-intensive and must often happen at significant scale. The ML tools, however, are optimized mainly for batch-processing large amounts of data at once, and their implementations are often not parallelized.
In this talk, I will show one approach that lets you write a low-latency, auto-parallelized, distributed stream processing pipeline in Java which seamlessly integrates a data scientist's work, taken almost unchanged from their Python development environment.
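To give a flavor of what such a pipeline stage looks like, here is a minimal toy sketch in plain Java. It is not the framework demonstrated in the talk: the `score` function is a hypothetical stand-in for invoking the trained Python model, and a fixed thread pool stands in for the auto-parallelized stages a real stream processing engine would provide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelInference {

    // Hypothetical stand-in for calling the trained model;
    // real inference would happen here instead.
    static double score(double feature) {
        return feature * 2.0;
    }

    // Runs the inference stage over a batch of events in parallel,
    // preserving input order in the results.
    public static List<Double> inferAll(List<Double> events, int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (Double e : events) {
                futures.add(pool.submit(() -> score(e)));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(inferAll(List.of(1.0, 2.0, 3.0), 4));
    }
}
```

A real engine additionally handles distribution across machines, backpressure, and fault tolerance, which is what makes the integration with the Python model nontrivial.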
The talk includes a live demo on the command line and a walk through some Python and Java code snippets.