What is Stream Processing?

Stream processing is the practice of taking action on a series of data at the time the data is created. Historically, data practitioners used “real-time processing” to talk generally about data that was processed as frequently as necessary for a particular use case. But with the advent and adoption of stream processing technologies and frameworks, coupled with decreasing prices for RAM, “stream processing” is used in a more specific manner.

Stream processing often entails multiple tasks on the incoming series of data (the “data stream”), which can be performed serially, in parallel, or both. This workflow is referred to as a stream processing pipeline, which includes the generation of the stream data, the processing of the data, and the delivery of the data to a final location.

Actions that stream processing takes on data include aggregations (e.g., calculations such as sum, mean, standard deviation), analytics (e.g., predicting a future event based on patterns in the data), transformations (e.g., changing a number into a date format), enrichment (e.g., combining the data point with other data sources to create more context and meaning), and ingestion (e.g., inserting the data into a database).
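As a rough illustration of how these actions can be chained serially in a pipeline, the following minimal Python sketch passes each event through a transformation, an enrichment, and a running aggregation before ingesting it into a sink. The Reading type, the SENSOR_LOCATIONS lookup table, and all function names are hypothetical and chosen only for this example; a real stream processing framework would supply its own equivalents.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator


@dataclass
class Reading:
    """Hypothetical raw event produced by a sensor."""
    sensor_id: str
    epoch_seconds: int
    value: float


# Hypothetical reference data used for enrichment.
SENSOR_LOCATIONS = {"s1": "line-A", "s2": "line-B"}


def transform(readings: Iterable[Reading]) -> Iterator[dict]:
    """Transformation: change the numeric timestamp into a date-time."""
    for r in readings:
        yield {
            "sensor_id": r.sensor_id,
            "time": datetime.fromtimestamp(r.epoch_seconds, tz=timezone.utc),
            "value": r.value,
        }


def enrich(events: Iterable[dict]) -> Iterator[dict]:
    """Enrichment: combine each event with another data source."""
    for e in events:
        yield {**e, "location": SENSOR_LOCATIONS.get(e["sensor_id"], "unknown")}


def aggregate(events: Iterable[dict]) -> Iterator[dict]:
    """Aggregation: attach a running sum and mean, updated per event."""
    count, total = 0, 0.0
    for e in events:
        count += 1
        total += e["value"]
        yield {**e, "running_sum": total, "running_mean": total / count}


def run_pipeline(readings: Iterable[Reading], sink: list) -> None:
    """Chain the stages and ingest each result into the sink as it arrives."""
    for event in aggregate(enrich(transform(readings))):
        sink.append(event)  # ingestion (a list stands in for a database)
```

Because each stage is a generator, an event flows through every stage before the next event is read, which mirrors the event-at-a-time behavior described above. Running stages in parallel is out of scope for this sketch.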

Figure: Input data enters the stream processing engine, which then delivers its output to the application. In this simplified example, the input data pipeline is processed by the stream processing engine in real time. The output data is delivered to a streaming analytics application and added to the output stream.

Stream processing vs batch processing

Historically, data was typically processed in batches based on a schedule or some predefined threshold (e.g., every night at 1 am, every hundred rows, or every time the volume reaches two megabytes). But the pace of data has accelerated and volumes have ballooned, and there are many use cases for which batch processing simply doesn’t cut it.

Stream processing has become a must-have for modern applications. Enterprises have turned to technologies that respond to data at the time at which it is created for a variety of use cases and applications, examples of which we’ll cover below.

Stream processing allows applications to respond to new data events at the moment they occur. Rather than collecting data and processing it in groups at some predetermined interval, as batch processing does, stream processing applications collect and process data immediately as it is generated.
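The difference can be sketched in a few lines of Python. This is a simplified illustration, not how any particular engine is implemented: batch_totals and stream_totals are hypothetical names, and the batch threshold is a row count, as in the "every hundred rows" example above.

```python
from typing import Iterable, Iterator, List


def batch_totals(events: Iterable[float], batch_size: int = 100) -> Iterator[float]:
    """Batch style: buffer events and emit one total per full batch."""
    buffer: List[float] = []
    for value in events:
        buffer.append(value)
        if len(buffer) == batch_size:
            yield sum(buffer)
            buffer.clear()
    if buffer:  # flush the final, partially filled batch
        yield sum(buffer)


def stream_totals(events: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated running total after every event."""
    total = 0.0
    for value in events:
        total += value
        yield total
```

With the batch version, a result only becomes available once the threshold is reached, so the first events in a batch wait the longest. The stream version produces an up-to-date result after each event.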

How does it work?

Stream processing is most often applied to data that is generated as a series of events, such as data from IoT sensors, payment processing systems, and server and application logs. Common paradigms include publisher/subscriber (commonly referred to as pub/sub) and source/sink. Data and events are generated by a publisher or source and delivered to a stream processing application, where the data may be augmented, tested against fraud detection algorithms, or otherwise transformed, before the application sends the result to a subscriber or sink. On the technical side, common sources and sinks include Apache Kafka®, big data repositories such as Hadoop, TCP sockets, and in-memory data grids such as Hazelcast IMDG.
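The pub/sub flow described above can be sketched with a toy in-process broker. This is only an illustration under stated assumptions: the Broker class, the topic names, and the simple amount-limit check are all hypothetical stand-ins, and a production system would use a real messaging layer and a real fraud detection algorithm.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class Broker:
    """Toy in-process pub/sub broker (a stand-in for a real message system)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every event on the topic."""
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: dict) -> None:
        """Deliver the event to every subscriber of the topic, immediately."""
        for handler in list(self._subscribers.get(topic, [])):
            handler(event)


def make_processor(broker: Broker, out_topic: str, limit: float) -> Handler:
    """Build a processing step that augments each event and republishes it.

    The limit check is a placeholder for a fraud detection algorithm.
    """
    def process(event: dict) -> None:
        checked = {**event, "suspicious": event["amount"] > limit}
        broker.publish(out_topic, checked)

    return process


def run_example() -> List[dict]:
    """Wire publisher -> processor -> subscriber and send two events."""
    broker = Broker()
    received: List[dict] = []
    broker.subscribe("payments.checked", received.append)  # subscriber / sink
    broker.subscribe(
        "payments.raw", make_processor(broker, "payments.checked", limit=1000.0)
    )
    broker.publish("payments.raw", {"card": "c-1", "amount": 25.0})
    broker.publish("payments.raw", {"card": "c-2", "amount": 5000.0})
    return received
```

The publisher never calls the processor or the subscriber directly; it only knows the topic name. That decoupling is what lets sources and sinks be swapped without changing the processing step.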

Example use cases

Use cases typically involve event data that is generated by some action and upon which some action should immediately occur. Common use cases for real-time stream processing include:

  • Real-time fraud and anomaly detection. One of the world’s largest credit card providers has been able to reduce their fraud write-downs by $800M per year, thanks to fraud and anomaly detection powered by stream processing. Credit card processing delays are detrimental to the experience of both the end customer and the store attempting to process the credit card (and any other customers in line). Historically, credit card providers performed their time-consuming fraud detection processes in a batch manner post-transaction. With stream processing, as soon as you swipe your card, they are able to run thorough algorithms to recognize and block fraudulent charges and trigger alerts for anomalous charges that merit additional inspection, without making their (non-fraudulent) customers wait.
  • Internet of Things (IoT) edge analytics. Companies in manufacturing, oil and gas, and transportation, as well as those architecting smart cities and smart buildings, leverage stream processing to keep up with data from billions of “things.” An example of IoT data analysis is detecting anomalies in manufacturing that indicate problems that must be fixed to improve operations and increase yields. With real-time stream processing, a manufacturer can recognize that a production line is turning out too many anomalies while it is happening (as opposed to finding an entire bad batch after the day’s shift). The manufacturer can realize huge savings and prevent massive waste by pausing the line for immediate repairs.
  • Real-time personalization, marketing, and advertising. With real-time stream processing, companies can deliver personalized, contextual experiences for their customers. This can include a discount for something you added to a cart on a website but didn’t immediately purchase, a recommendation to connect with a just-registered friend on a social media site, or an advertisement for a product similar to the one you just viewed.
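To make the anomaly detection use cases above more concrete, here is a minimal Python sketch that flags a value as soon as it arrives if it deviates strongly from a rolling baseline. The rolling window, the standard-deviation threshold, and the function name detect_anomalies are assumptions made for illustration; real fraud and IoT systems use far more sophisticated models.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable, Iterator, Tuple


def detect_anomalies(
    values: Iterable[float], window: int = 5, threshold: float = 3.0
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings far from the rolling baseline.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the last `window` normal readings.
    """
    history: deque = deque(maxlen=window)
    for index, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield index, value
                continue  # keep the outlier out of the baseline
        history.append(value)
```

Because the check runs per event, a consumer of this generator could pause a production line or hold a transaction at the moment the anomalous reading appears, rather than discovering it in a later batch run.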

Related Topics

Event Stream Processing

Real-time Stream Processing

Micro-batch Processing

Further Reading

Why Is Stream Processing Important to Your Business?

Hazelcast Jet Stream Processing Framework



Relevant Resources

A Reference Guide to Stream Processing

This paper is intended for software architects and developers who are planning or building systems that use stream processing, fast batch processing, data processing microservices, or distributed java.util.stream. While quite simple and robust, the batching approach clearly introduces a large latency between gathering the data and being ready to act upon it. The goal of stream processing is to overcome this latency. It processes the live, raw data immediately as it arrives and meets the challenges of incremental processing, scalability, and fault tolerance.