Micro-Batch Processing

Micro-batch processing is the practice of collecting data in small groups (“batches”) and then acting on (processing) each group. Contrast this with traditional “batch processing,” which usually implies acting on one large group of data. Micro-batch processing is a variant of traditional batch processing in which processing runs more frequently, so each run handles a smaller group of new data. In both micro-batch processing and traditional batch processing, data is collected until a predetermined threshold or interval is reached before any processing occurs.

Micro-Batch Processing Example

In the traditional world of data, workloads were predominantly batch-oriented. That is, we collected data in groups, then cleansed and transformed the data before loading it to a repository such as a data warehouse to feed canned reports on a daily, weekly, monthly, quarterly, and annual basis. This cycle was sufficient when most companies predominantly interacted with customers in the physical world.

As the world has become more digital and “always-on,” and our interactions continuously generate vast volumes of data, enterprises have adopted new technological approaches to build and maintain a competitive advantage through more immediate, personalized interactions. The first of these approaches was micro-batch processing, followed by stream processing.

How does micro-batch processing work?

Micro-batch processing accelerated the cycle so data could be loaded much more frequently, sometimes in increments as small as seconds. Micro-batch loading technologies include Fluentd, Logstash, and Apache Spark Streaming.

Micro-batch processing is very similar to traditional batch processing in that data is usually processed as a group. The primary difference is that the batches are smaller and processed more often. A micro-batch may be triggered by time: for example, you could load all new data every two minutes (or every two seconds, depending on the processing horsepower available). Or a micro-batch may be triggered by an event flag or threshold: for example, when the accumulated data exceeds 2 megabytes or contains more than a dozen records.
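The two kinds of trigger described above can be sketched in a few lines of Python. This is a minimal illustration, not how Fluentd, Logstash, or Spark Streaming implement batching. The class name MicroBatcher and the parameters max_records and max_age_seconds are hypothetical, and the time trigger is only checked when add() or flush_if_due() is called, whereas a production system would typically run a background timer.

```python
import time
from typing import Any, Callable, List, Optional


class MicroBatcher:
    """Buffers records and hands them to `process` as one batch
    when either a size threshold or an age threshold is reached.
    (Hypothetical helper for illustration only.)"""

    def __init__(
        self,
        process: Callable[[List[Any]], None],
        max_records: int = 12,
        max_age_seconds: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.process = process
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self.buffer: List[Any] = []
        self.opened_at: Optional[float] = None

    def add(self, record: Any) -> None:
        # The age of a batch is measured from its first record.
        if not self.buffer:
            self.opened_at = self.clock()
        self.buffer.append(record)
        self.flush_if_due()

    def flush_if_due(self) -> None:
        if not self.buffer:
            return
        too_many = len(self.buffer) >= self.max_records
        too_old = self.clock() - self.opened_at >= self.max_age_seconds
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        # Swap the buffer out before processing so new records start a new batch.
        batch, self.buffer = self.buffer, []
        self.opened_at = None
        if batch:
            self.process(batch)
```

For example, MicroBatcher(print, max_records=3) prints a list of three records on the third call to add(). Nothing is processed before a threshold is crossed, which is the defining property of batch-style processing.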

Though it is not true real-time processing, micro-batch processing initially sufficed to support some “real-time” use cases in which data does not need to be up-to-the-moment accurate. For example, you may have corporate dashboards that refresh every 15 minutes or hourly. Or you may collect server logs in small regular intervals for historical record-keeping rather than for true real-time use cases such as intrusion detection.

Micro-batch processing vs stream processing

The world has accelerated, and there are many use cases for which micro-batch processing is simply not fast enough. Organizations now typically only use micro-batch processing in their applications if they have made architectural decisions that preclude stream processing. For example, an Apache Spark shop may use Spark Streaming, which is – despite its name and use of in-memory compute resources – actually a micro-batch processing extension of the Spark API.

Stream processing technologies are now becoming the go-to choice for modern applications. As data has accelerated over the past decade, enterprises have turned to real-time processing so they can respond to data closer to the moment it is created, across a variety of use cases and applications. In some cases, micro-batch processing was “real-time enough,” but organizations have increasingly recognized that stream processing built on in-memory technology, whether in the cloud or on-premises, is the ideal solution.

Stream processing allows applications to respond to new data events at the moment they are created. Rather than grouping data and collecting it at some predetermined interval, stream processing systems collect and process each piece of data immediately as it is generated.
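A small Python sketch can make the per-event model concrete. It is a single-threaded illustration only; real stream processing engines distribute this work across threads and machines. The function names event_source and process_stream are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def event_source(values: Iterable[int], log: List[str]) -> Iterator[int]:
    """Hypothetical source that yields events one at a time as they are 'generated'."""
    for value in values:
        log.append(f"generated {value}")
        yield value


def process_stream(events: Iterable[int], handle: Callable[[int], None]) -> None:
    """Handles each event as soon as it arrives, with no buffering or threshold."""
    for event in events:
        handle(event)


log: List[str] = []
process_stream(event_source([1, 2, 3], log), lambda e: log.append(f"processed {e}"))
# log is now:
# ["generated 1", "processed 1", "generated 2", "processed 2",
#  "generated 3", "processed 3"]
```

The log shows each event being processed before the next one is generated, in contrast to a batch approach, which would first record all three "generated" entries and only then process them.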

Stream Processing Example Use Cases

Even the most aggressive developers measure the performance of micro-batch processing technologies in tens of milliseconds at best. For example, the recommended lower end of Spark Streaming batch intervals is 50 milliseconds because of associated overhead costs. Developers measure the performance of stream processing, by contrast, in single-digit milliseconds. For example, in a blog post on streaming technology for retail banks, the author explains how a major credit card processor executes a core data collection and processing workflow for fraud detection in 7 milliseconds.

Applications that incorporate stream processing can respond to new data immediately as it is generated rather than after data crosses a predefined threshold. This is a critical capability for use cases such as payment processing, fraud detection, IoT anomaly detection, security, alerting, real-time marketing, and recommendations. If your use case either includes data that is most valuable at the time it is created or requires a response (i.e., data collection and processing) at the time the data is generated, stream processing is right for you.

Consider what happens when you swipe your credit card at the checkout line in your local grocery store. Your expectation, or at least your hope, is that the payment processing system will immediately confirm that the card has not been reported lost or stolen and then authorize the transaction. Most of the time you spend waiting, the data is simply in transit across a network and back. This requires stream processing. Now, imagine you are the last customer in the store and you swipe your card, but the payment system is built on a micro-batch processing system with batches set to process when there are 10 new transactions. Your transaction would sit in an incomplete batch until nine more arrived, so you would wait longer than necessary.
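The checkout scenario can be put into numbers with a short Python sketch. It assumes a purely count-triggered batch with no timeout and ignores network and processing time. The function name micro_batch_waits and the arrival times are made up for illustration.

```python
from typing import List, Optional


def micro_batch_waits(
    arrival_times: List[float], batch_size: int
) -> List[Optional[float]]:
    """For each transaction, return how long it waits before its batch fires.

    A batch fires when its last (batch_size-th) transaction arrives.
    Transactions in an incomplete final batch get None: they are still waiting.
    """
    waits: List[Optional[float]] = []
    for start in range(0, len(arrival_times), batch_size):
        batch = arrival_times[start:start + batch_size]
        if len(batch) < batch_size:
            waits.extend([None] * len(batch))
        else:
            fired_at = batch[-1]
            waits.extend(fired_at - t for t in batch)
    return waits


# Ten swipes one second apart, then one last customer at t = 60 seconds.
arrivals = [float(i) for i in range(10)] + [60.0]
waits = micro_batch_waits(arrivals, batch_size=10)
# waits[0] is 9.0 (the first customer waits for nine more swipes),
# waits[9] is 0.0, and waits[10] is None (the last customer is still waiting).
```

With per-event stream processing, every entry in this list would be zero, because each transaction is handled on arrival rather than when the batch fills.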

Stream processing technologies such as Hazelcast Jet with Hazelcast IMDG (also available as a managed service, Hazelcast Cloud) enable you to collect, process, and analyze data in real time, at scale, as it is generated.

Related Topics

Stream Processing

Real-time Stream Processing

Event Stream Processing

Directed Acyclic Graph (DAG)

Further Reading

The Role of Streaming Technology in Retail Banking

Relevant Resources

Video, 60 minutes

Should You Adopt a Streaming-First Architecture?

Traditional data warehousing has long used batch jobs to move, load, and transform data for decision making. But as data volumes rise and the velocity of business grows, more organizations are opting to move and process data in real time or near real time. Batch processing is giving way to mini-batches fueled by replication and change data capture, as well as to stream processing, in which events are captured, processed, and analyzed as they happen.

Many companies today have a mix of batch, mini-batch, and stream-based processing. The question is whether organizations should embrace streaming as the default mode of data acquisition. Several vendors are now pitching streaming-first architectures and extolling the benefits of processing data in real time. This webinar will explore the pros and cons of a streaming-first architecture and examine industry trends in its adoption.

Video, 60 minutes

Enriching Data Streams with Hazelcast Jet

Enrichment is a frequent technical use case in stream processing. It is a translation of the traditional star schema into the low-latency continuous processing world: the stream of facts is enriched using slowly changing dimension data.

In this webinar you will learn how to do high-performance stream enrichment. We’ll discuss multiple ways of enrichment, explaining the trade-offs. We will feature hands-on examples and live coding using Hazelcast Jet 0.7.
