Fast Batch Processing

Introduction

Distributed, in-memory, parallel batch processing for speed

In batch processing, a person or application regularly launches a processing job against a bounded input data set. Batch processing is often used for tasks such as ETL (extract-transform-load) for populating data warehouses, data mining, and analytics. Some of the most common functions of batch processing are filtering, joining, sorting, grouping, and aggregating data.

Traditionally, developers used specialized ETL tools operating against relational databases. Open source tools such as Hadoop and Spark made large-scale batch processing more mainstream. Such tools leverage parallel computation on distributed storage to process data efficiently. Hadoop uses an older processing paradigm called MapReduce, while Spark coordinates its processing with directed acyclic graphs (DAGs). Hazelcast also uses DAGs, but adds in-memory speed to complete the work much more quickly.

Connect to Your Existing World

Hazelcast treats batch processing as a specific type of stream processing with a finite source and no time windows. Since Hazelcast was built for high-speed processing, that same performance advantage can be applied to batched data. As a result, developers can use the same programming interface for both batch and stream processing, making the transition to streaming straightforward.
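
For example, here is a minimal sketch of that shared interface, using the in-memory test sources bundled with Hazelcast (the data and transforms are illustrative placeholders). Only the source stage, plus a timestamp policy for streaming, changes; the rest of the pipeline is written the same way:

    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class BatchVsStream {
        public static void main(String[] args) {
            // Batch: a bounded source, read once to completion.
            Pipeline batch = Pipeline.create();
            batch.readFrom(TestSources.items(1, 2, 3, 4, 5))
                 .map(n -> n * n)
                 .writeTo(Sinks.logger());

            // Streaming: swap in an unbounded source and a timestamp
            // policy; the transform and sink stages look identical.
            Pipeline stream = Pipeline.create();
            stream.readFrom(TestSources.itemStream(10)) // 10 events/second
                  .withIngestionTimestamps()
                  .map(event -> event.sequence())
                  .writeTo(Sinks.logger());
        }
    }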

Hazelcast runs fast, scales automatically, and handles failures itself without requiring any additional infrastructure. You can fully embed Hazelcast into applications such as data processing microservices, making it easier to build and maintain next-generation systems. Or you can launch each Hazelcast processing job within its own cluster to maximize service isolation.
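
A minimal sketch of the embedded model, assuming Hazelcast 5.x (the pipeline is a trivial placeholder): the service starts a member inside its own JVM and submits a job to it directly, with no separate cluster to install or operate.

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class EmbeddedJob {
        public static void main(String[] args) {
            // Enable the Jet engine (off by default in Hazelcast 5.x)
            // and start a member inside this JVM.
            Config config = new Config();
            config.getJetConfig().setEnabled(true);
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

            Pipeline p = Pipeline.create();
            p.readFrom(TestSources.items("a", "b", "c")) // placeholder input
             .writeTo(Sinks.logger());

            // Submit the job and block until the bounded input is exhausted.
            hz.getJet().newJob(p).join();
            hz.shutdown();
        }
    }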

Contrast Hazelcast with other popular processing technologies. Hadoop and Spark, for example, are built from many components, which makes them heavyweight to install and complex to deploy and manage. Developers must select the right modules and maintain their dependencies, creating both development and operational challenges.

Solution: The Hazelcast Platform

Hazelcast significantly accelerates batch processing compared to other processing frameworks. Our benchmarks show that Hazelcast delivers extreme speed with extreme efficiency, processing 1 billion events per second on far fewer hardware resources than other technologies. Hazelcast achieves this performance through the combination of a highly optimized directed acyclic graph (DAG) computation model, in-memory processing, data locality, partition mapping affinity, queues, and green threads.
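
To see the DAG model at work: every pipeline compiles down to a DAG of processing vertices before execution, and a minimal sketch (with a placeholder pipeline) can print that graph for inspection:

    import com.hazelcast.jet.core.DAG;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class InspectDag {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();
            p.readFrom(TestSources.items(1, 2, 3)) // placeholder input
             .map(n -> n + 1)
             .writeTo(Sinks.logger());

            // Each pipeline stage becomes one or more vertices, which the
            // engine runs in parallel across partitions and members.
            DAG dag = p.toDag();
            System.out.println(dag.toDotString()); // GraphViz description
        }
    }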

High-level declarative Java API

The Hazelcast Pipeline API is a general-purpose, declarative API that lets developers compose fast, distributed, concurrent batch processing jobs from building blocks such as mappers, reducers, filters, aggregators, and joiners. It is simple to understand, yet powerful. For expert users, Hazelcast also provides the Core API, an edge- and vertex-level API for fine-grained control of your data pipelines.
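
As a minimal sketch of those building blocks working together, here is a batch word count over an in-memory test source (the input lines are placeholders for a real data set):

    import com.hazelcast.jet.Traversers;
    import com.hazelcast.jet.aggregate.AggregateOperations;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class WordCount {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();
            p.readFrom(TestSources.items("the quick brown fox", "the lazy dog"))
             // mapper: split each line into lower-case words
             .flatMap(line -> Traversers.traverseArray(line.toLowerCase().split("\\W+")))
             // filter: drop empty tokens
             .filter(word -> !word.isEmpty())
             // grouping + aggregator: count occurrences of each word
             .groupingKey(word -> word)
             .aggregate(AggregateOperations.counting())
             .writeTo(Sinks.logger()); // emits entries such as the=2
        }
    }

Swapping TestSources.items for a connector such as Sources.files or Sources.jdbc turns this into a production job; the transform stages stay the same.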