Use Case

Fast Batch Processing

Overview

Distributed, in-memory, parallel batch processing for speed

In batch processing, a person or application regularly launches a processing job against a bounded input data set. Batch processing is often used for tasks such as ETL (extract-transform-load) for populating data warehouses, data mining, and analytics. Some of the most common functions of batch processing are filtering, joining, sorting, grouping, and aggregating data.

Traditionally, developers used specialized ETL tools operating against relational databases. Now, however, it is quite common to see generic open source tools such as Hadoop and Spark used for ETL. Such tools leverage parallel computation against distributed storage, which can deliver very high performance for batch processing jobs such as ETL workloads.

Connect to your existing world

In batch processing, the complete data set is assembled and available before a job is submitted for processing. Hazelcast treats batch processing as a specific type of stream processing with a finite source and no windows. As a result, developers can use the same programming interface for both batch and stream processing, making the transition to streaming straightforward.
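To make this concrete, here is a minimal batch word-count sketch using the Pipeline API. It assumes Hazelcast 5.x, a cluster with the Jet engine enabled, and an IList named "lines" that has already been populated; the same stage types apply to a streaming source.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.Traversers;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class BatchWordCount {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.bootstrappedInstance();

        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String>list("lines"))      // finite (batch) source
         .flatMap(line -> Traversers.traverseArray(line.toLowerCase().split("\\W+")))
         .filter(word -> !word.isEmpty())
         .groupingKey(word -> word)
         .aggregate(AggregateOperations.counting())    // no windows needed for a bounded input
         .writeTo(Sinks.map("counts"));

        hz.getJet().newJob(p).join();  // join() returns once the bounded input is exhausted
        hz.shutdown();
    }
}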

Hazelcast is a single 15MB Java library with no external dependencies. It runs fast, scales automatically, and handles failures itself without requiring any additional infrastructure. You can fully embed Hazelcast into applications such as data processing microservices, making it easier to build and maintain next-generation systems. Or you can launch each Hazelcast processing job within its own cluster to maximize service isolation.
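As an illustrative sketch of embedding (assuming Hazelcast 5.x, where the Jet engine is disabled by default), starting a node inside a service takes a few lines:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class EmbeddedService {
    public static void main(String[] args) {
        Config config = new Config();
        config.getJetConfig().setEnabled(true);  // the Jet engine is off by default in 5.x

        // The node runs in-process with the service and discovers peers
        // to form a cluster; there is no separate installation to operate.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

        // ... submit processing jobs through hz.getJet() ...

        hz.shutdown();  // leave the cluster when the service stops
    }
}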

Contrast Hazelcast with other popular processing technologies. Hadoop and Spark, for example, comprise many components that require heavyweight installation and ongoing maintenance, making them complex to deploy and manage. Developers must select the right modules and maintain their dependencies, creating both development and operational challenges.

Solutions

A choice of three APIs

Hazelcast accelerates batch processing by up to 15x compared to Spark or Flink, and it outperforms Hadoop by orders of magnitude (see the complete benchmark). Hazelcast achieves this performance through the combination of a directed acyclic graph (DAG) computation model, in-memory processing, data locality, partition-mapping affinity, single-producer/single-consumer (SPSC) queues, and green threads.

Hazelcast source and sink adapters make it easy to insert Hazelcast into the data processing pipeline. Hazelcast includes pre-built connectors for Hazelcast IMDG (specifically the Map, Cache, and List objects), the Hadoop Distributed File System (HDFS), JDBC, and local data files (e.g., CSV, logs, or Avro files).
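For example, here is a sketch of a pipeline that reads local files through the pre-built file connector and writes to an IMap sink; the directory path and the two-column key,value layout are illustrative assumptions:

import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

import java.util.Map;

public class FileIngest {
    public static Pipeline build() {
        Pipeline p = Pipeline.create();
        p.readFrom(Sources.files("/var/data/input"))  // emits each line of each file
         .map(line -> {
             String[] cols = line.split(",", 2);      // hypothetical key,value layout
             return Map.entry(cols[0], cols[1]);
         })
         .writeTo(Sinks.map("records"));              // pre-built IMap sink
        return p;
    }
}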

When a Hazelcast cluster leverages its in-memory data, or is colocated with data stores such as HDFS, it takes advantage of data locality: each node reads only from its local partitions, which makes reads highly efficient. You can also create your own connectors to integrate with databases or enterprise applications.
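A custom batch connector can be sketched with the SourceBuilder API; the JDBC URL and query below are placeholders for whatever system you integrate with:

import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.SourceBuilder;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CustomSources {
    // A custom bounded source; connection details are illustrative placeholders.
    public static BatchSource<String> legacyAppSource(String url, String query) {
        return SourceBuilder
                .batch("legacy-app", ctx -> DriverManager.getConnection(url))
                .<String>fillBufferFn((conn, buf) -> {
                    try (Statement stmt = conn.createStatement();
                         ResultSet rs = stmt.executeQuery(query)) {
                        while (rs.next()) {
                            buf.add(rs.getString(1));
                        }
                    }
                    buf.close();  // signal that this bounded source is exhausted
                })
                .destroyFn(Connection::close)
                .build();
    }
}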

High-level declarative Java API

The Hazelcast Pipeline API is a general-purpose, declarative API that lets developers compose fast, distributed, concurrent batch processing jobs from building blocks such as mappers, reducers, filters, aggregators, and joiners. It is simple to understand yet powerful. For expert users, Hazelcast also provides the Core API, an edge- and vertex-level API for fine-grained control of your data pipelines.
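As a sketch of composing those building blocks, the following joins two assumed IMaps, "orders" (orderId to customerId) and "customers" (customerId to name), using the Pipeline API's hash join:

import com.hazelcast.jet.pipeline.BatchStage;
import com.hazelcast.jet.pipeline.JoinClause;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

import java.util.Map.Entry;

public class OrderReport {
    public static Pipeline build() {
        Pipeline p = Pipeline.create();

        BatchStage<Entry<Long, String>> orders =
                p.readFrom(Sources.<Long, String>map("orders"));
        BatchStage<Entry<String, String>> customers =
                p.readFrom(Sources.<String, String>map("customers"));

        orders.hashJoin(customers,
                        JoinClause.joinMapEntries(Entry::getValue),  // join on customerId
                        (order, name) -> "order " + order.getKey() + " placed by " + name)
              .writeTo(Sinks.list("report"));
        return p;
    }
}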
