Use Case: Fast Batch Processing

Overview

Distributed, in-memory, parallel batch processing for speed

In batch processing, a person or application regularly launches a processing job against a bounded, input data set. Batch processing is often used for tasks such as ETL (extract-transform-load) for populating data warehouses, data mining, and analytics. Some of the most common functions of batch processing are filtering, joining, sorting, grouping, and aggregating data.
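These common batch functions map directly onto familiar collection operations. A minimal, Jet-independent sketch in plain Java (the `Sale` record and sample data are illustrative) shows filtering, grouping, and aggregating over a bounded input set:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class BatchFunctions {
    // A bounded input record: one (category, amount) sales entry.
    record Sale(String category, int amount) {}

    // Filter, group, and aggregate a bounded data set:
    // total amount per category, keeping only sales of 8 or more.
    static Map<String, Integer> totals(List<Sale> sales) {
        return sales.stream()
                .filter(s -> s.amount() >= 8)                    // filtering
                .collect(Collectors.groupingBy(Sale::category,   // grouping
                        TreeMap::new,
                        Collectors.summingInt(Sale::amount)));   // aggregating
    }

    public static void main(String[] args) {
        List<Sale> sales = List.of(
                new Sale("books", 12), new Sale("games", 30),
                new Sale("books", 8), new Sale("games", 5));
        System.out.println(totals(sales)); // prints {books=20, games=30}
    }
}
```

A distributed engine applies the same building blocks, but partitions the data across nodes and runs each stage in parallel.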

Traditionally, developers used specialized ETL tools operating against relational databases. Now, however, it is quite common to see generic open source tools such as Hadoop and Spark used for ETL. Such tools leverage parallel computation against distributed storage, which can offer very high performance for batch processing jobs such as ETL workloads.

Connect to your existing world

In batch processing, the complete data set is assembled and available before a job is submitted for processing. Hazelcast Jet treats batch processing as a specific type of stream processing with a finite source and no windows. As a result, developers can use the same programming interface for both batch and stream processing, making the transition to streaming straightforward.

Hazelcast Jet is a single 15MB Java library with no external dependencies. It runs fast, scales automatically, and handles failures itself without requiring any additional infrastructure. You can fully embed Hazelcast Jet into applications such as data processing microservices, making it easier to build and maintain next-generation systems. Or you can launch each Hazelcast Jet processing job within its own cluster to maximize service isolation.

Contrast Hazelcast Jet with other popular processing technologies. Hadoop and Spark, for example, consist of many components, making them heavyweight to install and complex to deploy and manage. Developers must select the right modules and maintain their dependencies, which creates both development and operational challenges.

A choice of three APIs

Hazelcast Jet accelerates batch processing by up to 15x compared to Spark or Flink, and outperforms Hadoop by orders of magnitude (see the complete benchmark). It achieves this performance through a combination of a directed acyclic graph (DAG) computation model, in-memory processing, data locality, partition mapping affinity, single-producer/single-consumer (SPSC) queues, and green threads (cooperative multithreading).

In-Memory Real-Time Processing with Hazelcast Jet

Hazelcast Jet source and sink adapters make it easy to insert Hazelcast Jet into the data processing pipeline. Hazelcast Jet includes pre-built connectors for Hazelcast IMDG (specifically for the Map, Cache, and List objects), Hadoop Distributed File System, JDBC, and local data files (e.g., CSV, logs, or Avro files).

When a Hazelcast Jet cluster is co-located with Hazelcast IMDG or HDFS, it takes advantage of data locality: each node reads only from its own local partitions, so data is processed where it is stored. You can also create your own connectors to integrate with databases or enterprise applications.
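A custom connector can be built with Jet's `SourceBuilder`. The sketch below (assuming a Jet 3.x/4.x-era API; the HTTP endpoint is hypothetical) defines a batch source that emits each line fetched from a URL:

```java
import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.SourceBuilder;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CustomSource {
    // A batch source reading all lines from a (hypothetical) HTTP endpoint.
    static BatchSource<String> httpLines(String url) {
        return SourceBuilder
                // createFn: open the connection once per processor
                .batch("http-lines", ctx ->
                        new BufferedReader(new InputStreamReader(
                                new URL(url).openStream())))
                // fillBufferFn: emit items until the input is exhausted
                .<String>fillBufferFn((reader, buf) -> {
                    String line = reader.readLine();
                    if (line != null) {
                        buf.add(line);   // emit one item
                    } else {
                        buf.close();     // finite source: the batch job can complete
                    }
                })
                // destroyFn: release the resource when the job finishes
                .destroyFn(BufferedReader::close)
                .build();
    }
}
```

Closing the buffer is what marks the source as finite, which is what lets Jet run the pipeline as a batch job rather than an open-ended stream.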

High-level declarative Java API

The Hazelcast Jet Pipeline API is a general-purpose, declarative API for composing fast, distributed, concurrent batch processing jobs from building blocks such as mappers, reducers, filters, aggregators, and joiners. It is simple to understand, yet powerful. For expert users, Hazelcast Jet also provides the Core API, an edge- and vertex-level API for fine-grained control over data pipelines.
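A minimal Pipeline API sketch, assuming the Jet 4.x API (`readFrom`/`writeTo`; earlier versions used `drawFrom`/`drainTo`): the list and map names are illustrative. Because the source is a finite list, the job runs as a batch job and completes:

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Traversers;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class WordCountJob {
    public static void main(String[] args) {
        // Compose the job from building blocks: flat-mapper, filter,
        // grouping key, and aggregator.
        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String>list("lines"))      // finite source => batch
         .flatMap(line -> Traversers.traverseArray(
                 line.toLowerCase().split("\\W+")))
         .filter(word -> !word.isEmpty())
         .groupingKey(word -> word)
         .aggregate(AggregateOperations.counting())
         .writeTo(Sinks.map("counts"));                // IMap sink

        JetInstance jet = Jet.newJetInstance();        // embedded Jet node
        try {
            jet.getList("lines").add("batch processing with hazelcast jet");
            jet.newJob(p).join();                      // bounded job completes
            System.out.println(jet.getMap("counts").get("jet"));
        } finally {
            Jet.shutdownAll();
        }
    }
}
```

The same pipeline becomes a streaming job simply by swapping in an unbounded source, which is what makes the batch-to-streaming transition straightforward.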

Free Hazelcast Online Training Center

Whether you're interested in learning the basics of in-memory systems, or you're looking for advanced, real-world production examples and best practices, we've got you covered.