Distributed, in-memory, parallel batch processing for speed
In batch processing, a person or application regularly launches a processing job against a bounded input data set. Batch processing is often used for tasks such as ETL (extract-transform-load) for populating data warehouses, data mining, and analytics. Some of the most common functions of batch processing are filtering, joining, sorting, grouping, and aggregating data.
Traditionally, developers used specialized ETL tools operating against relational databases. Open source tools such as Hadoop and Spark made large-scale batch processing more mainstream. Such tools leverage parallel computation on distributed storage to efficiently process data. Hadoop uses an older processing paradigm called MapReduce, while Spark uses directed acyclic graphs (DAGs) to coordinate the processing. Hazelcast also uses DAGs, but adds in-memory speeds to complete the work much more quickly.
Connect to your existing world
Hazelcast treats batch processing as a specific type of stream processing with a finite source and no time windows. Since Hazelcast was built for high-speed processing, that same performance advantage can be applied to batched data. As a result, developers can use the same programming interface for both batch and stream processing, making the transition to streaming straightforward.
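To make this concrete, here is a minimal sketch of a batch job built with the Hazelcast Pipeline API. The map name "transactions", the filter logic, and the embedded-instance setup are illustrative assumptions, not details from the text; the key point is that the source is finite, so the job runs to completion.

```java
// Hypothetical batch job: read a finite IMap, filter, and log the results.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class BatchJob {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        // A finite (batch) source: the current contents of an IMap.
        p.readFrom(Sources.<String, Double>map("transactions"))
         .filter(e -> e.getValue() > 0)   // keep only positive amounts
         .map(e -> e.getKey())
         .writeTo(Sinks.logger());

        HazelcastInstance hz = Hazelcast.bootstrappedInstance();
        hz.getJet().newJob(p).join();     // finite source, so join() returns
        hz.shutdown();
    }
}
```

Because the same Pipeline API also accepts unbounded sources, replacing `Sources.map(...)` with a streaming source turns this batch job into a streaming one with minimal changes.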
Hazelcast runs fast, scales automatically, and handles failures itself without requiring any additional infrastructure. You can fully embed Hazelcast into applications such as data processing microservices, making it easier to build and maintain next-generation systems. Or you can launch each Hazelcast processing job within its own cluster to maximize service isolation.
Contrast Hazelcast with other popular processing technologies. Hadoop and Spark, for example, comprise many components that demand a heavyweight installation and ongoing maintenance effort, making them complex to deploy and manage. Developers must select the right modules and maintain their dependencies, which creates both development and operational challenges.
Solution: The Hazelcast Platform
Hazelcast significantly accelerates batch processing compared to other processing frameworks. Our benchmarks show that Hazelcast delivers extreme speed with extreme efficiency, processing 1 billion events per second on far fewer hardware resources than other technologies. Hazelcast achieves this performance through the combination of a highly optimized directed acyclic graph (DAG) computation model, in-memory processing, data locality, partition mapping affinity, queues, and green threads.
Hazelcast source and sink adapters make it easy to insert Hazelcast into a data processing pipeline. Hazelcast includes pre-built connectors for its in-memory data store (specifically the Map, Cache, and List objects), the Hadoop Distributed File System (HDFS), JDBC, and local data files (e.g., logs or Avro files). You can also create your own connectors to integrate with databases or enterprise applications.
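As a sketch of how the pre-built connectors fit together, the pipeline below reads local files as a batch source and writes the lines into an in-memory Map. The directory path and map name are hypothetical placeholders.

```java
// Hypothetical connector usage: local-file source to IMap sink.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.Util;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class FilesToMap {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        // Batch source: every line of every file in the directory.
        p.readFrom(Sources.files("/var/log/app"))
         .map(line -> Util.entry(line.hashCode(), line)) // build key/value entries
         .writeTo(Sinks.map("logLines"));                // built-in Map sink

        HazelcastInstance hz = Hazelcast.bootstrappedInstance();
        hz.getJet().newJob(p).join();
        hz.shutdown();
    }
}
```

Swapping `Sources.files(...)` for the HDFS or JDBC connector changes only the first stage; the rest of the pipeline is unaffected.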
When a Hazelcast cluster leverages its in-memory data, or is colocated with data stores like HDFS, it makes use of data locality. This means it can process data residing locally on the same hardware server, which improves performance by eliminating network reads/writes to other servers. Other Hazelcast nodes will process their respective local data, so that all nodes in a Hazelcast cluster work together to process the data in parallel.
High-level declarative Java API
The Hazelcast Pipeline API is a general-purpose, declarative API that lets developers compose fast, distributed, concurrent batch processing jobs from building blocks such as mappers, reducers, filters, aggregators, and joiners. It is simple to understand yet powerful. For expert users, Hazelcast also provides the Core API, an edge- and vertex-level API for fine-grained control of data pipelines.
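The building blocks above compose naturally. The sketch below chains a filter, a grouping key, and an aggregator to total per-key values; the map names "pageVisits" and "visitTotals" are illustrative assumptions.

```java
// Hypothetical composition of Pipeline API building blocks:
// filter -> group -> aggregate -> sink.
import java.util.Map;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class VisitTotals {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String, Long>map("pageVisits"))            // finite source
         .filter(e -> e.getValue() > 0)                                 // filter block
         .groupingKey(Map.Entry::getKey)                                // grouping block
         .aggregate(AggregateOperations.summingLong(Map.Entry::getValue)) // aggregator
         .writeTo(Sinks.map("visitTotals"));                            // sink block

        HazelcastInstance hz = Hazelcast.bootstrappedInstance();
        hz.getJet().newJob(p).join();
        hz.shutdown();
    }
}
```

Each stage is a declarative description; Hazelcast translates the whole pipeline into a DAG and executes it in parallel across the cluster.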