The Hazelcast Approach to Stream Processing
Hazelcast provides the key components to build a real-time stream processing application. It is a powerful processing framework for querying data streams on top of an elastic in-memory storage system, where the process may ultimately store its results.
How It Works
Hazelcast processing tasks, called jobs, are distributed across the cluster to parallelize the computations. You can elastically and horizontally scale the cluster based on your performance and volume requirements.
For real-time data enrichment, Hazelcast provides a tight integration with in-memory computing to deliver very high-speed data access. You can store large amounts of data in-memory, which are then joined to the data stream with microsecond latency. Moreover, you can reduce the end-to-end latency by using Hazelcast to store temporary data for stateful stream processing tasks.
Stream Processing Challenges
Stream processing presents unique challenges that are not relevant to batch processing frameworks. Below are several key challenges and details on how Hazelcast addresses each of these.
Event Time and Late Events
Hazelcast supports the notion of “event time” in which events can have their own timestamp and may arrive out of order. To handle out-of-order events and especially late-arriving events, stream processors must keep calculations (i.e., aggregations) open until all events have arrived. However, stream processors cannot know if all events have arrived, so they need to discard any extremely late events. To define what constitutes an “extremely late” event, Hazelcast sets a “watermark” that marks a time window in which late-arriving events can still be processed in the appropriate aggregation window. Events arriving from beyond the watermarked time window are discarded.
Streaming Fault Tolerance
Fault tolerance in stream processing systems must deal with preserving data that is not necessarily stored in any permanent medium. This means that stream processors need to know how to handle failures with data-in-motion, or else data can be lost. Hazelcast is fault-tolerant for issues such as network failures, splits, and node failures. When there is a fault, Hazelcast uses the latest state snapshot and automatically restarts all jobs that contain the failed member as a job participant from this snapshot. With in-memory snapshots saved to distributed in-memory storage, Hazelcast resumes processing where it left off. Distributed in-memory storage is an integral component of the cluster. Multiple replicas of data are stored in a distributed manner across the cluster to increase the cluster’s resiliency.
Event processing systems have to balance tradeoffs in performance and correctness, and some systems may not allow firm processing guarantees, which can make it difficult to program these systems.
Hazelcast allows you to choose a processing guarantee when you start a job. While there is some performance tradeoff with the higher guarantees, Hazelcast still provides superior processing speed while honoring the chosen guarantee. Hazelcast provides exactly-once processing (the slowest but most correct), at-least-once processing, or no guarantee of correctness (the fastest option).