Blog ›Hazelcast Jet 0.5: Fault Tolerant Stateful Stream Processing Made Easy

By Ensar Basri Kahveci

Distinguished Engineer

Ensar’s primary areas of interest are distributed data, replication, consistency, and storage. He has more than seven years of hands-on expertise in designing, developing, and testing distributed algorithms, with solid experience in concurrency. He has authored a number of articles on distributed data and stream processing, and is a frequent speaker at industry conferences on topics such as replication and distributed systems. Several of his talks can be found on YouTube, including Replication Distilled, Distributed Systems for Mere Mortals, and Replication in the Wild. Ensar is a Ph.D. candidate in computer science at Bilkent University in Ankara, Turkey.

View all blogs by the author

Dec 6, 2017

Back to Blog

Hazelcast Jet 0.5: Fault Tolerant Stateful Stream Processing Made Easy

Stream processing is an emerging computational paradigm for on-the-fly processing of live data feeds, targeting low latency and high throughput. Streaming applications are usually deployed on multiple servers to achieve these requirements. Since even a single failure may lead to incorrect results or long interruptions in result delivery, fault tolerance is of paramount importance in such long-running distributed applications.

With the new 0.5 release, Hazelcast Jet capitalizes on the Hazelcast IMDG platform to simplify fault-tolerant execution of streaming computations. In this blog post, we briefly describe the new high availability and fault tolerance features of Hazelcast Jet.

Highly Available Streaming Computations

Stream processing clusters contain components with heterogeneous roles, wherein a master node coordinates the execution of submitted jobs and possibly multiple worker nodes perform the actual execution. In this setting, a number of availability concerns arise for different components. For instance, a set of master nodes are deployed to eliminate single point of failure. One of these nodes is elected as leader and other nodes wait in standby until the failure of the leader. Relatedly, multiple worker nodes are run for parallel and distributed execution of the jobs. In case of a worker failure, its workload is re-assigned to a standby worker or distributed to other active workers.

In this model, stream processing engines generally depend on external coordination services, such as Apache ZooKeeper, for leader election, and replication of cluster and job information. Although coordination services are a good fit for such use cases, they bring additional complexity into the system. Hazelcast Jet frees the developers from the burden of configuring and maintaining coordination services for different environments and deployment settings. Hazelcast Jet does not depend on any external service and maintains job information in internal Hazelcast IMDG data structures, namely IMap instances. When a job is submitted, the job information is distributed between a set of IMap instances. Since the IMap data structure replicates data entries to multiple servers, Hazelcast Jet enables fault tolerance for job information. The presence of the master node is not visible to the end users. Users do not designate any Hazelcast Jet node for job coordination. Hazelcast Jet clusters coordinate execution of the submitted jobs transparently and automatically. Internally, a single Hazelcast Jet node, called coordinator, takes the responsibility of coordinating job executions. If the coordinator node fails for any reason, another node is automatically promoted. Relatedly, jobs are run on multiple nodes. If a node fails during execution, the job is automatically restarted on the remaining nodes.

So with these mechanisms, we transparently handle job execution and worker failover. But what about data processing guarantees?

Enabling Fault Tolerant Stateful Computations

Hazelcast Jet realizes fault tolerant stateful streaming computations with a new snapshotting mechanism. The new mechanism takes a snapshot of the state of a running streaming job at regular intervals. A snapshot contains a consistent view of the state of all processors at some point in time. In case of a node failure, the coordinator stops the current execution and restarts the job with the latest successful snapshot. When a job is restarted from a snapshot, the nodes recover processor state efficiently from local memory. The processors are initialized with the state recovered from the snapshot and the input streams are rolled back. The snapshotting feature is an efficient implementation of Lamport and Chandy’s “Distributed Snapshots: Determining Global States of a Distributed System” paper.

New methods were added to the Processor API for snapshotting. Upon a snapshot persistence call, a processor passes its internal state to the engine as key-value objects. Then, the processor state is put into internal IMap data structures as part of the snapshot. The new fault tolerance mechanism provides a choice between, none, exactly-once and at-least-once semantics for event processing. Source and sink processors must cooperate with the snapshotting mechanism in order to achieve the exactly-once semantics.

Conclusions

The 0.5 release of Hazelcast Jet opens new doors for highly available and fault tolerant streaming systems. Hazelcast Jet does not depend on any other external storage system for high availability and snapshotting. Additionally, in-memory snapshots pave the way for minimally disruptive snapshotting and fast recovery after failures. For the next release, we will be working on improvements and optimizations in the snapshotting mechanism.

Keep Reading

Blog

Decisions at the Speed of Memory: Hazelcast on IBM® LinuxONE 5

Network hops cost milliseconds; milliseconds cost money. Put data, compute, and AI on one platform, and both bills shrink, whether…

Blog

Understanding the Value of Distributed Compute

Introduction Hazelcast is a powerful platform. It delivers the power of a highly reliable, distributed cache. Equally important is the…

Blog

Resilience That Holds Under Load: Hazelcast Platform 5.7

A major release for institutions where the operational state must remain correct during degradation, not just be restored afterward. The…

Blog

Testing distributed resilient applications powered by Hazelcast

Applications powered by Hazelcast and that use it to drive business logic need tests that go beyond happy-path validation. Serialization,…

Datasheet

/ PDF

/ 2 pages

Resilient, Continuous, Active Data – without Compromise Datasheet

The unified in-memory and stream processing platform for resilient, continuous active data at sub-millisecond speed.

Webinar

/ Video

/ 45 min

Zero Downtime, Real Pain: Schema Evolution in Cached, Live Systems

Zero-downtime upgrades aren’t the hard part—schema evolution is. Learn how mixed service versions interact with shared cached data, why subtle inconsistencies cause failures, and how to design forward-compatible changes using Hazelcast and real Java examples.

Platform

Cloud Deployment Options

Key Solutions

By Industry

By Use Case

By Architecture

A cloud-agnostic architecture for your applications

Resource Center

Content Types

Learn

33% Reduction in Operational Costs

Developers

Community

Learn

Toolbox

A cloud-agnostic architecture for your applications

By Ensar Basri Kahveci

Hazelcast Jet 0.5: Fault Tolerant Stateful Stream Processing Made Easy

Highly Available Streaming Computations

Enabling Fault Tolerant Stateful Computations

Conclusions

Keep Reading

Decisions at the Speed of Memory: Hazelcast on IBM® LinuxONE 5

Understanding the Value of Distributed Compute

Resilience That Holds Under Load: Hazelcast Platform 5.7

Testing distributed resilient applications powered by Hazelcast

Resilient, Continuous, Active Data – without Compromise Datasheet

Zero Downtime, Real Pain: Schema Evolution in Cached, Live Systems

Why Hazelcast

About Us

Platform

Solutions

Developers

Learn

Connect

Platform

Cloud Deployment Options

Key Solutions

By Industry

By Use Case

By Architecture

A cloud-agnostic architecture for your applications

Resource Center

Content Types

Learn

33% Reduction in Operational Costs

Developers

Community

Learn

Toolbox

A cloud-agnostic architecture for your applications

By Ensar Basri Kahveci

Spread the Word

Hazelcast Jet 0.5: Fault Tolerant Stateful Stream Processing Made Easy

Highly Available Streaming Computations

Enabling Fault Tolerant Stateful Computations

Conclusions

Keep Reading

Decisions at the Speed of Memory: Hazelcast on IBM® LinuxONE 5

Understanding the Value of Distributed Compute

Resilience That Holds Under Load: Hazelcast Platform 5.7

Testing distributed resilient applications powered by Hazelcast

Resilient, Continuous, Active Data – without Compromise Datasheet

Zero Downtime, Real Pain: Schema Evolution in Cached, Live Systems

Why Hazelcast

About Us

Platform

Solutions

Developers

Learn

Connect