Faceoff: Hazelcast Platform vs. Apache Flink

Comparing Hazelcast Platform with Apache Flink feels like an Avengers vs. Superman match!

Stream processing is one of the foundational components of the Hazelcast Platform, which also includes proven, high-performance, distributed, low-latency storage. Flink, in comparison, is recognized as a “solo” stream processing framework.

Both Flink and Hazelcast are DAG-based stream processing systems. For the uninitiated, a directed acyclic graph (DAG) models the flow of data and is an ideal fit for defining streaming data pipelines. Initially, streaming data pipelines were developed in Java, which allowed stateful and stateless operators to be composed for standard tasks such as data filtering and aggregation. As stream processing has gained widespread adoption among data engineers, Flink and Hazelcast have added support for SQL and Python. Behind the scenes, data pipelines developed in SQL or Python still get translated into a Java-based DAG. The operators in the DAG can be distributed, scaled, and executed across the cluster for high throughput and low latency.
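
To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use the Hazelcast or Flink APIs; the operator names, the sample transactions, and the `run` helper are all assumptions made up for illustration. It wires a source, a stateless filter, and a stateful aggregation into a graph and executes the nodes in topological order.

```python
from collections import defaultdict
from graphlib import TopologicalSorter

# Hypothetical operators for a three-stage pipeline:
# source -> filter (stateless) -> aggregate (stateful).
def source(_):
    return [("alice", 20.0), ("bob", 5.0), ("alice", 250.0), ("bob", 900.0)]

def keep_large(events):
    # Stateless: each event is judged on its own.
    return [event for event in events if event[1] >= 10.0]

def total_per_user(events):
    # Stateful: running totals are carried from one event to the next.
    totals = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)

OPERATORS = {"source": source, "filter": keep_large, "aggregate": total_per_user}

# The DAG itself, written as node -> set of upstream nodes.
DAG = {"source": set(), "filter": {"source"}, "aggregate": {"filter"}}

def run(dag, operators):
    """Execute every operator once, upstream nodes first."""
    results = {}
    for node in TopologicalSorter(dag).static_order():
        upstream = [results[parent] for parent in dag[node]]
        results[node] = operators[node](upstream[0] if upstream else None)
    return results

results = run(DAG, OPERATORS)
print(results["aggregate"])
```

A real engine would run each operator as many parallel instances spread over the cluster and stream events between them continuously; this sketch only shows how the graph determines the order in which data flows.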

 

[Figure: Hazelcast vs. Flink architecture diagram]

In brief, both Hazelcast and Flink support building data pipelines in Java, SQL, or Python. Because a distributed computing environment can lose a cluster member or even the entire cluster, both use sophisticated checkpointing techniques for fault tolerance and state management. Both also support exactly-once processing guarantees, so no data is lost or duplicated.
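
The link between checkpointing and exactly-once results can be sketched in a few lines of plain Python. This is not how either engine implements snapshots; the `CountingJob` class and the sample events are hypothetical. The idea it illustrates is that the source position and the operator state are saved together, so a restart replays only the events that the restored state has not yet seen.

```python
import copy

# Hypothetical replayable source: a list we can re-read from any offset.
EVENTS = [("alice", 10), ("bob", 5), ("alice", 7), ("bob", 1), ("alice", 3)]

class CountingJob:
    def __init__(self):
        self.offset = 0    # next source position to read
        self.totals = {}   # operator state: running total per user

    def process_until(self, end):
        while self.offset < end:
            user, amount = EVENTS[self.offset]
            self.totals[user] = self.totals.get(user, 0) + amount
            self.offset += 1

    def snapshot(self):
        # Offset and state are captured together, so they stay consistent.
        return {"offset": self.offset, "totals": copy.deepcopy(self.totals)}

    @classmethod
    def restore(cls, snap):
        job = cls()
        job.offset = snap["offset"]
        job.totals = copy.deepcopy(snap["totals"])
        return job

# Reference run without any failure.
reference = CountingJob()
reference.process_until(len(EVENTS))

# Run with a failure: checkpoint after two events, process two more, crash.
job = CountingJob()
job.process_until(2)
checkpoint = job.snapshot()
job.process_until(4)  # this progress is lost in the simulated crash

# Recovery: restore the checkpoint and replay from the saved offset.
recovered = CountingJob.restore(checkpoint)
recovered.process_until(len(EVENTS))
print(recovered.totals == reference.totals)
```

Note that this only covers the effect of each event on internal state. It assumes a source that can be replayed from an offset, and results written to an external sink need additional care (for example transactional or idempotent writes) to avoid duplicates.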

Also, both support the Kappa architecture, providing a single environment for working with batch and streaming data sources. They provide connectors to common streaming and batch data sources, including Apache Kafka and JDBC-compliant databases. They also have built-in optimizations for bounded datasets. A data pipeline can run in parallel, and individual operators can be scaled independently to increase throughput. Best of all, these optimizations are performed automatically by the underlying DAG engine.

Now let’s talk about the differences between Apache Flink and Hazelcast Platform.

Example use case:

To make the comparison more tangible, we’ll use a Fraud Management application that monitors transactions for millions of users and applies various complex rules.

Storage for Contextual Data

Rarely do streaming jobs run in isolation. A data pipeline often needs to be enriched with contextual information. Our stream of transactions originating from a point-of-sale (PoS) terminal needs to be enhanced with contextual data, specifically the customer’s profile and account information. Flink needs external storage for this contextual data, and it is often integrated with an external data store such as Redis to provide it.
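
As a rough illustration of enrichment, the sketch below joins each transaction with a customer profile before applying a rule. It is plain Python, not engine code: the `Transaction` and `Profile` fields, the sample data, and the rule itself are assumptions. The `PROFILES` dictionary stands in for whatever holds the contextual data, whether that is an external store or storage built into the platform.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transaction:
    card_id: str
    amount: float
    country: str

@dataclass(frozen=True)
class Profile:
    customer: str
    home_country: str
    daily_limit: float

# Stand-in for the contextual data store, keyed by card id (hypothetical data).
PROFILES = {
    "card-1": Profile("Alice", "DE", 500.0),
    "card-2": Profile("Bob", "DE", 500.0),
}

TRANSACTIONS = [
    Transaction("card-1", 40.0, "DE"),
    Transaction("card-2", 900.0, "DE"),
    Transaction("card-1", 25.0, "BR"),
    Transaction("card-9", 10.0, "DE"),  # no profile on record
]

def enrich(transactions, profiles):
    """Pair each transaction with its profile; skip cards we know nothing about."""
    for tx in transactions:
        profile = profiles.get(tx.card_id)
        if profile is not None:
            yield tx, profile

def is_suspicious(tx, profile):
    # Illustrative rule only: foreign country or amount above the daily limit.
    return tx.country != profile.home_country or tx.amount > profile.daily_limit

flagged = [tx.card_id for tx, profile in enrich(TRANSACTIONS, PROFILES)
           if is_suspicious(tx, profile)]
print(flagged)
```

Every event triggers one lookup, so the latency of the contextual store sits directly on the pipeline's critical path. That is why it matters where this data lives.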

State Management

Another difference between Flink and Hazelcast is how they manage state in stream processing. “State” holds the interim results of stateful stream processing. As our fraud management model becomes more complex, we need to compute aggregates and keep the interim results for longer periods, such as tracking average transaction amounts on an hourly, daily, and weekly basis. The window stores these interim results in state and writes the final results to external storage at the end of each time period.
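
The following plain-Python sketch shows what that interim state looks like for an hourly average. It is a simplification under stated assumptions: events arrive ordered by timestamp (real engines handle out-of-order events with watermarks), the `emitted` list stands in for external storage, and the function name and sample events are hypothetical.

```python
WINDOW_SECONDS = 3600  # one-hour tumbling windows

def hourly_averages(events):
    """events: iterable of (timestamp_seconds, user, amount), ordered by timestamp."""
    state = {}             # interim results: user -> (sum, count) for the open window
    current_window = None  # index of the window that is currently open
    emitted = []           # stand-in for external storage

    def flush():
        # Window closed: turn interim sums into final averages and write them out.
        for user, (total, count) in sorted(state.items()):
            emitted.append((current_window * WINDOW_SECONDS, user, total / count))
        state.clear()

    for ts, user, amount in events:
        window = ts // WINDOW_SECONDS
        if current_window is not None and window != current_window:
            flush()
        current_window = window
        total, count = state.get(user, (0.0, 0))
        state[user] = (total + amount, count + 1)

    if state:
        flush()  # emit the last open window at end of input
    return emitted

events = [
    (0, "alice", 10.0),
    (1800, "alice", 30.0),
    (2000, "bob", 8.0),
    (3600, "alice", 100.0),
]
print(hourly_averages(events))
```

The state here is tiny, but with millions of users and daily or weekly windows the `state` dictionary becomes the dominant cost, which is where the choice of state store starts to matter.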

To support state management, Flink provides HashMapStateBackend and EmbeddedRocksDBStateBackend. The EmbeddedRocksDBStateBackend is encouraged for jobs with very large state, long windows, or large key/value states [2]. With this backend, every read and write must go through de-/serialization to retrieve or store the state objects. Hazelcast Platform is much simpler, as it does not require integration with a separate state store such as RocksDB. It eliminates the cost of data de-/serialization by storing the state data in its fast data store, and it can also leverage Tiered Storage to keep data in memory and on disk.
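
To show where the de-/serialization cost comes from, here is a small plain-Python sketch of two toy state stores: one keeps live objects in memory, the other keeps only bytes, similar in spirit to a byte-oriented key/value store. Neither class is a Flink or Hazelcast API; `HeapState`, `SerializedState`, and the `serde_calls` counter are assumptions introduced purely for illustration.

```python
import pickle

class HeapState:
    """Keeps live objects in a dict; reads and writes touch them directly."""
    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value

class SerializedState:
    """Keeps bytes only, so every access pays for a (de)serialization step."""
    def __init__(self):
        self._data = {}
        self.serde_calls = 0  # counts serialize + deserialize operations

    def get(self, key, default=None):
        raw = self._data.get(key)
        if raw is None:
            return default
        self.serde_calls += 1
        return pickle.loads(raw)

    def put(self, key, value):
        self.serde_calls += 1
        self._data[key] = pickle.dumps(value)

def running_average(state, amounts):
    """Update a (sum, count) pair for one user, as a stateful operator would."""
    for amount in amounts:
        total, count = state.get("alice", (0.0, 0))
        state.put("alice", (total + amount, count + 1))
    total, count = state.get("alice")
    return total / count

heap = HeapState()
serialized = SerializedState()
amounts = [10.0, 30.0, 50.0]
print(running_average(heap, amounts), running_average(serialized, amounts))
print(serialized.serde_calls)
```

Both stores produce the same answer; the difference is the extra work on each access. The byte-oriented design has its own advantage, since state no longer has to fit on the heap, so this is a trade-off rather than a strict win for either side.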

DIY Stream Processing vs. Unified Real-Time Data Platform

Both Hazelcast Platform and Apache Flink provide sophisticated stream processing capabilities based on event-driven architectures. Both support Java, SQL and Python for developing real-time data pipelines and provide connectors for streaming and batch data sources. Hazelcast’s unified real-time data platform seamlessly integrates stream processing and fast data storage. Flink requires integration with an external data store, whereas Hazelcast Platform provides integrated storage and state management, reducing the cost and complexity of managing a real-time application.

Hazelcast also provides flexibility in its deployment methods. You can embed Hazelcast Platform as a Java library directly in your application. In this mode, Hazelcast runs as a fast data store within the same JVM as your application.

 

Conclusion

With Hazelcast, you can harness the full potential of real-time data without the complexity of integrating multiple software components. Our unified platform handles growth demands, unexpected load spikes, hardware failures across many components, downtime, and ongoing administrative tasks. What’s more, it integrates with your existing infrastructure, so there’s no need to rip and replace technology to give your applications the ability to act instantly on data in motion.

We look forward to your feedback and comments about this blog post! Share your experience with us in the Hazelcast community Slack and the Hazelcast GitHub repository. Hazelcast also runs a weekly live stream on Twitch, so give us a follow to get notified when we go live.

 

Curious about the performance comparison between Hazelcast and Apache Flink? Read more in our Billion Events Per Second with Millisecond Latency whitepaper.