Real-Time Stream Processing
What is Stream Processing?
Stream processing is a technique to process the data on-the-fly, prior to it’s storage. This is in contrast with traditional batch approach, where the data set has to be completely available and stored in the database or file before the processing starts.
This approach is vital when the value of information contained in the data decreases rapidly as the data ages. The faster the information is extracted from the data and provided to consumers the better. Typical application use cases include:
- Log analysis and monitoring
- Fraud detection
- Anomaly detection (IoT systems, sensors)
- Fast business insights
- Cleaning the data for downstream processing (filtering, modifying, normalising, enriching)
- Real-time ad placement and reporting
- Real-time recommendations
- Online gaming stats
- Payment Processing
In these use cases, processing data fast is of the same importance as processing vast volumes of data.
Data streams are potentially unbounded and infinite sequences of records, and records usually represent events or changes that happen in time. Stream processing applications are observing flowing records and literally query the stream for relevant data in near real-time.
Typical stream processing tasks:
- Algorithmic analysis of the stream data
- Joining multiple streams
- Enriching stream with other information
- Implementing Event Sourcing and Command Query Responsibility Segregation Architectures
- Moving batch tasks to near real-time
Hazelcast Jet and Stream Processing
Hazelcast Jet provides the tooling necessary to build a streaming application. It gives you a powerful processing framework to query the data stream and elastic in-memory storage to store the results of the computation.
Jet processing tasks, called Jobs, are distributed across the Jet cluster to parallelize the computation. Jet is able to scale out this way to process large data volumes.
For high speed enrichment, Jet has very high speed integration with Hazelcast IMDG. You can store large amounts of data which are then joined to the Jet stream with microsecond time. Moreover, the end-to-end latency can be consequently lowered by using Hazelcast IMDG for stream ingestion or for publishing the results.
The Challenges of Stream Processing
Dealing with streaming data is fundamentally different than batch or micro-batch processing as both input and output is continuous. Most streaming computations also deal with some notion of time where you are interested in how a value changes over time. The typical way to deal with streaming data is to look at it in terms of “windows”, where a window represents a slice of the data stream, usually constrained for a period of time.
Jet supports Tumbling, Sliding and Sessions Windows.
Event Time and Late Events
Jet also supports the notion of “event-time” where events can have their own timestamp and can arrive out of order. This is achieved by inserting watermarks into the stream of events which drive the passage of time forward.
Fault tolerance is an important concept in stream processing where jobs are run without a definite end and node failures can cause disruption. Jet introduces a simple way to do fault tolerant streaming computation without relying on any external system or storage, and instead using the distributed in-memory storage provided by Hazelcast.
Jet jobs are restarted automatically when a node leaves the cluster. Using in-memory snapshots, processing can be resumed where it left off.
Because of the need to trade-off performance and correctness, event processing systems may not allow firm guarantees which can make it harder to program these systems.
Jet allows you to choose the processing guarantee at the time you start the job, choosing between the following, from fastest to slowest:
- No guarantee