Hazelcast Platform Features
DISTRIBUTED COMPUTATION | Hazelcast Platform - Enterprise Edition
Pipeline API is the primary API of Hazelcast for processing both bounded and unbounded data. Use it to declare the data processing pipelines by composing high-level operations (such as map, filter, group, join, window) on a stream of records. The Pipeline API is a Java 8 API with static type safety.
Core API is a low-level API that directly exposes the computation engine’s raw features (DAGs, partitioning schemes, vertex parallelism, distributed vs. local edges, etc.).
Hazelcast is built on top of a low-latency streaming core. Low latency here means processing incoming records as soon as possible, as opposed to accumulating the records into micro-batches before processing.
Because data streams are unbounded and may contain infinite sequences of records, a tool is required to group individual records into finite frames so that a computation can run on them. Hazelcast windows provide a tool to define this grouping.
Types of windows supported by Hazelcast include tumbling, sliding, and session windows.
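To illustrate how records can be grouped into finite frames, here is a minimal plain-Java sketch of window assignment by timestamp. It is a conceptual illustration only, not Hazelcast's implementation; the class and method names (WindowSketch, tumblingStart, slidingStarts) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual sketch of window assignment; not Hazelcast's implementation. */
final class WindowSketch {

    /** Start of the tumbling window [start, start + size) that contains the timestamp. */
    static long tumblingStart(long timestamp, long size) {
        return Math.floorDiv(timestamp, size) * size;
    }

    /**
     * Starts of all sliding windows [start, start + size) that contain the timestamp,
     * where a new window begins every `slide` time units. Newest window first.
     */
    static List<Long> slidingStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long latest = Math.floorDiv(timestamp, slide) * slide;
        for (long start = latest; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }
}
```

With a window size of 1000 and a slide of 500, a record stamped 1250 falls into one tumbling window (starting at 1000) and two overlapping sliding windows (starting at 1000 and 500).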
Event Time Processing
Hazelcast allows you to classify records in a data stream based on the timestamp embedded in each record (the event time). Event time processing is a natural requirement because users are mostly interested in handling data according to when each event originated, rather than when it arrived for processing. Event time processing is a first-class citizen in Hazelcast.
For handling late events, there is a set of policies that determine whether an event is “still on time” or “late”. Late events are discarded.
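One simple lateness policy can be sketched in plain Java as follows. The sketch assumes a watermark that trails the highest event time seen so far by a fixed allowed lag; events older than the watermark count as late. The class name LatenessSketch and this particular policy are illustrative assumptions, not a description of Hazelcast's internals.

```java
/** Conceptual sketch of a lateness policy based on a lagging watermark. */
final class LatenessSketch {
    private final long allowedLag;
    private long watermark = Long.MIN_VALUE;

    LatenessSketch(long allowedLag) {
        this.allowedLag = allowedLag;
    }

    /** Returns true if the event is still on time, false if it is late and should be discarded. */
    boolean accept(long eventTime) {
        if (eventTime < watermark) {
            return false; // older than the watermark: late
        }
        // The watermark only moves forward, trailing the newest event by the allowed lag.
        watermark = Math.max(watermark, eventTime - allowedLag);
        return true;
    }
}
```

A larger allowed lag tolerates more disorder in the stream, at the cost of waiting longer before a window's result can be considered final.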
Handling Back Pressure
In a streaming system, it is necessary to control the flow of messages. A consumer must not be flooded with more messages than it can process in a fixed amount of time. On the other hand, processors should not be left idle, wasting resources.
Hazelcast comes with a mechanism to handle back pressure. Every part of the Hazelcast job keeps signaling to all the upstream producers how much free capacity it has. This information is naturally propagated upstream to keep the system balanced.
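The idea of signaling free capacity upstream can be sketched in plain Java. In this hypothetical example, a stage has a bounded inbox and the sender pushes only as many items as the stage reports it can take. The names (BackPressureSketch, Stage, send) are assumptions for illustration and do not mirror Hazelcast's internal protocol.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Conceptual sketch of capacity-based flow control between two stages. */
final class BackPressureSketch {

    /** A consumer with a bounded inbox that advertises its free capacity upstream. */
    static final class Stage {
        private final Queue<Integer> inbox = new ArrayDeque<>();
        private final int capacity;

        Stage(int capacity) {
            this.capacity = capacity;
        }

        int freeCapacity() {
            return capacity - inbox.size();
        }

        void receive(int item) {
            if (freeCapacity() == 0) {
                throw new IllegalStateException("inbox overflow");
            }
            inbox.add(item);
        }

        /** Processes up to n queued items, which frees capacity; returns how many were processed. */
        int drain(int n) {
            int done = 0;
            while (done < n && inbox.poll() != null) {
                done++;
            }
            return done;
        }
    }

    /** Sends only as many pending items as the downstream stage says it can accept. */
    static int send(Queue<Integer> pending, Stage downstream) {
        int sent = 0;
        while (!pending.isEmpty() && downstream.freeCapacity() > 0) {
            downstream.receive(pending.poll());
            sent++;
        }
        return sent;
    }
}
```

The sender never overruns the consumer, and as soon as the consumer drains its inbox the sender can resume, so neither side sits idle longer than necessary.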
Exactly-Once or At-Least-Once
Hazelcast supports distributed state snapshots. Snapshots are periodically created to back up the running state. Periodic snapshots are used as a consistent point of recovery for failures. Snapshots are also taken and used for up-scaling.
For snapshot creation, exactly-once or at-least-once semantics can be used. This is a trade-off between correctness and performance. It is configured per job.
Hazelcast ensures exactly-once semantics when a replayable source (e.g., Kafka) is used with an idempotent sink (e.g., any store with upsert functionality).
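Why a replayable source combined with an idempotent sink yields an exactly-once result can be shown with a small plain-Java simulation. The sketch assumes a list as the replayable source, a snapshot that records only the source offset, and a map with keyed upserts as the sink. All names are hypothetical, and the failure is simulated rather than real.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Conceptual sketch: replay from a snapshot offset into an idempotent (upsert) sink. */
final class ExactlyOnceSketch {

    /** Writes records [fromOffset, toOffset) to the sink; keyed upserts make rewrites harmless. */
    static void process(List<String> source, int fromOffset, int toOffset, Map<Integer, String> sink) {
        for (int offset = fromOffset; offset < toOffset; offset++) {
            sink.put(offset, source.get(offset)); // upsert keyed by source offset
        }
    }

    /**
     * Simulates a job that fails after writing `crashOffset` records and then restarts
     * from the last snapshot, taken at `snapshotOffset` (snapshotOffset <= crashOffset).
     */
    static Map<Integer, String> runWithFailure(List<String> source, int snapshotOffset, int crashOffset) {
        Map<Integer, String> sink = new HashMap<>();
        process(source, 0, crashOffset, sink);                 // first attempt, interrupted
        process(source, snapshotOffset, source.size(), sink);  // restart: replay from snapshot
        return sink;
    }
}
```

Records between the snapshot and the failure are written twice, but because each write overwrites the same key, the sink ends up with every record exactly once.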
Two-Phase Commit for Exactly-Once
Hazelcast supports the two-phase commit protocol, which extends exactly-once guarantees to more types of sources and sinks by letting them participate in transaction-based streaming.
This capability currently adds JMS as a source, and both Kafka and files as sinks, with more planned.
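For readers unfamiliar with the protocol, the following plain-Java sketch shows the general shape of a two-phase commit: every participant must vote yes in the prepare phase before any of them commits. This is a generic, simplified illustration (no timeouts, no recovery log); the interface and class names are hypothetical and are not Hazelcast APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual sketch of a two-phase commit coordinator. */
final class TwoPhaseCommitSketch {

    interface Participant {
        boolean prepare();  // phase 1: vote on whether the pending work can be committed
        void commit();      // phase 2: make the pending work visible
        void rollback();    // discard the pending work
    }

    /** Commits all participants only if every one of them votes yes in the prepare phase. */
    static boolean run(List<? extends Participant> participants) {
        List<Participant> prepared = new ArrayList<>();
        for (Participant p : participants) {
            if (!p.prepare()) {
                prepared.forEach(Participant::rollback); // a single "no" vote aborts everything
                return false;
            }
            prepared.add(p);
        }
        prepared.forEach(Participant::commit);
        return true;
    }

    /** A toy transactional sink that buffers writes until they are committed. */
    static final class BufferingSink implements Participant {
        final List<String> pending = new ArrayList<>();
        final List<String> committed = new ArrayList<>();
        private final boolean healthy;

        BufferingSink(boolean healthy) {
            this.healthy = healthy;
        }

        void write(String record) {
            pending.add(record);
        }

        @Override public boolean prepare() { return healthy; }
        @Override public void commit() { committed.addAll(pending); pending.clear(); }
        @Override public void rollback() { pending.clear(); }
    }
}
```

Because nothing becomes visible until all participants have prepared successfully, a failure in any one sink leaves no partial output behind.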
Hazelcast is able to tolerate faults such as network failures, network splits, or node failures through redundancy in the cluster. This is just one of the business continuity capabilities in Hazelcast that let customers achieve 99.999% uptime, with many customers reporting zero downtime in production.
When there is a fault, Hazelcast automatically restarts every job that had the failed member as a participant, resuming from the latest state snapshot.
Resilient Snapshot Storage
Hazelcast uses its distributed in-memory storage to store the snapshots. This storage is an integral component of the Hazelcast cluster, so no further infrastructure is necessary. Data is stored in multiple replicas distributed across the cluster to increase resiliency.
In-Memory Data Store Integration
Hazelcast has an integrated, elastic in-memory data store that provides highly optimized reads and writes to distributed, in-memory implementations of java.util.Map, javax.cache.Cache and java.util.List. The in-memory store is to be used for:
Embedded In-Memory Storage
The Hazelcast in-memory data store is embedded, so all the services of the in-memory store are available to your Hazelcast jobs without any additional deployment effort.
To isolate the processing from the storage, you can still make use of Hazelcast processing in a separate cluster reading from or writing to remote Hazelcast in-memory-only clusters.
Streaming from In-Memory Data Store
Hazelcast includes a connector that processes the stream of changes (the Event Journal) of an IMap or ICache. This enables developers to stream process IMap/ICache data, or to use the Hazelcast in-memory data store as storage for data ingestion.
Hazelcast In-Memory Data Structures
Hazelcast provides high-performance readers and writers for Hazelcast IMap, ICache and IList. The IMap and ICache are partitioned and distributed. Hazelcast makes use of data locality, reading the data from the same node to avoid a network transit penalty.
The streaming connectors for IMap and ICache allow the user to treat the Hazelcast distributed map itself as a streaming source, where an event is created for every change that happens on the map. This allows the map to be used as a source of events during a streaming job.
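The notion of a map that emits an event for every change can be sketched in plain Java. This toy version keeps an unbounded, single-node journal, which is a simplifying assumption; the class names (JournaledMapSketch, Change) are hypothetical and this is not how IMap or its Event Journal is implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Conceptual sketch of a map that records every change in a journal. */
final class JournaledMapSketch<K, V> {

    /** One change event: newValue is null when the entry was removed. */
    static final class Change<K, V> {
        final K key;
        final V oldValue;
        final V newValue;

        Change(K key, V oldValue, V newValue) {
            this.key = key;
            this.oldValue = oldValue;
            this.newValue = newValue;
        }
    }

    private final Map<K, V> data = new HashMap<>();
    private final List<Change<K, V>> journal = new ArrayList<>();

    V put(K key, V value) {
        V old = data.put(key, value);
        journal.add(new Change<>(key, old, value));
        return old;
    }

    V remove(K key) {
        V old = data.remove(key);
        if (old != null) {
            journal.add(new Change<>(key, old, null));
        }
        return old;
    }

    /** Returns all change events starting at the given sequence number, oldest first. */
    List<Change<K, V>> readFrom(int sequence) {
        return new ArrayList<>(journal.subList(sequence, journal.size()));
    }
}
```

A streaming job can treat the journal as its source: it remembers the sequence number it has read up to and keeps reading from there, so the map doubles as an event stream.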
Hazelcast utilizes message brokers for ingesting data streams and it is able to work as a data processor connected to a message broker in the data pipeline.
Hazelcast comes with a Kafka connector for reading from and writing to the Kafka topics.
Java Message Service (JMS) is a traditional means of implementing enterprise integration. The Hazelcast JMS connector allows you to stream messages from/to a JMS queue or a JMS topic using a JMS client on the classpath (such as ActiveMQ or RabbitMQ). Reading from the queue can be parallelized for higher throughput.
The Hazelcast JDBC connector can be used to read data from, or write data to, relational databases or any other source that supports the standard JDBC API. It is a batch connector that executes a SQL query and sends the result to the Hazelcast pipeline. It supports parallel reading for partitioned sources.
Hadoop Distributed File System (HDFS) is a common file system used for building large, low-cost data warehouses and data lakes. Hazelcast can use HDFS as either a data source or a data sink. If the Hazelcast and HDFS clusters are co-located, Hazelcast benefits from data locality and processes the data on the same node without incurring a network transit latency penalty.
Hazelcast can read and write Avro-serialized data from the self-contained files (Avro Object Container format), HDFS and Kafka. A Kafka connector can be configured to use the schema registry.
Local Data Files
Hazelcast comes with batch and streaming file readers to process local data (e.g. CSVs or logs). The batch reader processes lines from a file or directory. The streaming reader watches the file or directory for changes, streaming the new lines to Hazelcast.
The socket connector allows Hazelcast jobs to read text data streams from the socket. Every line is processed as one record.
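The line-per-record convention can be sketched in plain Java with standard socket I/O. This is an illustration of the idea rather than the Hazelcast connector itself; the class and method names are hypothetical, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Conceptual sketch: treat each text line arriving on a stream as one record. */
final class SocketLinesSketch {

    /** Reads the stream until it ends and hands every line to the callback as one record. */
    static void readRecords(InputStream in, Consumer<String> onRecord) {
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                onRecord.accept(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Connects to host:port and processes the incoming text line by line. */
    static void readFromSocket(String host, int port, Consumer<String> onRecord) {
        try (Socket socket = new Socket(host, port)) {
            readRecords(socket.getInputStream(), onRecord);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```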
Custom Sources and Sinks
Hazelcast provides a flexible API that makes it easy to implement your own custom sources and sinks. Code samples are available to use as templates.
Kafka Connect Modules
Hazelcast supports the use of any Kafka Connect module without the presence of a Kafka cluster. This adds more sources and sinks to the Hazelcast ecosystem. This feature includes full support for fault-tolerance and replaying.
Change Data Capture Integration with Databases
Hazelcast supports change data capture through integration with Debezium and Striim. Debezium provides CDC integration with SQL Server, MySQL, PostgreSQL, MongoDB, and Cassandra. Striim provides CDC integration with Oracle.
Management Center enables you to monitor and manage your Hazelcast cluster. In addition to monitoring the overall health of your cluster, you can also analyze the data flow of the distributed pipelines. Management Center provides visual tools to inspect running jobs and detect potential bottlenecks.
Crucially, developers can observe clusters in real-time and gain far more insight into what is occurring “under the hood”.
Hazelcast is elastic: it can dynamically rescale to adapt to workload changes.
When the cluster grows or shrinks, running jobs can be automatically replanned to make use of all available resources.
Lossless Cluster Restart
Hazelcast uses persistence to back up its state snapshots regularly. Jobs, job state, and job configuration are configured to be persistent with the Hazelcast Persistence capability. Computations are restarted from where they left off once the cluster is back online.
Allows jobs to be upgraded without data loss or interruption, using state snapshots to switch to the new job version in milliseconds.
SSL/TLS 1.2 Asymmetric Encryption
Provides encryption based on TLS Certificates between members, between clients and members, and between members and Management Center.
SSL/TLS 1.2 Asymmetric Encryption with OpenSSL
Uses the SSLEngine built into the JDK, with some performance enhancements.
Connectors are used to connect the Hazelcast job with data sources and sinks. Secure connections to external systems combined with security within the Hazelcast cluster make the data pipeline secure end-to-end.
The following connectors have security features: Hazelcast in-memory data store, Kafka, JDBC, JMS.
Allowed Connection IP Ranges
The authentication mechanism for Hazelcast client security works the same as cluster member authentication. To implement client authentication, configure a Credential and one or more LoginModules. The client side does not have and does not need a factory object to create Credentials objects like ICredentialsFactory. Credentials must be created at the client side and sent to the connected node during the connection process.
Hazelcast client authorization is configured by a client permission policy. Hazelcast has a default permission policy implementation that uses permission configurations defined in the Hazelcast security configuration. Default policy permission checks are done against instance types (map, queue, etc.), instance names (map, queue, name, etc.), instance actions (put, read, remove, add, etc.), client endpoint addresses, and the client principal defined by the Credentials object. Instance names, principal names, and endpoint addresses can be defined with wildcards (*).
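A permission check of this kind can be sketched in plain Java. The example assumes a simple rule made of an instance type, a name pattern, a principal pattern, and a set of allowed actions, with * matching any run of characters. The class PermissionSketch and its exact matching rules are illustrative assumptions and do not reproduce Hazelcast's policy implementation.

```java
import java.util.Set;
import java.util.regex.Pattern;

/** Conceptual sketch of a wildcard-based permission rule. */
final class PermissionSketch {
    private final String instanceType;
    private final String namePattern;
    private final String principalPattern;
    private final Set<String> actions;

    PermissionSketch(String instanceType, String namePattern,
                     String principalPattern, Set<String> actions) {
        this.instanceType = instanceType;
        this.namePattern = namePattern;
        this.principalPattern = principalPattern;
        this.actions = actions;
    }

    /** Matches a value against a pattern in which '*' stands for any run of characters. */
    static boolean wildcardMatch(String pattern, String value) {
        String[] literals = pattern.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i])); // everything except '*' is literal
        }
        return Pattern.matches(regex.toString(), value);
    }

    /** Returns true if this rule permits the action on the named instance for the principal. */
    boolean allows(String type, String name, String action, String principal) {
        return instanceType.equals(type)
                && wildcardMatch(namePattern, name)
                && wildcardMatch(principalPattern, principal)
                && actions.contains(action);
    }
}
```

For example, a rule for type "map", name pattern "orders-*", principal pattern "*" and actions {read, put} would allow any client to read the map "orders-eu" but not to remove entries from it.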
In symmetric encryption, each node uses the same key, so the key is shared.
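The shared-key property can be demonstrated with the JDK's standard cryptography API. The sketch below uses AES in GCM mode, which is chosen here purely for illustration and is not necessarily the algorithm a Hazelcast cluster is configured with. The class name SymmetricSketch and the packet layout (IV followed by ciphertext) are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Conceptual sketch: two nodes encrypt and decrypt with the same shared key. */
final class SymmetricSketch {
    private static final int IV_LENGTH = 12;   // bytes, the usual IV size for GCM
    private static final int TAG_BITS = 128;   // authentication tag length

    /** Every node derives the same key object from the same shared secret bytes (16 bytes here). */
    static SecretKey sharedKey(byte[] secret) {
        return new SecretKeySpec(secret, "AES");
    }

    /** Encrypts the message; the result is a fresh random IV followed by the ciphertext. */
    static byte[] encrypt(SecretKey key, String message) {
        try {
            byte[] iv = new byte[IV_LENGTH];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ciphertext = cipher.doFinal(message.getBytes(StandardCharsets.UTF_8));
            byte[] packet = new byte[iv.length + ciphertext.length];
            System.arraycopy(iv, 0, packet, 0, iv.length);
            System.arraycopy(ciphertext, 0, packet, iv.length, ciphertext.length);
            return packet;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Decrypts a packet produced by encrypt(); fails if the key differs or the data was altered. */
    static String decrypt(SecretKey key, byte[] packet) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, packet, 0, IV_LENGTH));
            byte[] plain = cipher.doFinal(packet, IV_LENGTH, packet.length - IV_LENGTH);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Since the same secret both encrypts and decrypts, anyone who obtains the key can read the traffic, which is why the shared key has to be distributed to every node securely.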
Hazelcast has an extensible, JAAS-based security feature you can use to authenticate both cluster members and clients, and to perform access control checks on client operations. Access control can be done according to endpoint principal and/or endpoint address.
Pluggable Socket Interceptor
Hazelcast allows you to intercept socket connections before a node joins a cluster or a client connects to a node. This provides the ability to add custom hooks to the join and connection procedures (such as identity checking using Kerberos).
Hazelcast allows you to intercept every remote operation executed by the client. This lets you add flexible custom security logic.
CLOUD AND VIRTUALIZATION SUPPORT
Hazelcast can be extended by cloud plugins, allowing applications to be deployed in different cloud infrastructure ecosystems.
See Hazelcast Cloud Plugins: Amazon Web Services, Microsoft Azure, Docker, Apache JClouds, Consul Discovery, Apache ZooKeeper Discovery
OpenShift Container Platform
Hazelcast Docker image is an extension of the official Hazelcast Docker image with a Kubernetes discovery plugin which enables deployment of Hazelcast on your OpenShift platform as a data processing service.