Hazelcast is primarily a memory store, and memory is widely understood to be volatile. If the power goes off, data in memory is lost. Disk is viewed as the safe option; the data will still be there after a power cut. We shall look here at what “data safety” means, and why both statements are flawed.

As we’ll see below, disk is not as safe as you think, and memory can be safer than expected. Once you understand this, you should be confident to design a safe system, with knowledge of the errors that can be recovered from and those that can’t.

Data safety

The first thing to define is “data safety”, meaning data that won’t be accidentally lost. It is retained until you explicitly delete it, or until system housekeeping is allowed to delete it.

This doesn’t need to apply to all data.

  • If you “lose” a bank statement, you can regenerate it easily if you have the transactions.
  • If you “lose” the current price of Bitcoin, it changes so rapidly there will be a new current price from the outside world almost immediately.
  • If you “lose” a user’s session, they can just log in again.
  • If you “lose” reference data, you can reload it from the source.

The above are cases where you could tolerate losing data but might prefer not to. In general, most data doesn’t come with such escape clauses. You want to keep it, as recreating it may not be possible or may take too long. Data safety for data in memory, on disk, on tape, etc., means data that is highly unlikely to be accidentally lost.

“Highly unlikely” is good enough

Imagine an offline backup in a fireproof safe. This data backup is reasonably protected, but it’s not perfect. For a start, a fireproof safe is not fireproof. It is “fireproof for a while” – perhaps an hour. If the fire isn’t extinguished within the hour, the contents are damaged. It would be reasonable to assume the emergency services can get to your datacenter within an hour, but perhaps they run into delays. If the fire is large and affects residential buildings, they’ll get priority over a datacenter.

Our data is safe from the unlikely small fire, but not from the very unlikely big fire. If we decide this isn’t good enough, we put another offline backup in another fireproof safe somewhere else. Now our data is safe from the very unlikely big fire in one location and simultaneously another very unlikely big fire in the other location. Duplication lowers the odds of loss, but can never eliminate all combinations, however unlikely.

One copy is good. Two is better. Three is better still. Four is better than that. But we can never get to 100%. If our statistical analysis is correct, the series of random, unconnected events required for total loss is so unlikely that we can consider the chance of it to be zero: we will always have at least one copy to fall back upon. This doesn’t cover planned attacks, or the possibility that our analysis of the odds is wrong!
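To make the “highly unlikely” intuition concrete, here is an illustrative calculation (not Hazelcast code): if each copy can be lost independently with some small probability, the chance of losing every copy shrinks multiplicatively with each copy added.

```python
def probability_all_copies_lost(p_single_loss: float, copies: int) -> float:
    """Chance that every one of `copies` copies is lost at once.

    Assumes losses are truly independent, which is the whole point of
    placing copies in separate locations.
    """
    return p_single_loss ** copies

# With a 1% chance of losing any single copy:
# 1 copy -> 0.01, 2 copies -> 0.0001, 3 copies -> 0.000001
```

Each extra copy buys orders of magnitude, which is why “highly unlikely” is achievable even though “impossible” is not.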

Fire is just an example. Other disasters are available.

Don’t forget the cost

In the section above, three copies are better than two from a data safety perspective. You are less likely to lose three than two. However, in other ways three copies are worse than two. Firstly, with three copies there is 50% more infrastructure, so financial costs may be 50% higher. Secondly, with three copies there is a higher performance cost to writing than with two. Writing takes longer, though that may not be significant for applications with a low proportion of writes compared to reads.

Improving data safety makes other things worse.

Data safety recovery

Based on the above, data safety is simple. Always keep enough copies of your data that you’re not worried about losing them all. Pay attention to “always keep”.

Imagine you decided 2 copies provide data safety. Some catastrophic event may cause a copy to be lost. Data hasn’t been lost, you have another copy. But data safety has been lost: you now have fewer copies remaining (1) than you determined necessary (2). So to recover data safety, you need to duplicate the remaining copy to a third location. If it was a fire that destroyed the first location, that location can’t be used. Having two copies in the second location provides no protection from a fire there.

Disk safety

Disks are fairly permanent stores, based on magnetic, optical, or similar technology. Old data may degrade and become unreadable after some years, but broadly we can think of disk as permanent. What we write to the disk we can read back, even if the power has gone off in between.

The fallacy of a disk write

Unfortunately, when our software does a write to a disk file, it often doesn’t directly go to the disk. Instead, it goes to the disk controller, an intermediate module that collates writes in a buffer and flushes them in a block to improve performance.

Our software may make 5 operating system calls to write lines to a file. But behind the scenes, the disk controller may have written the first 4 in a block and be buffering the last, to write momentarily. If there is a crash at that point, only the 4 writes that have actually gone to disk are safe; the last is lost.

Naturally we can turn this buffering off. Safety is improved, but still not perfect if the disk catches fire. And now performance is much worse. All we have done is swap one problem for another, nothing to be proud of.
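To illustrate the trade-off, here is a minimal Python sketch of pushing each write through the buffers to the storage device with `os.fsync`. Every flushed-and-synced line costs a round trip to the device, which is exactly where the performance penalty comes from.

```python
import os

def write_durably(path: str, lines: list[str]) -> None:
    """Write lines so each reaches the storage device before the next
    is attempted, trading throughput for durability."""
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")
            f.flush()              # push Python's user-space buffer to the OS
            os.fsync(f.fileno())   # ask the OS to force the write to the device
```

Even this cannot defend against the disk itself being destroyed; it only narrows the window in which a crash loses buffered writes.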

How many copies?

Single and double copies are common configurations.

1 copy

“Local” disks are often used in the 1 copy configuration. There is 1 disk, and 1 copy of each data record. That disk may be physically inside the host machine. Its adjacency helps with data transfer time; it is a short cable. If a process crashes while writing 5 lines to a file on that disk, perhaps 4 or 5 lines were actually written. When we examine the file afterwards, we might see 4 lines or we might see 5. If the host machine catches fire, the disk may be destroyed. Our only copy of the data, the entire file, is lost.

2 copies

Another configuration is a disk array, such as RAID. Typically 2 disks act as one. At least one will be distant, on a longer cable or across a network. When we write a line of 80 bytes, the same line is written to both disks. The file size might report as 80 bytes, but it needs 160 bytes. If a process crashes while writing 5 lines to a file, perhaps 4 lines had been written to one disk and 5 to the other.

If one disk catches fire, we still have the other, we have one copy of that file.

Recovery from a disk crash

In the 2 copy scenario, a different number of lines may have been written to the 2 disks that pretend to be 1. One has had 5 lines written, the other 4. It’s pretty easy to reconcile: add the missing line and both copies are aligned.
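A hedged sketch of that reconciliation, assuming the only possible divergence is that one copy is missing the final writes (real RAID controllers work at the block level rather than the line level):

```python
def reconcile(copy_a: list[str], copy_b: list[str]) -> list[str]:
    """Align two mirrored copies where one may be missing trailing writes."""
    shorter, longer = sorted((copy_a, copy_b), key=len)
    # Sanity check: the shorter copy must be a prefix of the longer one.
    if longer[:len(shorter)] != shorter:
        raise ValueError("copies diverge mid-file; cannot reconcile safely")
    return longer  # the missing tail is replayed onto the shorter copy
```

The prefix check matters: if the two copies disagree anywhere before the tail, something worse than an interrupted write has happened, and blind copying would propagate corruption.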

Disk safety recap

A process that crashes while writing to disk may not have saved all content to all disk copies. This may be recoverable depending on the number of disk copies. A disk that crashes may lose all data on it. This may be recoverable depending on the number of disk copies. Disks can therefore lose data. One alone does not provide data safety.

Memory safety

If you’ve followed the above, you will have realized that memory safety and disk safety are essentially the same problem.

Memory and disk fail in different ways, but the solution (multiple copies!) is the same.

Differences in failures

Memory and disk are exposed to overlapping failure scenarios. Fire would affect both. A power loss has different degrees of severity: for disk, the most recent content is lost; for memory, everything is lost. Memory is impacted more, but disk is still impacted, so the problem needs to be solved for both.

Memory safety recap

Memory is made safe by duplication. Data in the memory of one process is duplicated in the memory of other processes. The more independent these processes are, the higher the safety. So each host machine might be in a different location, use a different power supply, etc. For Hazelcast, it’s as simple as specifying a backup-count parameter. This defaults to 1, so all data has one backup.
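In Hazelcast’s declarative configuration this looks roughly like the fragment below (a sketch: the map name is illustrative, and backup-count already defaults to 1, so this setting is usually implicit):

```xml
<hazelcast>
    <map name="default">
        <!-- one synchronous backup of each partition, held on a different member -->
        <backup-count>1</backup-count>
        <!-- optional additional asynchronous backups -->
        <async-backup-count>0</async-backup-count>
    </map>
</hazelcast>
```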

Hybrid solutions

It’s worth noting at this point that safety is achieved with backups. A memory copy may have a disk backup. If these are in different places, safety is higher.


Duplication of data in separate locations provides data safety. If you anticipate that X copies can be lost at once, you need to have (X + 1) copies in (2 * X + 1) locations. If you anticipate that 2 copies can be lost at once, you need to have 3 copies in 5 locations. If 2 locations are lost, the 1 remaining copy can be used to recreate the lost 2 copies in the 2 remaining locations.
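The arithmetic above can be captured in a couple of lines, following the rule stated in this article (illustrative only, not a Hazelcast API):

```python
def copies_needed(simultaneous_losses: int) -> int:
    """Copies required to survive X copies being lost at once."""
    return simultaneous_losses + 1

def locations_needed(simultaneous_losses: int) -> int:
    """Locations required so survivors can re-duplicate into fresh
    locations after X locations are destroyed."""
    return 2 * simultaneous_losses + 1

# X = 2 -> 3 copies in 5 locations, matching the example above
```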

The above is true of disk, the above is true of memory. Both are intrinsically unsafe, though this is less obvious for disk. Duplication makes either safer, until we are safe enough.

Increasing the duplication level helps with data safety and hurts in other ways.

Over the years, Hazelcast has made some significant changes to its technology, evolving from its in-memory roots to a broader real-time data processing platform that now plays a bigger role in today’s IT infrastructures. Due to added capabilities in the past several versions such as tiered storage and SQL, Hazelcast is looking more like a database management system. But don’t view it as just another database, because like many special-purpose databases today, there is significant value in using a database that was architected for your specific use cases.

We’ve all learned over the years that there is no such thing as a general-purpose database. You might have tried to use some databases in a general-purpose capacity, and depending on the workload, it might have been a short-lived effort. For example, you wouldn’t use your relational database for storing event data, or your data warehouse for OLTP workloads, as you want to use the right database capabilities to solve the technical challenges you face.

An Admittedly Oversimplified View of Databases

Your database is particularly good for two data-related functions – querying and storage. Put more simply, your database is great at reading and writing data. You write applications on your database so that you don’t have to worry about the intricacies of optimizing I/O workloads on your computer, especially when non-functional requirements like performance, scale, consistency, reliability, and security are critical.

If you tell your database to save data, it figures out the best way to write the data to storage media so that you can later query it very quickly. And if you submit a query to your database, it gives you the answer without you knowing how it found the answer. This is the “declarative” aspect of SQL, where you simply tell the database what to do without describing how to do it. And that is your database in a nutshell.

Database Reads and Writes

Databases are reading and writing machines.

You might assert that databases are good for transforming data, and I would only partially agree considering that data transformations, including those that are done with SQL, are typically a “business logic” effort. In other words, it’s a human engineer who is good with transforming data, not the database itself. The database is merely a vehicle for transforming data, and the real workload that your database handles is the reading and writing.

You might also assert that databases (specifically RDBMSs) are powerful because of features like triggers and stored procedures. But again, these are really just transformations per above, so they’re about reading and writing data. You don’t actually build applications with these features. You use these features to make sure your data is stored in the way you want, thus preparing them for use by your applications.

Of course, I’m not saying that your databases are very limited. The complex work that runs behind the scenes in your database is extraordinary. I’m just saying that there are other capabilities you have to consider when trying to build an innovative infrastructure that drives the competitive advantage that your business seeks.

The Differences Enable New Types of Applications You Can Build

The two main components in the Hazelcast Platform that make it different from your databases are the stream processing engine and the distributed processing core. While these two components are related, they offer distinct advantages in the Hazelcast Platform.

The stream processing engine is useful for performing work on incoming streams of data, especially those stored in message buses like Apache Kafka and Apache Pulsar. This engine was built to run in business-critical production environments that require high levels of reliability (yes, we have features to support exactly-once processing) and extremely low latency, even at scale. Our documented benchmark shows that we can analyze one billion events per second with millisecond latency using a publicly available benchmark suite (NEXMark, created at Portland State University). This shows that we put a lot of effort into the efficiency and linear scalability of our engine to handle even the most demanding requirements.

The distributed processing core in Hazelcast makes it easy for you to write applications that take advantage of the collective CPU power of the hardware servers in your Hazelcast cluster. Other database clusters are dedicated to serving up the data, and applications are generally run on other hardware servers. With Hazelcast, your application code is submitted as a job to run near the data (“data locality”) while also automatically handling the parallelizable aspects of your applications to get the most performance out of your code. This greatly simplifies the effort in writing high-performance, distributed applications since you don’t have to worry about coordinating processes across servers. Hazelcast handles that complexity for you.

Distributed Stream Processing

Hazelcast stream processing in a distributed architecture.

Where do the Hazelcast database-like capabilities fit in, you ask? In any real-time processing environment, you need to have large volumes of data available for data enrichment and other quick lookups, to create the proper context for your data streams. By embedding database functionality into the Hazelcast Platform, you remove the complexity of integrating separate, complex technology clusters. You get distributed processing and data storage in one cluster, with no need to bolt separate pieces together. Hazelcast lets you store growing volumes of historical data while also allowing you to query it, all within the framework of a real-time system.


So now you can conclude that you should not view the Hazelcast Platform as merely a new type of database. Instead, you should explore how you can take advantage of the data streams you are storing in Kafka/Pulsar (or in any of the many messaging buses available today) and leverage Hazelcast to build real-time applications that take action on data immediately. Contact us for a quick chat to see how we might be able to help.


There are many data processing solutions out there, so how do you choose the right one for your organization? In this article, we’ll explain how Hazelcast Enterprise works and the top five reasons for implementing our platform – and the risks if you don’t.

Before we dive into the platform itself, let’s put things into context with a few big trends shaping the market. Perhaps the biggest is digital transformation, a descriptor for the collection of activities that represent big changes in how businesses pursue their data strategies, to enable them to use data more effectively.

As more organizations explore the notion of digital transformation, they are increasingly turning to the public cloud to boost agility and increase automation via artificial intelligence and machine learning. We’re seeing application architectures evolve as customers utilize microservices as a more agile and simpler way to deploy business systems. And underpinning all these digital initiatives is data, or rather the need to harness these growing volumes of data and support data-driven success.

Typically, data is processed in batches based on a schedule or a predefined threshold – but with digitization and ballooning data volumes, this approach is falling short. Enter stream processing, a practice that takes action on a series of data as it’s created, allowing applications to respond to data events at the precise moment they occur. These streams-based systems augment digital initiatives across all industries.

In a technology context, you might think of performance as synonymous with speed. But it’s more than that. After all, you can’t compromise on other dimensions for the sake of speed. A fast system with a high risk of downtime is not doing anyone any favors. That’s why you must consider performance a combination of low latency, scalability, availability, reliability, and security. From a customer perspective, this means:

  • Responsiveness – the ability of the system to take action quickly
  • Dependability – the ability of a system to keep running and take corrective actions, and
  • Privacy – the ability to allow only authorized users to access data

Hazelcast Enterprise Capabilities

All of the demands above tie into performance, and this is where the Hazelcast Platform – Enterprise Edition truly delivers:

  1. Business continuity for availability and reliability
  2. Security with reduced effort to protect your data
  3. High-density memory to handle scale
  4. Lower maintenance downtime for availability
  5. The Hazelcast Management Center

Business Continuity Capabilities 

Our business continuity capabilities start with our WAN Replication feature. It copies your data – or a subset of it – in an efficient and incremental way to a separate remote cluster for the purposes of disaster recovery, or even for geographic distribution to enable lower latency.

Hazelcast Enterprise supports any replication topology. A common topology is one-way replication, or active/passive replication, to enable a traditional disaster recovery strategy. Or you can have two-way replication, known as active/active replication, to allow sharing of the same data across geographically dispersed user groups while also supporting disaster recovery needs. As part of this replication capability, the system only sends data that has changed. This ensures minimal bandwidth usage for replication, while keeping a low recovery point objective (RPO), i.e., the estimated amount of data you could potentially lose should a site-wide disaster occur on your primary cluster. Of course, you don’t want to lose any data, so you want the RPO to be as small as possible, but at the same time, you don’t want to spend a ton of money to create that guarantee. Achieving a low RPO is difficult to do on your own, so Hazelcast has provisions to reduce the risk of data loss as much as possible, cost-effectively.

The other parameter of disaster recovery is the recovery time objective (RTO); the amount of time it takes to get users back up and running. Since you have a complete copy of your primary cluster from replication – perhaps minus the last set of updates – the problem of having to recover the data is solved.

But the main issue is to get applications to quickly and automatically connect to the backup cluster, and that’s where our automatic disaster recovery failover mechanism fits in. Applications built on Hazelcast client libraries can recognize that the primary cluster is down and will automatically re-route connections to the specified backup cluster, or clusters, to ensure minimal downtime.

Security Capabilities 

Hazelcast provides a comprehensive list of security capabilities grouped by data privacy, authentication, and authorization.

In many cases, security impacts other system characteristics, such as performance. When security is applied, things tend to slow down due to the extra processing. This is a necessary trade-off you make as part of a security strategy, and the expectation of it is one reason why IT professionals defer including security controls until later in a project lifecycle. But with Hazelcast Enterprise, you get minimal performance disruption while gaining the necessary data protection from day one.

When building security into Hazelcast Enterprise, we ensured that our over-the-wire encryption had minimal impact on data transmission. Below are some benchmark results. The red line is the operations per second of Hazelcast without any over-the-wire encryption. The green line represents Hazelcast with OpenSSL for over-the-wire encryption: operations per second are only slightly slower than the unencrypted line, roughly in the range of 1.2 million operations per second in this benchmark. Then there is the comparison against the Java SSL engine, the blue line. Using the Java SSL engine for encryption incurs a fairly significant degradation in performance, whereas using OpenSSL with Hazelcast you do not incur that much degradation.

OpenSSL Performance

Minimal performance impact using Hazelcast Enterprise with OpenSSL (green line)
versus no encryption (red line) and Java SSL (blue line).

For data privacy, Hazelcast Enterprise allows customers to plug in their own symmetric encryption. This means you can use symmetric encryption to encrypt your data between nodes and clients as an alternative to TLS. If there are situations where you need an extra level of security for transmitting data, this option will work for you.

Another important capability when authenticating a client with a cluster is the ability to support mutual authentication where each side has a certificate to verify who they are. Not only does a cluster know precisely who’s connecting, but the client can be assured of who they’re connecting with – reducing the risk of man-in-the-middle attacks.

In the realm of authentication, we also offer a capability known as the Socket Interceptor to provide an additional level of protection. This API allows you to write custom authentication mechanisms to prevent rogue processes from joining the cluster. We also support role-based access controls on the various data types within your system, giving you protection for the different kinds of sensitive data you might have. And, of course, this ties in with your existing security infrastructure such as LDAP or Active Directory.

The Hazelcast Security Interceptor capability is similar to the Socket Interceptor capability, but this one sits on the authorization side to protect your data from rogue clients. By plugging in custom code, you can intercept every remote operation executed by the client.

High-Density Memory Capabilities 

You might know that garbage collection in Java systems can severely impact the performance and operation of the system, and you can’t practically allocate more than a few gigabytes of memory from the heap. In contrast, Hazelcast Enterprise uses an off-heap memory manager built to handle much larger blocks of RAM. This allows you to efficiently store more data in your computer’s memory, up to 200 GB per Hazelcast node, while eliminating the many garbage collection pauses you would otherwise face when allocating large blocks of memory from the Java garbage collector.

The key advantage here is that you can simplify your deployment by relying on fewer Hazelcast nodes. Instead of running hundreds of Hazelcast nodes on as many hardware servers, you can instead deploy tens of nodes on a much smaller cluster while getting a predictable level of performance that is not impacted by garbage collection.

An upcoming scalability feature you should look for is the new tiered storage capability, currently in beta. In its current form, it leverages a hybrid storage mechanism using both RAM and disk to scale more cost-effectively, while also addressing low-latency data access requirements. Future versions will integrate with other long-term storage media to enable further cost-effective scalability.

Lower Maintenance Downtime Capabilities 

Every system needs some downtime for maintenance – or planned downtime – and some of these features will help you further reduce the time necessary for maintenance. One capability in this category is the Persistence feature. This capability writes your in-memory data to disk asynchronously. Should your servers shut down, the Persistence feature saves your data so you can quickly repopulate your clusters in memory storage when the servers are restarted. This enables fast recovery should an entire cluster go down either because of a large-scale failure or, more likely, because of some planned maintenance. It’s one thing to have downtime because of the actual maintenance work, but having to wait for the system to get back to the original state prior to the maintenance work represents even more downtime.

Rather than repopulating the data into memory from the original systems of record, you can read that data directly from the Persistence store on disk. Data is reloaded in seconds too. For example, take a 400 GB data set. If restored from the original data sources, it could take minutes, or even hours, versus the 82 seconds it takes for the Persistence feature to restore the system.

Blue-green deployments offer another way to ensure uptime despite the downtime requirements for cluster maintenance. In this strategy, your production cluster is called the blue cluster and a standby cluster is called the green cluster. The green cluster is completely standby and otherwise an exact copy of the blue cluster, so you can upgrade it without client interaction and therefore without risk to any users. When you are ready to migrate, you can specify within the Hazelcast Management Center that some or all of the clients redirect to the green cluster. This turns the previously green cluster into the new blue production cluster, and the previous blue cluster becomes the green cluster. You are essentially swapping roles through the administrative interface without having to reconfigure each individual application.

At some point in the future, you can start applying upgrades to that new green cluster which doesn’t have to be restricted to only Hazelcast software upgrades. You could upgrade your operating system, hardware, and any third-party software on a cluster. Once upgraded, you can direct clients to that now ready-to-go and tested environment. If any problems arise, the blue-green deployment ensures that you have a fallback in case any of the upgrades you made cause a problem.

Job Upgrade is a feature specifically pertaining to our stream processing engine, which allows jobs to be upgraded with a new version without data loss. This is especially good for swapping out improved machine learning models in a production environment without incurring any downtime on your continuous feed of streaming data.

When you upgrade a job, the system takes a snapshot of the current state. The old job stops, and the new job starts with one command. Then the remaining data feed is sent to the new job for processing with a starting point at that snapshot, ensuring you don’t miss any of the data in that live stream.

The Hazelcast Management Center

The Management Center interface allows you to:

  • Monitor and manage your cluster members running Hazelcast, including monitoring the overall state of clusters
  • Analyze and browse data structures in real time
  • Update map configurations, and
  • Take thread dumps from nodes

The Management Center also supports integration with the monitoring tool Prometheus. If you currently use Prometheus as part of your enterprise-wide monitoring system, integrating with Hazelcast Enterprise delivers a consolidated interface for monitoring all your production systems.


The best way to learn more about the Hazelcast Enterprise features is to try them out yourself. You can get a 30-day free trial with no obligation from our Get Hazelcast page. See how you can simplify your Hazelcast deployment by leveraging the powerful features in Hazelcast Enterprise.


In this blog post, we are happy to work with our community member, Vishesh Ruparelia, a software engineer at one of the leading companies in the observability space. He completed his postgraduate degree in computer science at the International Institute of Information Technology, Bangalore, India, and is intrigued by the technical nitty-gritty of designing any distributed platform. Additionally, he is fascinated by the impact of minor technical design decisions on a platform’s ability to scale. Having learned much from the open-source community, Vishesh is always looking for opportunities to give back.


This article aims to help Hazelcast users to bootstrap a Hazelcast cluster in a Kubernetes-enabled environment. At the end of this article, you will be able to deploy and manage Hazelcast clusters, write stateless streaming jobs (using Hazelcast’s stream processing engine), and submit and manage them in the cluster. You will also get an idea of how to integrate your Hazelcast pipeline with Kafka. As a reference, we will submit a simple job that, given a sentence as input, emits the number of words in that sentence.

We look forward to your feedback and comments about this blog post. Don’t hesitate to share your experience with us in our community Slack or GitHub repository.


Before we dig deep, the article assumes that the reader has the following set of prerequisites covered:

Setting up the environment

Deploying Hazelcast cluster

We will be using the Helm package manager to deploy the Hazelcast cluster. Helm is an open-source package manager for Kubernetes and provides the ability to deliver, share, and use software built for Kubernetes [1]. 

The helm chart we will use is a part of the reference code and is located at /helm in the root of the repository. The chart is based on the official helm chart provided by Hazelcast here: https://github.com/hazelcast/charts; the only change is that we have a name override in place.

Run the following commands to deploy the cluster successfully: 

  1. cd into the root of the repository.
  2. Run:

helm install word-count -f ./helm/charts/word-count/values.yaml ./helm/charts/word-count

When you run:
kubectl get pods
You should see the following output:

Deploying Kafka

We will be using Kafka to feed data to the Hazelcast cluster. If you already have Kafka brokers and Zookeeper running in the environment, you can skip this step.

To set up Kafka in the environment, we will use the Bitnami-managed Kafka Docker images. Run the following commands to install:

  1. cd into the root of the repository.
  2. Run:
    helm install kafka -f kafka.yaml bitnami/kafka
    (kafka.yaml contains the overrides that the docker image will use.)

By default, when you run:
kubectl get pods
You should see the following output:

Deploying Data Producer

We need to deploy a Kafka producer that pushes data to the Kafka topic so that Hazelcast can consume it. The reference repository contains the data producer code and the Dockerfile, which can be used to deploy the producer in our environment.

Run the following commands to deploy:

  1. cd into the root of the repository.
  2. Run:
    kubectl apply -f sentence-producer.yaml

By default, when you run:
kubectl get pods

Running a job

We will use Hazelcast as our streaming engine and submit a job to it. Hazelcast is a distributed batch and stream processing system that can do stateless/stateful computations over massive amounts of data with consistent low latency. 

Writing a job

Hazelcast allows us to submit jobs by creating an object of type Pipeline (we can also create a DAG object directly, but that is out of scope here; to read more, visit this link). A pipeline contains the logic of how to read, process, and emit data. The general shape of any pipeline is readFromSource -> transform -> writeToSink, and the natural way to build it is from source to sink. In each step, such as readFrom or writeTo, you create a pipeline stage. The stage resulting from a writeTo operation is called a sink stage, and you can't attach more stages to it. All others are called compute stages and expect you to connect further stages to them. Detailed documentation is available here.

Hazelcast ships with many connectors out of the box for different data sources and sinks; the extensive list can be found here. On top of this, Hazelcast also enables users to write custom connectors if the required source/sink is not supported out of the box; more details here. This article will use Kafka as the source and a Logger as the sink.

You can find the code that creates this Pipeline in the reference repository. 
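The reference repository is the source of truth for the pipeline code; as a self-contained illustration of the transform it performs, counting the words of each incoming sentence might look like this (the class and method names here are ours, purely hypothetical):

```java
import java.util.Arrays;

public class WordCount {

    // Counts whitespace-separated words; null/blank input yields 0.
    // In the reference pipeline this logic would run inside a map stage.
    static long countWords(String sentence) {
        if (sentence == null || sentence.isBlank()) {
            return 0;
        }
        return Arrays.stream(sentence.trim().split("\\s+")).count();
    }

    public static void main(String[] args) {
        System.out.println(countWords("hello streaming world")); // prints 3
    }
}
```

In the actual job, a function like this sits between the Kafka source stage and the Logger sink stage.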

Submitting a Job

The pipeline, when running inside a cluster, is called a Job. There are two ways to submit a Pipeline to the cluster: using the CLI and programmatically.

Using the CLI

We can use the Hazelcast binaries to submit the job via the CLI. The CLI accepts a JAR as an argument; this JAR contains the code that builds and submits the pipeline. The pipeline may have dependencies that are not present in the cluster, so the JAR should also bundle the dependent libraries without which the Job cannot run; these classes will then be made available on the cluster's classpath. Basically, an uber JAR.

I used this plugin in the reference code to create a shadowed JAR. The shadowed JAR is created as part of the build step itself and is stored at app/build/libs/app-all.jar. Follow these steps to submit the pipeline:

  1. cd into the root of the repository.
  2. Run: ./gradlew clean build -x test
    This creates the shadowed Jar.
  3. cd into the Hazelcast folder (which contains the binaries) downloaded as a part of the prerequisites. 

Paste the following config into the config folder inside the Hazelcast folder and name the file hz.yaml:

hazelcast-client:
  cluster-name: word-count
  network:
    smart-routing: false
  connection-strategy:
    connection-retry:
      # how long the client should keep trying to connect to the cluster
      cluster-connect-timeout-millis: 1000

This file contains the cluster connection information that the binaries will use to connect to the cluster. A list of all possible configuration options can be viewed here.

  4. Run: kubectl port-forward svc/word-count 5701:5701
  5. Run: kubectl port-forward svc/word-count-mancenter 8080
  6. Submit the job:


bin/hz-cli --config config/hz.yaml submit --class=com. PATH_TO_REFERENCE_REPO/app/build/libs/app-all.jar

The --class flag indicates the entry-point class where the job-submission logic resides.
The --config flag indicates the config file used to establish a connection with the cluster.

For clusters of size <= 3, you can view the job execution details at localhost:8080.


Programmatically

Instead of submitting a JAR, we can submit the job to the cluster directly using Java code. The sample code which achieves this can be found here.

Run the following commands:

  1. kubectl port-forward svc/word-count 5701:5701 
  2. kubectl port-forward svc/word-count-mancenter 8080 
  3. ./gradlew clean build -x test
  4. java -cp PATH_TO_REFERENCE_REPO/app/build/libs/app-all.jar hazelcast.word.count.App


Once the job is running in the cluster, one way to confirm is by looking at the cluster logs. Run kubectl logs wordcount-0 -f to get the log output. As we saw earlier, the pipeline writes its output to a LoggerSink, so you should see the word counts in the log.


Viewing job metrics

Using MC

Now that the job is running, we can use Hazelcast Management Center to get an insight into the cluster. Hazelcast Management Center enables the monitoring and management of nodes running Hazelcast. This includes monitoring the overall state of clusters, detailed analysis and browsing of data structures in real time, updating map configurations, and taking thread dumps from nodes [2].

By default, the Hazelcast helm chart also deploys Management Center. We need to port forward the management center’s service to access the UI. Run:

kubectl port-forward svc/word-count-mancenter 8080

Among other things, you can head to the Jobs section to view details of the jobs that ran in the cluster.



Programmatically

Hazelcast also lets us view metrics programmatically using the getMetrics() API available in the Job class. Refer to this to get an idea of how to use it.

For more details on this, refer to the Hazelcast documentation here.

Tuning the cluster

We might have to play around with the cluster and job configuration to achieve the optimum throughput. Depending on the environment you are in, different parameters can help improve your setup. That being said, the Hazelcast team recommends that we only touch the configurations we are sure of. This section will go through some of the configurations likely to help in scaling your cluster.

Vertical scaling

We can always vertically scale our cluster by giving it more resources. Note that since this is in the context of a K8s environment, you can use the helm chart to allocate the right amount of CPU and RAM to a node in the cluster. Use this section of the helm chart to tune the values. 

The number of CPUs you give to a node directly relates to the number of cooperative threads and the local parallelism set on a vertex; more on these terms later.

More RAM allows for larger queue sizes.

Horizontal scaling

To scale the Hazelcast cluster horizontally, we can scale the Statefulset generated as a part of the helm installation to the number of replicas we want. Hazelcast handles the re-partitioning of the job automatically and transparently. More details can be found here.

Run the following command to scale your cluster:

kubectl scale sts word-count --replicas=NUMBER_OF_REPLICAS



Cooperative threads

An important configuration that helps in scaling is the number of cooperative threads per node. This defaults to the number of CPUs allocated to a node. In most scenarios, the default is good enough. In any case, you can tune this number to your requirements here. This document talks in detail about the cooperative multi-threading model which Hazelcast uses. 
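As a sketch, the member-side setting lives under the hazelcast.jet section of the YAML config (the value 8 below is just an example; tune it to your CPU allocation):

```yaml
hazelcast:
  jet:
    enabled: true
    # defaults to the number of CPUs available to the member
    cooperative-thread-count: 8
```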

Vertex parallelism

Another configuration that goes hand in hand with Cooperative threads is Vertex parallelism. Each vertex/stage in the Hazelcast pipeline can have specific parallelism defined—an example of how to tune this can be found here. You can refer to the docs here for more information about this. Please note that Hazelcast doesn’t create new threads for the parallelism value. Instead, it uses cooperative threads to reduce context switching. For vertices that are non-cooperative, a new thread will be created. 

Other parameters

We can configure the queue size if we see high GC or high memory usage. Queue size can be set like this. Queues act as a buffer between the vertices of a pipeline. Information on types of edges can be found here.
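For reference, the default queue size for edges can be set in the member config; this is a sketch using the documented edge-defaults section (1024 is just an example value):

```yaml
hazelcast:
  jet:
    enabled: true
    edge-defaults:
      # number of items an inter-vertex queue buffers; smaller values
      # reduce memory pressure at the cost of throughput
      queue-size: 1024
```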

Latency can be affected depending on the type of processing guarantee we have set. It can be set as a part of the Job config (for example, here). Choose AT_LEAST_ONCE over EXACTLY_ONCE for better performance if the pipeline can gracefully deal with duplicates. More details are found here.

A complete example of available cluster configuration can be found here.



This blog post aimed to help Hazelcast users bootstrap a Hazelcast cluster in a Kubernetes-enabled environment by deploying and managing Hazelcast clusters and writing stateless streaming jobs. We also explained how to integrate your Hazelcast pipeline with Kafka by submitting a simple job that, given a sentence as input, emits the number of words in that sentence. Don't hesitate to share your experience with us in our community Slack or GitHub repository.

Further reading






[1] https://www.cncf.io/reports/helm-project-journey-report/ 

[2] https://hazelcast.com/product-features/management-center/ 



To boldly go where no stream processing platform has gone before!

If you’re already familiar with the Hazelcast Platform, see our ‘Show me the code’ blog post, which jumps right into 5.2 features and shows you how to use them.

Activate the countdown clock and prepare for launch!

T-6 hours and counting…

Akin to loading the rocket’s external tanks with liquid hydrogen and liquid oxygen propellants, bounded (batch) data and unbounded (streaming) data can be loaded into the Hazelcast Platform using Hazelcast Connectors. These out-of-the-box connectors provide an easy and efficient way to access data sources (wherever they are, including files, messaging systems, databases, data structures, etc.) and support multiple data formats (text, CSV, JSON, and Avro).

Now with Hazelcast Platform 5.2, the Hazelcast JDBC Connector provides a configuration-driven capability to connect to any JDBC-compliant external data store. New data stores can also be added dynamically at runtime. SQL over JDBC can map tables in external data stores to relevant data structures in the Hazelcast Platform, along with performing CRUD operations. The zero-code connector kicks this up a notch; it enables the setup of read-through / write-through / write-behind data access to a relational database (no need to programmatically load data using MapStore!). This connector is in beta and currently supports AWS RDS for MySQL and PostgreSQL. This release further improves the Change Data Capture (CDC) Connectors based on Debezium that turn databases into streaming data sources.

T-9 minutes and counting… 

Main engines start… The Hazelcast Platform, powered by the proven low-latency data store and real-time stream processing engine, provides a unified platform for real-time applications. The platform does the heavy lifting of ingesting, processing, storing data, and making it available in one distributed, easy-to-scale, highly available cluster so that developers can focus on the business logic.

The low-latency data store supports multiple serialization schemes (Serializable, Portable, etc.), and we’re happy to announce GA status for compact serialization. Compact serialization is the recommended and de facto standard object serialization mechanism. It is highly performant as it separates the schema from the data, uses less memory and bandwidth and supports partial deserialization of fields without deserializing the whole object during queries or indexing. 

In case of engine failure, we have you covered as we've added more self-healing capabilities! The automated cluster state management for persistence on Kubernetes was enhanced to support cluster-wide shutdown, rolling restart, and partial member recovery from failures.

Hazelcast Platform 5.2 significantly enhances the SQL capabilities of the Calcite-based SQL engine residing atop the underlying stream processing engine. SQL-based data pipelines can be constructed to ingest and transform data. ANSI-compliant SQL can filter, merge, enrich, and aggregate data. Streaming SQL with temporal filters for both tumbling and sliding windows is supported. As an example, the average price of a stock can be computed for the past hour, or for every hour since the market opened, simply by defining the sliding window. SQL can be used to combine multiple streams and handle late-arriving records using watermarks. Combining streams of orders and shipments, as an example, can help generate more accurate fulfillment data on a real-time basis. SQL select statements together with JSON Path syntax can be used to query JSON objects and compute object and array aggregations. We have also extended SQL support to query nested Java objects.

The latest release delivers on the Hazelcast Platform’s capability to support storage of much larger volumes of data while maintaining low latency access to this data. Tiered Storage, an enterprise feature, provides spill-to-disk functionality which eliminates the constraint on storage being limited by the cluster’s memory size. Storing data across tiers, memory and disks, allows larger volumes of data to be stored while the intelligent migration of this data across memory regions and disks allows us to provide low-latency access. 

So how do I access data across the tiers? Users can simply use SQL to transparently query these larger datasets spread across memory and disk. Tiered Storage lowers TCO by reducing the reliance on separate storage technologies, and it does this while delivering low-latency storage in a reliable, consistent manner. We expect the ability to scale further in a cost-effective manner as we introduce additional tiers such as Amazon S3.

T-0: ignition & liftoff!

The data has now been ingested, processed, and stored; it is ready for consumption! The familiar Management Center, which provides cluster management and monitoring tools, has an enhanced "SQL browser" to execute streaming SQL queries. This release also introduces the Hazelcast Command-Line Client (CLC) in beta. CLC is a command-line tool to connect to and operate on the Hazelcast Platform. Popular SQL tools such as DBeaver can also be used to connect to the Hazelcast Platform.

And off it goes…

We're ready to hand over the launch codes (I mean, the license keys) to you. You can take the Hazelcast Platform spaceship on an interplanetary test drive with Hazelcast Viridian Serverless, or just explore the new features in the comfort of your enterprise data center!

Wrapping Up

You can read more about what’s new and what’s been fixed in our release notes. What’s more, in GitHub, we’ve closed 250 issues and merged 540 PRs with Platform 5.2.

If you'd like to become more involved in our community or just ask some questions about Hazelcast, please join us on our Slack channel, and also please check out our new Developers homepage.

Again, don’t forget to try this new version and let us know what you think. Our engineers love hearing your feedback and we’re always looking for ways to improve.  You can automatically receive a 30-day Enterprise license key by completing the form at https://hazelcast.com/trial-request/ 

I’d like to thank James Gardiner and Nandan Kidambi for their contributions and review of this blog post.

All trademarks are the property of their respective owners.

Leading businesses have already benefited from reduced costs, better agility, and increased revenue in the short time that they have been delivering real-time services to their ever-demanding consumers. The key to their success was to embrace the challenges that real-time processing presented at the time in exchange for class-leading, exemplary customer experiences and the associated benefits that came with that.

But true real-time is hard to achieve. That is, being able to act on information in the moment. While, when, during and before an event, not after. Many providers still claim to deliver real-time, while waiting on databases and/or other services, causing lag and processing data after the event and missing out on real-time benefits. True real-time involves continuously collecting, analysing, canonicalizing and performing functions (like processing against rules or machine learning models) on the data “in-flow” outside of a database or any other service – in other words, the entire process should be run on very low latency storage and delivered to users or downstream processes long before it touches a database.

With Hazelcast Platform 5.1, we delivered true real-time capabilities that would help drive greater innovation. Our platform hosted a high-performance stream processing engine tightly coupled with low latency storage and machine learning inference capabilities. This enabled developers to quickly build high-performance, intelligent, business applications that could ingest huge volumes of event-based data and deliver valuable real-time insights for any use case.

True real-time involves continuously collecting, analysing, canonicalizing and performing functions on the data “in-flow” outside of a database or any other service.

We also enabled businesses to deploy where they wanted – in any cloud, on-premises or hybrid – and delivered a fully managed service for those not wanting to manage infrastructure operations and provided connectivity to any data source and destination, benefiting omnichannel partnerships by providing a seamless experience across the divide.

Fortune favours the curious!

So true real-time processing sets leading businesses apart from the rest. But new ways of discovering value and driving innovation are being sought in an ever-competitive market.

Even more vast streams of data, in various formats from many distinct sources, are being generated from social media, smart devices, mechanical device sensors, et al. And, that volume is growing at an alarming rate. Given the proven benefits of real-time thus far, businesses are curious whether hidden patterns exist within untapped, disparate data that can offer undiscovered insights. Processing more quality data can yield more accurate, context-rich insights, removing the need for guesswork and assumptions when acting on data. To do this, you need the ability to ingest and process against even larger volumes of data while delivering value within the business SLA. 

In addition, there is a need to grow the data analysis user base. Historically, only highly qualified specialists would be capable of analysing data. With more modern, digital, intuitive applications emerging, the need to enable access to business-savvy users increases, thus reversing the historical trend of data analysis being a niche, polarised activity. Imagine larger numbers of users acting on enriched and scored data while still “in flow.” Rather than a funnel approach to discovering insights, broadening the scope to more users with various skills and expertise will drastically increase the chances of finding many insights from different business areas. Analysing data using SQL increases the likelihood of analysis adoption as many organisations use SQL to create and manipulate data.

Innovation creates opportunities. With vast quantities of data from various data sources, there is an appetite to determine whether insights can be realised by correlating multiple streams of changing event data based on conditions common to both streams of events. 

For example, imagine an inventory operations application that tracks when orders have been shipped. You would have two streams of event-based data: one for orders and one for shipments. You would need to join these streams on the order-id to find out which orders were successfully shipped. Or a railway service provides real-time updates on train arrival and departure times and whether a passenger will miss their train based on their current location and travel speed. This would involve processing two streams of information: the passenger’s GPS position and the train’s GPS position, in real time. You would need to join these two streams with a common condition like the train’s identification – generated when the passenger purchased their ticket.

In both these use cases, static reference information would be of limited value as one condition would not be in real-time. So correlating constantly changing data from different sources can add another dimension to data enrichment based on moving data rather than static.

Reinforcing the Real-Time Economy with Hazelcast 

Hazelcast Platform 5.2 extends on what was already a proven solution to the true real-time problem, with the three core tenets in mind:

Acting on vast, continuous streams of data, immediately

Extending low-latency storage removes limitations on the amount of data that can be realistically processed. Tiered storage enables data to extend to high-performance SSDs where use cases require vast volumes of data growing to tens or hundreds of terabytes. This adds the additional benefit of cost optimisation and deployment practicality (rightsizing the number of clusters, for example). With fraud detection, more historical reference information (i.e., 10 years rather than 2 years) creates a more comprehensive picture, reducing the chances of assumptions and guesswork. More reference data can lead to greater analytical accuracy. 

Providing the easiest access to data for everyone, everywhere, at any time

Zero code connectors broaden the appeal for non-technical business leaders to create enterprise applications, allowing them to customise the functionality to their specific needs due to a more simplified application development approach. A familiar declarative command structure simplifies joining reference data from relational databases to data streams for a broader range of users. Streaming SQL provides flexibility for managing streamed data exceptions, and JSON support simplifies integration with the vast majority of JSON-based technologies.

Uncovering priceless revelations from seemingly distinct, de-coupled but concomitant data moments 

Stream-to-stream joins enable the matching of continuous streams of related data to determine correlations “in-flow” as the data is being created. Answering more complex questions as concurrent events unfold allows businesses to present “what if” scenarios across previously deemed inaccessible or unfeasible data. 

Other capabilities broaden the range of options for better management, even lower administration, and greater productivity.

Get Started with Hazelcast

For additional information on our latest release, check out our press release, Hazelcast Alleviates ‘Wait-and-See’ Paradigm Limiting Real-Time Applications, and our technical blog, The Hazelcast Platform Rocketing to the Next Level.

You can get started with the software by downloading it from here. Also, check out our Hazelcast Viridian Serverless offering, which helps you get started with real-time applications in the cloud. A sign-up form is on the aforementioned download page.

This blog is written for those who are already familiar with Hazelcast and prefer to look at code snippets to get started quickly. For a list of new features in Hazelcast Platform 5.2, I recommend starting with this blog.

Compact Serialization

Historically, Hazelcast has supported multiple serialization schemes (Serializable, Portable, etc.), and we are happy to announce that compact serialization is now GA and the recommended object serialization mechanism, as it offers several advantages.

Compact serialization separates the schema from the data and stores it per type, not per object which results in less memory and bandwidth usage as compared to other formats. It also supports schema evolution, and partial deserialization of fields without deserializing the whole object during queries or indexing.

Let’s look at how we would use compact serialization for an “Employee” class. 

public class Employee {
    private long id;
    private String name;

    public Employee(long id, String name) {
        this.id = id;
        this.name = name;
    }

    public long getId() {
        return id;
    }

    public String getName() {
        return name;
    }
}
As you might've noticed already, compact serialization does not require any changes in the user class, as it doesn't need the class to implement any interface! Hazelcast will automatically extract the schema out of classes and Java records using reflection, then cache and reuse it for subsequent serialization and deserialization at no extra cost.

The underlying format of the Compact serialized objects is platform and language independent.

For more details, please check out our latest documentation on compact serialization.

JDBC SQL Connector

Instead of programmatically configuring an external data store as a JDBC source / sink, it is now possible to define the JDBC connector declaratively using XML / YAML as shown below. This connector is currently in beta.

hazelcast:
  external-data-store:
    mysql-database:
      class-name: com.hazelcast.datastore.JdbcDataStoreFactory
      properties:
        jdbcUrl: jdbc:mysql://dummy:3306
        username: xyz
        password: xyz
      shared: true

To query these external data stores, you can simply create a mapping with the JDBC SQL connector. In the example below, we are creating a mapping to a JDBC connector that references the mysql-database as the external data store.
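A mapping of that shape might look like the following sketch (the table name people and its columns are illustrative, and the option name externalDataStoreRef assumes the 5.2 beta syntax):

```sql
CREATE MAPPING people (
  id INT,
  name VARCHAR
)
TYPE JDBC
OPTIONS (
  'externalDataStoreRef' = 'mysql-database'
);
```

Once the mapping exists, plain SQL such as SELECT * FROM people runs against the external database.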


Zero-Code Connector (MapStore / MapLoader)

Many of you are already familiar with using MapStore and MapLoader with Hazelcast. MapStore configuration can now support handling operations asynchronously to avoid blocking partition threads. Platform 5.2 introduces a low-code approach with a generic MapStore that requires little or no Java code. The generic MapStore is a pre-built implementation that connects to an external data store, using the external data store configuration.

This connector is currently in beta and supports AWS RDS for MySQL and PostgreSQL.

hazelcast:
  map:
    my-map:    # illustrative map name
      map-store:
        enabled: true
        class-name: com.hazelcast.mapstore.GenericMapStore
        properties:
          external-data-store-ref: my-mysql-database
          table-name: test

Streaming SQL 

In Platform 5.1, we added SQL support for aggregating streams into tumbling and hopping windows. This provided the ability to run functions and aggregations, such as sum and count, snapped to a time window.

We've now added support to combine multiple data streams. In the example below, we are combining the data from the shipments and orders topics in Kafka.

We start by creating a mapping to the Kafka topic e.g., shipments.

CREATE OR REPLACE MAPPING shipments (
  id VARCHAR,
  ship_ts TIMESTAMP WITH TIME ZONE,
  order_id INT,
  warehouse VARCHAR
)
TYPE Kafka
OPTIONS (
  'keyFormat' = 'varchar',
  'valueFormat' = 'json-flat',
  'auto.offset.reset' = 'earliest',
  'bootstrap.servers' = 'broker:9092');

Then, we create a query view which requests all events from shipments Kafka topic, and allows a 1-minute lag for incoming events.

CREATE OR REPLACE VIEW shipments_ordered AS
  SELECT * FROM TABLE(IMPOSE_ORDER(
    TABLE shipments,
    DESCRIPTOR(ship_ts),
    INTERVAL '1' MINUTE));

CREATE OR REPLACE VIEW orders_ordered AS
  SELECT * FROM TABLE(IMPOSE_ORDER(
    TABLE orders,
    DESCRIPTOR(order_ts),
    INTERVAL '1' MINUTE));
Finally, we join the events from the orders_ordered and shipments_ordered streams to get all shipped orders within a 7-day time window, in real time:

SELECT o.id AS order_id,
       s.id AS shipment_id
FROM orders_ordered o
JOIN shipments_ordered s
  ON o.id = s.order_id
 AND s.ship_ts BETWEEN o.order_ts AND o.order_ts + INTERVAL '7' DAYS;

More details on this example can be found in our joining multiple streams tutorial.

You can try all these SQL examples using the SQL Browser in the latest Hazelcast Management Center. 

Tiered Storage

Tiered Storage offers the opportunity to store datasets in Hazelcast that are much larger than the available memory.  For example, in the past, you would need to configure an eviction policy to remove least frequently used (LFU) or least recently used (LRU) data from the memory store when limits approach.  Now with Tiered Storage, Hazelcast can move data automatically between tiers, between memory and disk.

If your Enterprise license was generated before Hazelcast Platform version 5.2, you’ll need a new Enterprise license that enables the Tiered Storage feature and you will need to run Platform 5.2 or later.

Configuration (YAML) for tiered storage:

hazelcast:
  native-memory:
    enabled: true
  local-device:
    my-disk:
      base-dir: "tiered-store"
  map:
    my-map:    # illustrative map name
      in-memory-format: NATIVE
      tiered-store:
        enabled: true
        memory-tier:
          capacity:
            unit: MEGABYTES
            value: 256
        disk-tier:
          enabled: true
          device-name: "my-disk"

Tiered Storage is disabled by default; set the enabled parameter to true to enable it.

The capacity of the memory tier, which stores the frequently accessed data, cannot be set to 0. The default value is 256 MB. Other unit options are BYTES, KILOBYTES, MEGABYTES, and GIGABYTES.

The disk-tier settings enable using disk as an additional (overflow) storage tier and specify the name of the device (disk) to use.

For more details on configuring Tiered Storage, you can refer to this page.

Wrapping Up

In addition to these features, we've made improvements to our CP Subsystem and split-brain healing. The automated cluster state management for persistence on Kubernetes was also enhanced to support cluster-wide shutdown, rolling restart, and partial member recovery from failures.

You can read more about what’s new and what’s been fixed in our release notes. What’s more, in GitHub, we’ve closed 250 issues and merged 540 PRs with Hazelcast Platform 5.2.

If you'd like to become more involved in our community or just ask some questions about Hazelcast, please join us on our Slack channel, and also please check out our new Developers homepage.

Again, don’t forget to try this new version and let us know what you think. Our engineers love hearing your feedback and we’re always looking for ways to improve.  You can automatically receive a 30-day Enterprise license key by completing the form at: https://hazelcast.com/trial-request/

Finally, I’d like to thank Frantisek Hartman and Sasha Syrotenko for their contributions to this blog.

Whenever I talk about today’s business trends with industry analysts who cover data platform technologies, they often mention the misconception in the market about the relationship between real-time responsiveness and software speed. They want to be clear to all business professionals, especially the IT folks, that just because your infrastructure components are fast doesn’t mean you can deliver real-time responsiveness. Speed is a necessary component of real-time systems, but it alone is insufficient. That’s because “real-time” is not the same as speed. There’s undoubtedly a strong correlation between the two concepts, but real-time is about more than just speed.

While real-time systems often rely on low-latency responsiveness, simply having a low-latency system is not enough to deliver a real-time system. A perfect example is the set of database technologies that let you run extremely fast end-user queries. You might get a low-latency result set for your queries, but that ignores the fact that your data has been sitting idly in your database, waiting for an end user to query it. By contrast, there is enormous value in acting on data as soon as it is created, and fast database queries do not address that newborn data. As a result, you’re missing out on real-time business opportunities.

Even though it’s incorrect to do so, we understandably equate the two concepts of real-time and speed because of the real-world examples we commonly associate with real-time. Airbags, self-driving cars, and responsive web pages, to name a few examples, all have a critical need for speed, and all have built-in service-level agreements (SLAs) set by the designers of these systems to suit the expectation of the given scenario. You don’t want an airbag to take 2 seconds to inflate, so the SLA is much shorter than that. And for web pages, you don’t need microsecond responsiveness since humans can’t operate at those speeds (not to mention that we don’t even have the technology infrastructure to run at that level), so the SLA can be a little more relaxed. In both cases, there are expectations on what is necessary for speed, so we conclude that speed is the key.

However, it is easy to view speed (aka "performance") too narrowly. To extend the previous example, we know that an airbag must inflate in less than 30 milliseconds, and if it does so, it has succeeded. But remember that other processes must happen before the inflation of the bag. The airbag system must accurately detect a crash and then send a signal to inflate the bag. If this detection and alerting take too long, it doesn't matter how fast the bag inflates. Imagine having a detection system that takes 5 seconds to identify a crash, but the airbag inflates in 30 milliseconds. That doesn't sound like real-time to anyone.

Actual real-time IT infrastructure today should be able to deliver “right now,” no matter where your data is in its lifecycle. What is the freshness of your information–was it just now created? Is it in flight from point A to B? Or is it already stored in a database? Each of these lifecycle stages is critical if real-time is the objective.

For example, if your real-time data represents a customer interaction, can your system act on it “right now”? Or is your system only designed for action after the data gets stored somewhere? Your IT infrastructure can add significant business value if it acts on data when a necessary step is taking place, like while a customer is banking, during a fraudulent event, while someone is shopping online, or when a patient needs attention, to offer a few examples.

These types of value moments are difficult to capture without the right technologies. Hazelcast solves businesses’ challenges when deploying real-time responsiveness by providing a robust data platform that allows you to act on data immediately. The Hazelcast Platform provides a high-performance stream processing engine integrated with a low-latency data storage engine to enable automated responses on real-time data so that “real-time” becomes more than just a fast query at the tail end of your data lifecycle.

Please contact us to learn more about how Hazelcast can help you in your real-time initiatives, and we’ll be happy to share more information. For you technical folks, learn more about Hazelcast by signing up for Hazelcast Viridian Serverless, the easy-to-get-started, cloud-managed service powered by the Hazelcast Platform. You can get started for free in the cloud to see how real-time technology can help with your IT projects.

In this blog post, we are happy to work with our community member, Nico Krieg, a software developer, DevOps engineer and professional caffeine consumer. Below is Nico’s perspective on testing a Hazelcast cluster. Enjoy the read!

If you’re a software engineer, testing your source code is likely a vital part of your daily business, and it’s also likely that some (ideally, a high) degree of automation drives the tests. Maybe you’re even a fan of test-driven development, in which case you’ll surely agree that iterating between writing test code and writing business code is a super useful approach for improving code quality. As a software engineer, you need quick, reliable, and repeatable feedback on the fitness of your release candidates – the quicker the feedback arrives, the better, because it allows you to speed up your iterations on the code. Automating the testing process is the foundation for achieving this goal. The sheer abundance and variety of the various testing tools, libraries, and frameworks highlight the importance of building powerful automation for testing source code.

Testing A Hazelcast Cluster 

At this point, you may be a bit puzzled by the introductory section – what on earth does the software engineer’s view on testing have to do with Hazelcast and Hazeltest? But, trust me, it will all make sense, so stay with me to connect the dots.

Testing is Testing is Testing 

You may have already guessed where this is going: if you’re testing not a piece of source code but a Hazelcast cluster – or any technology in IT that has to fulfill a certain set of requirements – the kind of release candidate you’re supposed to test for fitness will be different, and hence the tests, as well as the testing process, will differ. The goal, however, is pretty much the same for both: to get quick, reliable, and repeatable feedback on the release candidate’s fitness, where fitness is defined as the ability of the release candidate to satisfy a certain set of requirements. And, just as in the software engineer’s view on testing, automation plays an important role in achieving this goal.

Experiences Thus Far 

Because bits of the testing process and tools are a result of how we – a small team in a large corporation of the financial sector – use and deploy Hazelcast, we’ll have to take a look at this first. Don’t worry, I’ll make this as short as possible.

How We Use Hazelcast 

Our team provides Hazelcast as a consumable middleware component in the corporation’s Kubernetes clusters. The means for formulating the “deployment contract” and rolling out Hazelcast is Helm, so the release candidates in our case are not the Hazelcast clusters as such, but the Helm charts which produce them. Hazelcast has been introduced to the corporation only very recently, so the processes surrounding the technology are still being shaped.

Internal teams responsible for various business applications tell us their requirements for data structures in Hazelcast (e.g. the naming and configuration of the data structures as well as the number and size of the entities to be stored, and the number of clients that will connect to the Hazelcast cluster). Other requirements for the more general architecture of the Hazelcast cluster, such as how the WAN replication is set up, are the result of a long analysis phase conducted shortly before I joined the project. When a team wants a configuration for one of their data structures to be updated, we modify the Helm chart to be deployed for the next release, thus creating a new release candidate.

There are two dimensions to making sure those release candidates are fit for production, namely, validating the chart’s syntactic correctness, and verifying the ability of the resulting Hazelcast cluster to satisfy all client requirements. The first part is very easy and, in its simplest form, can be achieved by simply running a helm lint in the build pipeline (we do go a bit beyond that in our pipeline with the help of a Helm chart quality gate that performs some additional checks on the rendered chart, but those checks, too, are relatively simple). The second bit – testing the Hazelcast cluster – is substantially more complex because the configuration of all data structures plus the configuration of the cluster itself have to be tested.

The next two sections introduce the two approaches we’ve used thus far for testing Hazelcast. (Spoiler alert: Neither was satisfactory.)

Generate Load on Applications 

The most obvious way for testing our Hazelcast clusters is to simply use the aforementioned business applications to generate load. They get deployed to the same Kubernetes clusters as containerized workloads and use Hazelcast to store state information, so when their endpoints are queried, they will put entries into specific data structures in Hazelcast. Thus, one can generate load there by simply invoking the applications’ endpoints, and while the teams responsible for them do that anyway for testing the integration with Hazelcast, this approach has a couple of drawbacks that disqualify it as means for us to perform automated testing:

  • We don’t have access to the applications’ source code, so it’s hard to understand exactly how each application uses Hazelcast.
  • The applications’ APIs are not under our control – some endpoints require authentication with test users we’re not allowed access to, so in these cases, we have to reach out to the responsible team simply to tell them to press the Go go gadget load test button somewhere, which gets old pretty quickly.
  • Most importantly, the way the applications behave is – of course – dictated by their business requirements, so even their developers can’t change the behavior arbitrarily – for example, when an application requires two maps in Hazelcast to function correctly, we can’t tell the developers to make it use 5,000 instead.

So, (mis)using the business applications for load generation on Hazelcast doesn’t seem to be a very effective approach. What other options are there?

Use PadoGrid 

In case you haven’t heard of PadoGrid yet, it might be worth a look if you work with IMDG, messaging, or computing technologies such as Hazelcast, Redis, Hadoop, Kafka, or Spark. According to the project’s GitHub repo, the application (…) aims to deliver a data grid platform with out-of-the-box turnkey solutions to many enterprise architecture use cases. It’s easy to get started, so if the above sounds interesting to you, go give PadoGrid a try!

In the early days of testing our Hazelcast clusters, we almost exclusively used PadoGrid for load generation. To do so, we installed PadoGrid as a Pod next to the Hazelcast Pods, injected a test configuration, and then started an embedded test application, which gave us the ability to ingest as much data and as many entries as we wanted into the configured maps. This was sufficiently powerful to exhaust the memory and WAN replication capacity of the Hazelcast cluster under test in order to verify the cluster remained stable.

Testing with PadoGrid was notably better than using the business applications, but because PadoGrid has a much wider scope than merely testing Hazelcast, the load generation mechanism is subject to some limitations, which became an issue once the internal applications wishing to consume Hazelcast became more numerous and their requirements more sophisticated:

  • PadoGrid offers its own simple, properties-based language for defining tests, and with it, it’s very easy to formulate a short-running test that accesses a small number of data structures. But – revisiting an example made earlier on – to use, say, 5,000 maps, you’d have to hard-code 5,000 data structures, plus the operations to perform on them, into the test configuration. PadoGrid-based testing is therefore effectively limited to two load dimensions, namely, the number of entries and their size (hence the total amount of data to be stored in Hazelcast). For Hazelcast, however, the notion of load has additional dimensions: the number of maps (which you can control in PadoGrid tests, but only to a very limited extent) and the number of clients. (Simply scaling out the number of PadoGrid instances would not have yielded a correspondingly increased number of clients, because the embedded testing app has to be launched within the Pod on an injected test configuration. An automated start would have been possible by assembling a new OCI image with a modified entry point and injecting the test configuration via a ConfigMap in the Helm chart, but this wouldn’t have solved the other issues.)
  • There is no possibility to configure randomness in the test configuration, so the load generated is very static and hence not realistic.
  • The PadoGrid test language does not allow for defining test loops, so there is no way to come up with a long-running test (beyond simply copy-pasting the same test group definitions over and over again, that is).
  • PadoGrid is not very talkative, so it’s difficult to figure out whether, for example, a null read on a key constitutes an issue with the Hazelcast cluster (e.g. memory exhausted, eviction kicked in) or is due to a faulty test configuration.

Sitrep: “Improvement Desirable” 

The testing approaches outlined so far, then, did not provide the flexibility and control required to come up with the load scenarios necessary to test all aspects of our Hazelcast clusters. As a result, we found ourselves in a situation where we weren’t able to reproduce some issues reported by the teams consuming Hazelcast, making the troubleshooting very inefficient. Thus, we needed a way to improve on the versatility of load generation.

You may have also noticed that “load generation” was used almost synonymously with “testing” in the sections above. This inaccuracy is deliberate: load generation is the first aspect of testing Hazelcast clusters, so getting it right is the minimum requirement for verifying that a specific configuration is fit for production. Because we hadn’t gotten that aspect right yet, it received most of the emphasis, but there are other aspects, too, of course, namely, collecting the load generation results (i.e. reporting) and then acting on those results (i.e. their evaluation). Spoiler alert: Hazeltest will focus mainly on load generation, too.

Introduction To Hazeltest 

The necessity for more versatile load generation gave birth to Hazeltest – a manifestation of the idea that we needed a dedicated application for testing our Hazelcast clusters. I decided to implement Hazeltest in my private time so the application can be open source, and my hope is that teams in other organizations may draw some benefit from the application, too.

Let’s start our tour of Hazeltest with a list of requirements for the application.


Based on the drawbacks identified in the testing approaches employed thus far, Hazeltest must satisfy the following requirements:

Ease of use. Hazeltest must provide intelligent test logic supplied with default configurations that make sense for most use cases. The idea here is that launching a test against a Hazelcast cluster should be as straightforward as simply firing up the application.

Configuration flexibility. Yet again revisiting an example made earlier: whether an application uses 2 maps in Hazelcast or 5,000 matters significantly, even if the number of entries stored across them and their total size are roughly equal, and the same is true for other data structures. This is because Hazelcast must do some housekeeping work for each instance of a data structure, and hence the load generation possibilities become a lot more numerous when Hazeltest supports quickly and easily adjusting the number of used data structures. Other areas where flexibility is required include the size and number of entries, the number of clients, and the timing of operations. Being able to configure these things flexibly makes Hazeltest more versatile because it can generate a much wider range of load scenarios.

Elaborate logging. Understanding how complex workflows play out in various load or error scenarios on the client side helps isolate misconfigurations on the Hazelcast cluster side, hence Hazeltest – acting as the client – should inform elaborately about what it’s doing all the time. This information should be provided in such a way that logging platforms like Splunk or your friendly neighborhood ELK stack can easily index, search, and visualize the data.

Optimized for Kubernetes. As mentioned earlier, we run Hazelcast on Kubernetes, and all business applications consuming it get deployed to Kubernetes, too, so Hazeltest will run exclusively on Kubernetes (at least in this context, but designing the application to run on Kubernetes doesn’t mean one couldn’t simply launch the binary as-is on a trusty Linux machine, for example). For Hazeltest, this means the following:

  • Necessity for liveness and readiness probes so the Kubernetes Deployment Controller won’t have to fly blind when rolling out a new version, for example
  • Small resource footprint so our Kubernetes clusters will be able to run a four-digit number of instances quite comfortably
  • Scale-outs (e.g. going from one replica to 500) should happen quickly, so the application must have a very short bootstrap time
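The first of these points can be sketched in a few lines of Go. The following is a minimal illustration (not Hazeltest’s actual probe code; endpoint paths and the port are assumptions) of how a liveness probe can report process health while a readiness probe withholds success until the Hazelcast connection is established:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready flips to true once the client has connected to Hazelcast; until
// then, the readiness probe reports 503 so Kubernetes keeps the Pod out
// of rotation.
var ready atomic.Bool

// Liveness only asserts the process is responsive.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

func readinessHandler(w http.ResponseWriter, r *http.Request) {
	if !ready.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// probe invokes a handler directly so the example needs no running server.
func probe(h http.HandlerFunc, path string) int {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	// In the real application, the handlers would be registered on a server:
	//   http.HandleFunc("/readiness", readinessHandler)
	//   http.ListenAndServe(":8080", nil)
	fmt.Println("readiness before connect:", probe(readinessHandler, "/readiness"))
	ready.Store(true) // Hazelcast connection established
	fmt.Println("readiness after connect:", probe(readinessHandler, "/readiness"))
	fmt.Println("liveness:", probe(livenessHandler, "/liveness"))
}
```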

Intelligent test logic. The concept of intelligent test logic was mentioned above, and it is so central to the usefulness of Hazeltest that it deserves its own requirement. Basically, the idea that can be derived from PadoGrid’s testing capabilities is this: The freedom provided by the ability to define tests by means of the aforementioned properties-based language in plain-text files is amazing, but comes at the cost that defining long-running, complex tests is tedious. So what if you don’t need that much freedom, but simply a couple of pre-built “test loops” able to generate realistic load on a Hazelcast cluster? So, after start-up, Hazeltest should immediately engage test components that create realistic and heterogeneous load on the Hazelcast cluster under test, and ideally, these components can easily be run in an endless loop and come with reasonable defaults.

Meet Hazeltest! 

Hazeltest is the result of the thoughts outlined above in combination with a whole lot of coffee, and you can find the application here on my GitHub.

The Foundation: Golang 

Hazeltest is implemented in Golang for the following reasons:

  • Container images for Golang applications tend to be tiny
  • Golang applications are enormously fast since they’re compiled directly into a binary file rather than requiring a virtual machine of some sort to be run
  • The language was built for concurrency, and as you’ll see later on, Golang’s concurrency features sit at the very core of Hazeltest
  • Golang is awesome and I’ve been on the lookout for a useful, real-world project to learn it in, so learning Golang by implementing Hazeltest kills two birds with one stone
  • As a bonus, Golang has the most awesome mascot since the dawn of mascots!

So far, it seems that Golang is the perfect foundation for Hazeltest – the application starts in less than two seconds on our Kubernetes clusters (including its readiness check becoming available, which requires a successful connection to Hazelcast), and its image weighs in at roughly 10 MB compressed, so even hundreds of instances pulling the image from the internal registry (in the case of imagePullPolicy: Always) isn’t that big of a deal. Plus, goroutines are a really good fit for the individual runners the application offers (you’ll see considerably more of those runners later on).

Most importantly, though, implementing Hazeltest in Golang has been tremendous fun thus far, and I’m looking forward to learning more of the language and to applying it in the context of Hazeltest.

Run, Forrest, Run 

At the very heart of a testing application sits – no surprise – some kind of testing logic, and in Hazeltest, this logic is encapsulated in self-sufficient runners: for each data structure available in Hazelcast (maps, queues, topics, …), a set of runners puts load on Hazelcast (that’s the goal, at least; at the time of writing, two runners are available each for maps and queues). Each runner has its own distinct “feature” in terms of the data it stores in Hazelcast, such that all runners combined create heterogeneous load. The runners are also where Golang’s easy-to-use goroutines come in very handy: each runner gets its own goroutine, so it’s isolated from the others. All runners for one kind of data structure instantiate a common test loop implementation, which encapsulates the actual interactions with the Hazelcast cluster, like ingesting, reading, and deleting data.

Encapsulating all configuration and a unique data structure in each runner also has a nice side effect on the source code: adding more runners does not increase complexity beyond making the program larger. Because the runner implementation is completely self-sufficient, a new runner only adds another unit to the source code and introduces no additional coupling to existing components other than to the test loop unit.
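The runner/test-loop architecture described above can be sketched as follows. All type and function names here are invented for the illustration (they are not Hazeltest’s actual API); the sketch shows the two structural ideas: a common test loop instantiated per runner, and one goroutine per runner for isolation:

```go
package main

import (
	"fmt"
	"sync"
)

// mapOps abstracts the interactions with a Hazelcast map so the shared
// test loop stays independent of any concrete client.
type mapOps interface {
	Ingest(key string, value any)
	Read(key string) any
	Delete(key string)
}

// testLoop is the common implementation every map runner instantiates.
type testLoop struct {
	runnerName string
	data       map[string]any // the runner's distinct test data set
	ops        mapOps
	numRuns    int
}

func (l *testLoop) run() {
	for i := 0; i < l.numRuns; i++ {
		for k, v := range l.data { // ingest batch
			l.ops.Ingest(k, v)
		}
		for k := range l.data { // read batch
			l.ops.Read(k)
		}
		for k := range l.data { // delete batch (the real loop deletes a random subset)
			l.ops.Delete(k)
		}
	}
}

// inMemoryOps stands in for a real Hazelcast client connection.
type inMemoryOps struct{ store map[string]any }

func (o *inMemoryOps) Ingest(k string, v any) { o.store[k] = v }
func (o *inMemoryOps) Read(k string) any      { return o.store[k] }
func (o *inMemoryOps) Delete(k string)        { delete(o.store, k) }

func main() {
	// Two runners, each with its own distinct data set, each isolated
	// in its own goroutine.
	runners := []*testLoop{
		{runnerName: "pokedex", data: map[string]any{"bulbasaur": 1, "ivysaur": 2}, ops: &inMemoryOps{store: map[string]any{}}, numRuns: 2},
		{runnerName: "load", data: map[string]any{"payload-0": "xxxx"}, ops: &inMemoryOps{store: map[string]any{}}, numRuns: 2},
	}
	var wg sync.WaitGroup
	for _, r := range runners {
		wg.Add(1)
		go func(l *testLoop) {
			defer wg.Done()
			l.run()
			fmt.Println("test loop done for runner", l.runnerName)
		}(r)
	}
	wg.Wait()
}
```

Adding a third runner in this design means adding another `testLoop` instantiation with its own data set – no existing component changes, which is the low-coupling property described above.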

To honor the “ease of use” requirement, Hazeltest immediately starts all enabled runners after bootstrap has completed, so creating more load on Hazelcast is as simple as spawning more Hazeltest instances. In addition to that, the runners are configurable by means of a simple YAML file, and each runner comes with sensible defaults so it is not mandatory to configure all properties.


Live Demo 

Using Hazeltest 

From here on, if you would like to follow along, you’ll need a reasonably juicy Kubernetes cluster at your disposal. In case you don’t have one yet, you may find this and this material useful.

Once your Kubernetes cluster is ready, clone the Hazeltest repo and navigate to the resources/charts directory. All commands and explanations in the following sections involving a file system path assume you’re in this directory.

Enabling And Disabling Runners 

At the time of writing, four runners are available in Hazeltest: two for Hazelcast maps and two for Hazelcast queues. Each can be enabled and disabled individually by means of the hazeltest/values.yaml file, whose config element gets templated into a ConfigMap. The config element comes with one section for map tests and one for queue tests, encapsulating the map runners and queue runners, respectively. Each runner config, in turn, contains an enabled element you can use to enable or disable the runner in question. So, for example, the following config disables all runners but the map load runner:

config:
  queuetests:
    tweet:
      enabled: false
    load:
      enabled: false
  maptests:
    pokedex:
      enabled: false
    load:
      enabled: true

Recall a statement made earlier in this text: “the runners are configurable by means of a simple YAML file, and each runner comes with sensible defaults so it is not mandatory to configure all properties.” Said defaults reside alongside the source code itself in the form of the defaultConfig.yaml file (which you can find in the program’s client package), and you’ll notice that, by default, all runners are enabled. So, what the config shown above does is overwrite the values for these specific properties. (That being said, it wasn’t strictly necessary to provide the config.maptests.load.enabled property; it was included above only for clarity.) We’re going to make use of this overwriting mechanism a lot in the following sections. (In case you’re interested in how this mechanism arrived at its current state, this material might answer your questions.)

Configuring Sleeps 

Once Hazeltest has started, all enabled runners will acquire a connection to the Hazelcast cluster under test and then start generating load by performing the interactions typical for the data structures they use in Hazelcast (inserts/reads/deletes for maps, put/poll for queues, etc.). Sleep configurations have been introduced for each kind of runner so that users of the application can adjust the load generation to whatever they consider realistic for their clusters – after all, many business applications in organizations probably won’t hammer on their Hazelcast clusters continuously.

The test loop for each runner works in terms of action batches and runs. For example, the current implementation of the map test loop executes three consecutive action batches (ingest all, read all, delete some, where some is a random number) in each run, where the batch size is equal to the number of elements in the data the runner works with (for example, the PokedexRunner works with the 151 items of the first-generation Pokédex, hence its batch size is 151). So, if the runner instantiating this test loop, for example said PokedexRunner, is configured with numRuns: 10000, it will execute these three action batches 10,000 times, after which it will signal to its caller that it’s finished. At the time of this writing, the following sleep configs are available for map runners and queue runners:

  • Sleep between action batches: Makes the test loop sleep after having performed one batch of actions. For example, if the duration for this is configured to be 2,000 ms, a map test loop will sleep for 2,000 ms each time it has finished an ingest, read, or delete batch (or put/poll batch in case of a queue runner).
  • Sleep between runs: Makes the test loop sleep before executing each set of action batches that constitutes one run of the test loop. For example, in case of the map test loop, if the sleep is configured to last 5,000 ms, the runner will sleep that amount of time before executing its ingest, read, and delete batches.

In addition to that, an initial delay can be configured for the put and poll actions of queue runners (the idea being that, in a real-world scenario, there are usually two players involved in queue interaction, one which puts, and one which polls, and it’s fairly unrealistic for them to start precisely at the same time; rather, you’d probably expect the polling player to start a little later).

In terms of the runners’ YAML configuration, these sleeps can be configured as follows:

config:
  queuetests:
    # Using the TweetRunner as an example
    tweet:
      putConfig:
        sleeps:
          initialDelay:
            enabled: false
            durationMs: 2000
          # 'betweenActionBatches' and 'betweenRuns' analogous to map runner
      pollConfig:
        sleeps:
          initialDelay:
            enabled: true
            durationMs: 20000
          # 'betweenActionBatches' and 'betweenRuns' analogous to map runner
  maptests:
    # Using the PokedexRunner as an example
    pokedex:
      sleeps:
        # No 'initialDelay' for map runners
        betweenActionBatches:
          enabled: true
          durationMs: 2000
        betweenRuns:
          enabled: true
          durationMs: 3000

Right now, there is no option to randomize the sleep durations, which would probably make the load generated on Hazelcast more realistic, and I would like to add this as a feature in the future.

Adjusting Data Structure Usage 

Earlier on, configuration flexibility was mentioned as one of the requirements Hazeltest must fulfill, and one dimension of this flexibility is the number of data structures used in Hazelcast. The underlying idea for how to make this happen in the application is very simple: By appending various identifiers to the name of the data structure (or omitting the identifiers), the number of data structures Hazeltest will create and use in the Hazelcast cluster can be drastically increased (or decreased). At the time of writing, there are two identifiers available to be appended to the data structure name: The unique ID of the Hazeltest instance creating the data structure on Hazelcast, and the index of the goroutine performing actions on the data structure.
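The name-building idea boils down to simple string concatenation. The following hedged sketch (the function is invented for this example, not Hazeltest’s actual implementation; the hyphen separator matches map names like ht_pokedex-0 seen in the log excerpts later on) shows how the two append* switches change the resulting map name:

```go
package main

import "fmt"

// buildMapName appends the optional identifiers to the base map name,
// mirroring the append* switches described above. Illustrative helper,
// not the actual Hazeltest function.
func buildMapName(base, clientID string, goroutineIndex int, appendClientID, appendIndex bool) string {
	name := base
	if appendClientID {
		// Each Hazeltest instance gets its own batch of maps.
		name = fmt.Sprintf("%s-%s", name, clientID)
	}
	if appendIndex {
		// Each map goroutine gets its own map.
		name = fmt.Sprintf("%s-%d", name, goroutineIndex)
	}
	return name
}

func main() {
	// Both switches off: every goroutine in every instance hits one shared map.
	fmt.Println(buildMapName("ht_pokedex", "client-a", 3, false, false)) // ht_pokedex
	// Index only: ten goroutines -> ten maps, shared across instances.
	fmt.Println(buildMapName("ht_pokedex", "client-a", 3, false, true)) // ht_pokedex-3
	// Both on: each instance's goroutines get their very own batch of maps.
	fmt.Println(buildMapName("ht_pokedex", "client-a", 3, true, true)) // ht_pokedex-client-a-3
}
```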

Let’s see how this works by means of a simple example, again using the PokedexRunner as our object of configuration:

replicaCount: 1
# ...

config:
  maptests:
    load:
      enabled: false
    pokedex:
      enabled: true
      numMaps: 10
      appendMapIndexToMapName: true
      appendClientIdToMapName: false
      # ...

In this configuration, the LoadRunner is deactivated, so we can ignore its map settings. The PokedexRunner, on the other hand, is enabled and configured to spawn ten map goroutines. Because appendMapIndexToMapName is set to true, those ten goroutines will correspond to ten maps in Hazelcast since the runner will append the index of each goroutine to the map name (conversely, if this setting were false, even though the runner would still launch ten map goroutines, they would all be accessing the same map, so this would translate to only one map in the cluster). Meanwhile, appendClientIdToMapName is false, meaning the runners of this Hazeltest instance will not append this instance’s unique ID to the maps they use, so even if we had more than one Hazeltest instance running with the above configuration, the number of maps created in the Hazelcast cluster would still be ten as all instances would work with the same map names. On the other hand, if appendClientIdToMapName were set to true, each PokedexRunner in each Hazeltest instance using this configuration would work on its own batch of maps, thus significantly increasing the number of maps. In summary, with a replicaCount of 1 and only one runner active, the above configuration will create ten maps in the Hazelcast cluster.

Let’s see how we can drastically increase this number by tinkering around a bit with these properties:

replicaCount: 40
# ...

config:
  maptests:
    pokedex:
      enabled: true
      numMaps: 10
      appendMapIndexToMapName: true
      appendClientIdToMapName: true
      # ...
    load:
      enabled: true
      numMaps: 5
      appendMapIndexToMapName: true
      appendClientIdToMapName: true
      # ...

The most apparent change is that the number of replicas is much higher, but if the above configuration were identical to the first configuration apart from the replicaCount value, it would still give us very few maps (10 for the PokedexRunner and 5 for the LoadRunner). The trick is that this time, appendClientIdToMapName is true for both runners, so each runner in each Hazeltest instance will work on its very own batch of maps. The number of maps in each batch is controlled by the numMaps property. So, the above configuration gives us the following map counts:

  • PokedexRunner: 40 instances * 10 maps each –> 400 maps
  • LoadRunner: 40 instances * 5 maps each –> 200 maps

With a total of 600 maps, the load those Hazeltest instances will put on the Hazelcast cluster is quite a lot higher compared to the first example configuration. Thus, simply playing around with the number of Hazeltest instances, the number of maps used for each runner, and the two append* properties yields vastly different results in terms of the map usage on the Hazelcast cluster those instances work with. (The examples above were made with map runners, but the append* properties as well as how they play out in the runner behavior work in exactly the same way for queue runners, too.)
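The arithmetic behind these map counts can be captured in a small helper. The function below is written for this example (it is not part of Hazeltest) and encodes the rules explained above: with the index switch off, all goroutines share one map; with the client-ID switch on, each instance multiplies the count by its own batch:

```go
package main

import "fmt"

// effectiveMapCount computes how many distinct maps a runner configuration
// produces in the cluster. Helper written for this example only.
func effectiveMapCount(replicas, numMaps int, appendIndex, appendClientID bool) int {
	count := 1 // index switch off: all goroutines share a single map
	if appendIndex {
		count = numMaps // one distinct map per goroutine index
	}
	if appendClientID {
		count *= replicas // each instance gets its own batch of maps
	}
	return count
}

func main() {
	// First example: one replica, PokedexRunner only, client-ID switch off.
	fmt.Println(effectiveMapCount(1, 10, true, false)) // 10
	// Second example: 40 replicas, both runners with client IDs appended.
	pokedex := effectiveMapCount(40, 10, true, true)
	load := effectiveMapCount(40, 5, true, true)
	fmt.Println(pokedex, load, pokedex+load) // 400 200 600
}
```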

Short-Running Versus Long-Running Tests 

The following example will introduce the map runner config’s numRuns property in the context of configuring how long a test runs (the example is made for maps again because they are more often used in Hazelcast than queues, but the concept works in the same way for the queue runners with the exception that the queue runners’ configs embed these properties in a dedicated putConfig and pollConfig). Let’s use our ol’ reliable, the PokedexRunner, again, and disable all other runners:

config:
  queuetests:
    tweet:
      enabled: false
    load:
      enabled: false
  maptests:
    pokedex:
      numMaps: 1
      numRuns: 2
      sleeps:
        betweenActionBatches:
          enabled: false
        betweenRuns:
          enabled: true
          durationMs: 5000
    load:
      enabled: false

To make the logging output of the runner as straightforward as possible, I’ve given it only one map to work with, but the important piece of configuration here is the numRuns: 2 property in combination with the sleeps.betweenRuns config. They will produce a fairly short test duration of roughly 15 seconds, caused by two sleeps of 5 seconds each plus the time it takes for the map operations to complete. The following two excerpts from the Hazeltest Pod’s logs show this (watch out for the time field at the end of the line):

{"caller":"/app/maps/testloop.go:51","client":"f361035e-7ef9-4989-87fd-1b0508c382bb","dataStructureName":"ht_pokedex-0","file":"/app/logging/logging.go:142","func":"hazeltest/logging.(*LogProvider).doLog","kind":"timing info","level":"info","msg":"'getMap()' took 5 ms","operation":"getMap()","time":"2022-09-23T16:54:46Z","tookMs":5}


{"caller":"/app/maps/testloop.go:111","client":"f361035e-7ef9-4989-87fd-1b0508c382bb","file":"/app/logging/logging.go:142","func":"hazeltest/logging.(*LogProvider).doLog","kind":"internal state info","level":"info","msg":"map test loop done on map 'ht_pokedex-0' in map goroutine 0","time":"2022-09-23T16:55:00Z"}

(You could make the test even shorter if you were to disable the betweenRuns sleep – I’ve only activated it here to generate a more visible difference between the two timestamps.)

Similarly, to achieve a very long-running test, simply adjust the numRuns property upwards to whatever you need – in the default config, the value is set to 10,000 for the two available map runners, which, assuming the default sleep configs, will keep the runners busy for a couple of hours. In case you still need more, though, the internal property representing the number of runs is a uint32, so you can provide a maximum value of 4,294,967,295, which won’t produce an endless loop as such, but one running long enough it might just outlive you.

Load-Generation Example: Scenario 1 

So, with all the theory and basics out of the way, let’s finally get our hands dirty and put Hazeltest to some good use! The following paragraphs will introduce you to the first of three scenarios, each of which comes with its own configurations for both Hazelcast and Hazeltest.

Each scenario links to a subdirectory of the blog-examples repo containing values.yaml files in ascending order of usage (meaning you can also view the files for scenarios 2 and 3 there in case you wish to jump ahead, but please keep in mind these files are still work-in-progress). The hazeltest/hazeltest_livestream_1/scenario_1 directory contains all files we’re going to need in the following paragraphs.

The following is an excerpt from the Hazelcast configuration we’re going to start with (full file here), showing only the pieces of configuration relevant for this scenario in order to keep things short and simple:

    default:
      backup-count: 0
      max-idle-seconds: 30
    app1_pokedex:
      backup-count: 1
      max-idle-seconds: 600
      # ...
    app1_load:
      backup-count: 2
      max-idle-seconds: 120
      # ...

So, we have two application-specific map configs and one default config that attempts to protect the cluster from misbehaving clients (i.e. clients requesting maps that have not been explicitly configured) by setting a low max-idle-seconds value and not providing them with any backups for their map entries.

And here’s the challenger on the other side of the ring (again showing only the relevant bits, this one is the full file):

replicaCount: 2

# ...

config:
  # ...
  maptests:
    pokedex:
      enabled: true
      numMaps: 10
      appendMapIndexToMapName: false
      appendClientIdToMapName: false
      mapPrefix:
        enabled: true
        prefix: "app_1_"
        # ...
    load:
      enabled: true
      numMaps: 10
      appendMapIndexToMapName: false
      appendClientIdToMapName: false
      mapPrefix:
        enabled: true
        prefix: "app1_"
        # ...

Note how both append* properties have been set to false for both runners, so with a replica count of 2 and 10 map goroutines, we’ll still end up with only two maps in the Hazelcast cluster (granted, the naming of the numMaps property might be a bit misleading). Also, you’ve probably seen the deliberate error in this configuration: The map prefix for the PokedexRunner is app_1_ rather than app1_, and since there is no such map name in our Hazelcast config, the default config will apply for this map.
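The fallback behavior described above can be illustrated with a small sketch. Note this is a simplification for illustration only – it is not Hazelcast's actual configuration-pattern matcher, and the pattern names are made up:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveMapConfig mimics the behavior described above in simplified form:
// the requested map name is checked against the explicitly configured names
// (a trailing '*' acting as a wildcard), and if nothing matches, the
// "default" map config applies.
func resolveMapConfig(patterns []string, mapName string) string {
	for _, p := range patterns {
		if strings.HasSuffix(p, "*") && strings.HasPrefix(mapName, strings.TrimSuffix(p, "*")) {
			return p
		}
		if p == mapName {
			return p
		}
	}
	return "default"
}

func main() {
	patterns := []string{"app1_pokedex*", "app1_load*"}
	fmt.Println(resolveMapConfig(patterns, "app1_load-0"))     // matched explicitly
	fmt.Println(resolveMapConfig(patterns, "app_1_pokedex-0")) // typo -> falls back to "default"
}
```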

Let’s start Hazelcast and then generate some load with Hazeltest (because Hazelcast is deployed as a StatefulSet and I’d like to keep the time for re-creating a clean state as short as possible, the Hazelcast cluster only has one member, but you can, of course, adjust that to whatever you like):


# Install Hazelcast with Management Center
$ helm upgrade --install hazelcastwithmancenter ./hazelcastwithmancenter --namespace=hazelcastplatform --create-namespace

# Install Hazeltest once the Hazelcast Pod has achieved readiness
$ helm upgrade --install hazeltest-app1 ./hazeltest --namespace=hazelcastplatform

This should give you roughly the following:

# As always
$ alias k=kubectl

# Query state in namespace
$ k -n hazelcastplatform get po
NAME                                       READY   STATUS    RESTARTS   AGE
hazelcastimdg-0                            1/1     Running   0          14m
hazelcastimdg-mancenter-67d8b898b4-bvmvc   1/1     Running   0          14m
hazeltest-app1-856d9dc74b-447cv            1/1     Running   0          12m
hazeltest-app1-856d9dc74b-xbjl9            1/1     Running   0          12m

As you can see there, the Pods have been running for a while in my case, which means our Hazelcast cluster seems to be doing pretty okay. Let’s check the Management Center (exposed through a NodePort-type Service) in order to verify all is well:

Indeed, the cluster is doing fine, with the overall memory used for maps staying at roughly 80% of the available heap. Let’s also check out the maps those two Hazeltest instances have created:

So, while the two maps are not that far apart in terms of their entry count, the app_1_pokedex map occupies far less memory. The ensuing question, then, would be this: Is the default map configuration restrictive enough to protect the Hazelcast cluster if a client writes more and larger data?

To find out, here’s another Hazeltest file we’re going to apply shortly (only showing the relevant differences to the first Hazeltest file – you can find the full file here):

    # ...
    pokedex:
      # ...
      mapPrefix:
        enabled: true
        prefix: "app_2_"
        # ...
    load:
      # ...
      mapPrefix:
        enabled: true
        prefix: "app_2_"
        # ...

In other words, another misbehaving client makes its appearance, and this time, the LoadRunner will end up with the default map configuration, too. Once Hazeltest has been installed with this configuration (note the different release name!)…

# Deploy as 'hazeltest-app2'
$ helm upgrade --install hazeltest-app2 ./hazeltest --namespace=hazelcastplatform

# Setup watch for Hazelcast Pod so we can see if the member crashes
$ watch kubectl -n hazelcastplatform get po --selector="app.kubernetes.io/name=hazelcastimdg"
NAME              READY   STATUS    RESTARTS      AGE
hazelcastimdg-0   1/1     Running   0             54m

… the Hazelcast cluster will start to struggle:

The heap usage increases, and so, inevitably, the cluster member crashes with the infamous OutOfMemoryError:

$ watch kubectl -n hazelcastplatform get po --selector="app.kubernetes.io/name=hazelcastimdg"
NAME              READY   STATUS    RESTARTS      AGE
hazelcastimdg-0   1/1     Running   1 (41s ago)   57m

# Check logs of previous Pod
$ k -n hazelcastplatform logs hazelcastimdg-0 --previous
# ...
java.lang.OutOfMemoryError: Java heap space
# ...

As it turns out, then, the default map configuration wasn’t quite defensive enough. To improve on this, we could specify a lower value for max-idle-seconds and provide an eviction policy (excerpt from this file):

    backup-count: 0
    max-idle-seconds: 5
    in-memory-format: BINARY
    eviction:
      eviction-policy: LRU
      max-size-policy: FREE_HEAP_PERCENTAGE
      size: 85
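For some intuition on what eviction-policy: LRU buys us, the following is a minimal sketch of the least-recently-used idea (not Hazelcast's actual eviction implementation):

```go
package main

import (
	"container/list"
	"fmt"
)

// lruCache sketches the least-recently-used policy: when the cache is full,
// the entry that was touched longest ago gets evicted first.
type lruCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in 'order'
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{capacity: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Touch records an access to key, evicting the least recently used entry
// if the cache is at capacity; it returns the evicted key, if any.
func (c *lruCache) Touch(key string) (evicted string) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return ""
	}
	if c.order.Len() == c.capacity {
		oldest := c.order.Back()
		evicted = oldest.Value.(string)
		c.order.Remove(oldest)
		delete(c.items, evicted)
	}
	c.items[key] = c.order.PushFront(key)
	return evicted
}

func main() {
	c := newLRUCache(2)
	c.Touch("a")
	c.Touch("b")
	c.Touch("a")              // "a" becomes the most recently used entry
	fmt.Println(c.Touch("c")) // cache full -> evicts "b"
}
```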

Once you’ve updated the values.yaml file for Hazelcast, simply install a new revision for the Hazelcast release (the Hazelcast StatefulSet’s Pod template contains the Helm revision in an annotation, so with every revision update, a new Pod will be created):

# Install new Hazelcast revision
$ helm upgrade --install hazelcastwithmancenter ./hazelcastwithmancenter --namespace=hazelcastplatform
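For reference, the revision-annotation mechanism works roughly like this (the annotation key below is a made-up example, but {{ .Release.Revision }} is a Helm built-in that increments with every release revision):

```yaml
# Hypothetical excerpt from the Hazelcast StatefulSet's Pod template
spec:
  template:
    metadata:
      annotations:
        helmRevision: "{{ .Release.Revision }}"
```

Because the annotation value changes with each helm upgrade, the Pod template changes, prompting the StatefulSet controller to replace the Pod.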

As long as the Hazelcast Pod hasn’t become ready yet, the Hazeltest instances will complain they’ve got no Hazelcast cluster to connect to, but that’s fine – as soon as the Hazelcast Pod has achieved readiness, they will reconnect and get back to their load generation tasks, the results of which we can soon see in the Management Center:

After some time, the Hazelcast Pod still looks pretty good:

# Watch from earlier
$ watch kubectl -n hazelcastplatform get po --selector="app.kubernetes.io/name=hazelcastimdg"
hazelcastimdg-0   1/1     Running   0          11m

This concludes scenario 1, which delivered a cluster config suitable for the load generated by Hazeltest thus far. You can run the following commands to clean up all the resources created:

# Uninstall Hazeltest ("hazeltest-app1")
$ helm uninstall hazeltest-app1 --namespace=hazelcastplatform

# Uninstall Hazeltest ("hazeltest-app2")
$ helm uninstall hazeltest-app2 --namespace=hazelcastplatform

# Uninstall Hazelcast
$ helm uninstall hazelcastwithmancenter --namespace=hazelcastplatform

# ... or simply get rid of the namespace altogether
$ k delete ns hazelcastplatform


We’ve seen that even when you’re not a software engineer, it can be useful to put yourself in the shoes of one with regard to testing release-candidate fitness, because the paradigm thus assumed delivers a goal very worthy of pursuit: to build powerful automation for testing whatever release candidate you’re responsible for, in order to generate quick, reliable, and repeatable feedback on its fitness for production in terms of a given set of requirements.

My current project team and I are responsible for making sure our organization’s Hazelcast clusters are up to the challenge, no matter what the various business applications might throw at them. Here, the release candidate is a Helm chart that will create the Hazelcast cluster on Kubernetes, and while it’s pretty simple to verify the syntactic correctness of the chart, asserting the correctness of the Hazelcast cluster configuration it defines is an entirely different story.

The testing means my team and I have applied thus far have not delivered satisfying results, and Hazeltest is my humble attempt to provide a better, more useful means – one more closely aligned with the goal of automating the testing process or, at the very least, the process of generating different kinds of load. Hazeltest is open source and implemented in Go. Its basic idea is that internal, so-called runners create realistic load on the Hazelcast cluster under test, and each runner can be flexibly configured, so teams responsible for testing Hazelcast in their organizations can fine-tune the load-generation behavior to their needs. Thus, Hazeltest focuses on testing the Hazelcast cluster itself rather than the release candidate – some bundled set of configuration, possibly in a standardized format like a Helm chart – by which it was brought forward.

The practical examples in the preceding sections introduced you to the basics of operating Hazeltest by showcasing its most important configuration aspects and by using the application in the scope of a simple load-generation scenario. In this scenario, we configured Hazeltest in such a way that its LoadRunner revealed the default map config was not defensive enough to protect the Hazelcast cluster from misbehaving clients that write too much data too quickly.

We look forward to your feedback and comments about this blog post and Hazeltest. Don’t hesitate to share your experience with everyone in the Slack community or the GitHub repository. You can reach Nico Krieg on LinkedIn.


We are happy to announce a couple of new releases, along with some exciting Beta releases we have shipped recently as part of Hazelcast’s real-time data platform! In case you missed it, we’ve also announced the Hazelcast Viridian Serverless data platform! You can read the following posts to learn more about these exciting new releases:

As always, please don’t hesitate to join the Hazelcast Community on Slack if you have questions, or if you want to provide feedback. Also, feel free to check out our community page for other options.

Happy Hazelcasting! 🙂

Hazelcast Management Center 5.2-Beta-1

New Features

  • Added the following for SQL Browser:
    • separate workflows for batch and streaming queries.
    • editor autocomplete support.
    • connector wizard for Kafka connector.
    • connector wizard for File connector.
  • Introduced customizable tables, in terms of selected columns and order.
  • Introduced cluster connection diagnosis functionality.
  • Added statistics table for map indexes.
  • Added a chart for Tiered Storage usage monitoring.
  • Added support for Command-Line Client (CLC).


Enhancements

  • SQL Browser improvements:
    • now you can get MapBrowser information for a record in SQL query result.
    • schema panel now can be hidden/shown.
    • introduced a dialog explaining why a suggested mapping cannot be provided.
    • increased speed of schema tab population.
  • Updated RocksDB version to support Apple M1.
  • Increased log level for Jetty classes that log HTTP request information.
  • Introduced a property to configure SQL export cell size limit.
  • Improved logging for RocksDB filesystem permission errors.
  • Improved Time Travel usability.
  • Time Travel is now disabled when metrics persistence is disabled.
  • Proper error message when Console is disabled on member.
  • Removed excessive Spring framework warnings.
  • Management Center no longer starts RocksDB for metrics persistence on unsupported operating systems (currently z/os).
  • Added progress indicator for the cluster remove operation.
  • TTL and maximum idle in Map Browser are shown as “Unlimited” if set to 0 or more than 1000 years.


Hazelcast Command-Line Client (CLC) 5.2-Beta-1

New Features

  • Introduced Windows AMD64 executable and Windows installer.
  • Added the “version” command for easy troubleshooting with the CLC version, Hazelcast Go Client version, Go version, and the latest git commit hash. See the command documentation.


Enhancements

  • Fixed various issues that caused the CLC to get stuck, and enhanced the visual UI.
  • Added a new UI widget to indicate that a query is running.
  • Added color themes and the ability to override colors on SQL Browser. See the documentation for related settings.


Hazelcast JDBC Driver 5.2-Beta-1

New Features

  • Added support of basic database metadata.
  • Added support of query resubmission.


Hazelcast Platform 5.1.3

This release of IMDG improves connection handling.

Hazelcast Management Center 5.1.4

This release of Management Center fixes several Hazelcast Cloud integration issues.

Hazelcast Platform Operator 5.4

New Features

Added support for the following features and services:

  • WAN replication
  • Entry Processors
  • MapStore and MapLoader
  • Executor service


Enhancements

  • You can now load client-side classes, upload JAR files using ConfigMaps, and download JAR files into the Java classpath.
  • Introduced a reference to the Map resource so that the resource is updated with configuration changes for WAN replication.
  • When a Hazelcast CR is marked for deletion, now all dependent CRs are also deleted.
  • Added support for stopping hot backups gracefully.
  • Added configuration for whether to skip the checks for the Management Center lock file.
  • Added phone home data collection as a separate process.
  • Added phone home data collection for map CR, WAN replication, user code deployment, and backup and restore.


Hazelcast C++ Client 4.2.1

This release of the Hazelcast C++ client fixes various API issues, mainly regarding large messages and lock mutex.

Hazelcast Management Center 3.12.18

This release of Management Center fixes a Spring Framework Denial of Service (DoS) Data Binding Vulnerability.

Hazelcast IMDG 3.12.13

This release of IMDG improves connection handling.