Hazelcast Anti-Patterns

It’s said a wise person learns from their mistakes, and a wiser person learns from the mistakes of others. In this blog, we’ll look at some common mistakes made when deploying Hazelcast, so you can avoid them and be that wiser person. In other words, what are the “anti-patterns” that don’t ensure failure, but do make problems more likely?

Here we’ll look at the top 5. The good news is it’s not a top 10. The following 5 are the ones we see the most.

1. A 2-Node Cluster

Hazelcast is a distributed system. You run a “cluster”. A cluster is a collection of Hazelcast processes that provide a collective service for data storage, data processing and data streaming. If one process goes offline, for maintenance or due to a failure, the rest take up the strain and the collective service continues at reduced strength.

Clearly, a single Hazelcast node is not a cluster. If it goes offline you have no service.

Everyone grasps this concept!

A cluster needs more than a single Hazelcast node, but how many? Two? Three? Four? Five?

Two is the most obvious number once you realize one is no good. It’s the next size up from one.

What’s wrong with 2 nodes?

A 2-node cluster has no more capacity than a 1-node cluster.

This is because the default configuration for data safety is for 1 backup (and you don’t want to change this default unless you really know what you are doing).
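​
Purely as an illustration, here is a minimal sketch of where that backup count lives in programmatic configuration. The map name “orders” is made up, and since 1 synchronous backup is already the default you would not normally need to set it at all:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;

public class BackupCountExample {
    public static void main(String[] args) {
        Config config = new Config();

        // One synchronous backup per partition is already the default;
        // it is set explicitly here only to show where the knob lives.
        MapConfig mapConfig = new MapConfig("orders");
        mapConfig.setBackupCount(1);
        config.addMapConfig(mapConfig);

        Hazelcast.newHazelcastInstance(config);
    }
}
```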

So if you have 1,000,000 data records and a 1-node cluster, that one node must obviously hold all of them. Although you would want backups for data safety, with a 1-node cluster there is nowhere (read: nowhere else) to put them.

If a 2nd node is added, making a 2-node cluster, there is now somewhere to put those backups, and they fill the second node.

To recap, with 1 node there are 1,000,000 primary copies of the data. With 2 nodes there are 1,000,000 primary copies and 1,000,000 backup copies. With the default backup configuration, each node in the 2-node cluster is just as full as the 1-node cluster was, so the usable capacity hasn’t grown.

What’s the answer here?

Three at least.

If you have a 3-node cluster, you have more capacity than either a 2-node cluster or a 1-node cluster. You actually have somewhere else to store data, so capacity increases. For commercial Hazelcast customers, 3 is the minimum supported number. We’ve seen attempts to trim costs by going smaller cause problems time and time again.

Once more to reiterate

You gain resilience but not capacity. A 2-node cluster is not worse than a 1-node cluster, but, to the surprise of many, it is not much better either.

Imagine each node has enough memory to store 2 data records. This would be absurdly small, but it’s just an illustration. In a 1-node cluster, that one node can store 2 data records. These can only be primary copies, since there is nowhere else for backups. So that 1-node cluster can store records “A” and “B”.

In a 2-node cluster, backups are possible. One node will have the primary copy of “A” and the backup copy of “B”; the other node will have the reverse. Both nodes are full, yet the cluster still stores only 2 data records. You have resilience if either node goes offline, which is a good thing to have, but no extra capacity.
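​
You can see this for yourself from each member’s local map statistics, which report both the entries a member owns and the backup entries it holds. A minimal sketch, assuming a recent Hazelcast version (package names vary slightly between versions) and a made-up map name “demo”; both members are started in one JVM purely to keep the demo self-contained:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.map.LocalMapStats;

public class BackupCapacityDemo {
    public static void main(String[] args) {
        // Two members form a 2-node cluster (in one JVM here, purely for the demo).
        HazelcastInstance member1 = Hazelcast.newHazelcastInstance();
        HazelcastInstance member2 = Hazelcast.newHazelcastInstance();

        IMap<String, String> map = member1.getMap("demo");
        map.put("A", "record A");
        map.put("B", "record B");

        // Each member reports the entries it owns and the backups it holds for the other,
        // so backup copies consume memory just like primary copies do.
        for (HazelcastInstance member : new HazelcastInstance[]{member1, member2}) {
            LocalMapStats stats = member.getMap("demo").getLocalMapStats();
            System.out.println(member.getName()
                    + " owned=" + stats.getOwnedEntryCount()
                    + " backup=" + stats.getBackupEntryCount());
        }

        member1.shutdown();
        member2.shutdown();
    }
}
```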

2. A 1-Host Cluster

Another common situation is to run all the Hazelcast nodes on the same host. Hosts cost money, and once we’ve grasped that we need several Hazelcast nodes to form a cluster, why not run them all on the one host to keep costs under control?

What’s wrong with a 1-host cluster?

Two things: performance and resilience.

Performance is worse when running 2 Hazelcast nodes on the same host compared to running 1 larger Hazelcast node. Both nodes are processes running on the same operating system; if they have to compete for resources, you’ll lose performance to context switching between the two processes and to contention on the network card.

Resilience isn’t much improved either. It’s true that one process can fail and the other stays running. However, a process failure from something like a Java HotSpot error is pretty rare, and remember that the other process will be using the same Java runtime, so whatever fails for one could fail for the other.

More relevant, if the machine itself has a fault, such as a blip in the power, it’ll take both processes out at once.

What’s the answer here?

The best practice is to run each Hazelcast node on its own machine. All machines should have the same specification: CPU, RAM, and so on.

At least for production, dedicated hardware is a worthwhile spend.
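​
If you use a static member list for discovery, the configuration itself documents the intent: each address is a different machine. A minimal sketch using the TCP/IP join, with made-up IP addresses:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.core.Hazelcast;

public class OneNodePerHostExample {
    public static void main(String[] args) {
        Config config = new Config();

        // TCP/IP discovery with one member per physical host (addresses are made up).
        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false);
        join.getTcpIpConfig().setEnabled(true)
                .addMember("10.0.0.11")
                .addMember("10.0.0.12")
                .addMember("10.0.0.13");

        Hazelcast.newHazelcastInstance(config);
    }
}
```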

3. A Spanning Cluster

Once we realize that hardware can fail, we want to mitigate the risk. Let’s run half our Hazelcast cluster in one data center and the other half in another. Even if one data center catches fire, our service might survive.

Here we are of course assuming our cluster has sufficient capacity that it can run with half of it missing and still hit all SLAs. It’s good if it can, but it’s an abundance of resources that someone may someday look to cut back upon. However, that’s a problem for a different day.

What’s wrong with a spanning cluster?

Unfortunately, we have to cope with the speed of light.

Networking is usually described in terms of speed, but that figure is really throughput. A 10 Gb/s network can send ten times as much data in the same time as a 1 Gb/s network. What it can’t do is get any individual piece of data to the receiver ten times sooner: the time for data to reach the receiver is dictated by the distance.

Throughput is important, but latency and reliability matter too.

For Hazelcast, if a data item is updated and there’s a backup copy, the backup copy obviously needs to be updated too. For the update to be considered complete, the change has to be applied on the node holding the primary copy, sent to the node holding the backup copy, and applied there as well.

If sending from one Hazelcast node to another is permanently slow because of latency, then that synchronous replication mechanism won’t be viable.

Constant high latency means the data centers are far apart. You can’t send the data quickly, even if it’s a small amount. This is a permanent problem, and fairly easy to spot.

Latency can also be a temporary problem. The connection between the data centers will likely be shared, and may occasionally be slow due to other traffic. This is worse, because the link may be fine when you measure it and the extra shared load may only arrive in the future. A network that is 99% reliable is as much use here as a network that is not reliable at all.

ISO 22301

Typically you know where the data centers are, so you know whether the distance is excessive or not. Excessive might mean 50 miles or 80 kilometers.

So you’re good? Well, maybe.

The problem here is that these are IT assumptions, and there may be external dictates.

The board of directors may dictate that a data center has to move to ensure business continuity, along the lines of ISO 22301. For business continuity, you want the data centers far apart. If this happens, the idea of a spanning cluster is broken. It seems reasonable to assume that at some point business continuity will be raised.

Also, network performance can go down. The link is just a cost, and someone may decide a cheaper shared network between the data centers is a worthwhile saving.

What’s the answer here?

This isn’t a Hazelcast problem; no clustered system can support synchronous replication across a long distance.

Hazelcast’s solution here is to run two Hazelcast clusters, one in each location.

These can be connected for asynchronous data sharing. Changes on one cluster are sent to the other cluster as and when the network permits. This send-when-possible mechanism is tolerant of slow or temporarily broken connections.
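​
In Hazelcast this asynchronous sharing is WAN Replication, an Enterprise feature. The sketch below is only an outline of the idea; the cluster names, addresses and map name are made up, and the exact configuration class and method names should be checked against the Hazelcast version you run:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.WanBatchPublisherConfig;
import com.hazelcast.config.WanReplicationConfig;
import com.hazelcast.config.WanReplicationRef;
import com.hazelcast.core.Hazelcast;

public class TwoClustersExample {
    public static void main(String[] args) {
        // Cluster "london" replicates map changes asynchronously to cluster "tokyo".
        Config config = new Config();
        config.setClusterName("london");

        WanBatchPublisherConfig publisher = new WanBatchPublisherConfig();
        publisher.setClusterName("tokyo");                              // target cluster (made up)
        publisher.setTargetEndpoints("10.20.0.1:5701,10.20.0.2:5701");  // made-up addresses

        WanReplicationConfig wan = new WanReplicationConfig();
        wan.setName("london-to-tokyo");
        wan.addBatchReplicationPublisherConfig(publisher);
        config.addWanReplicationConfig(wan);

        // Attach the WAN replication scheme to a map (map name is just an example).
        WanReplicationRef ref = new WanReplicationRef();
        ref.setName("london-to-tokyo");
        config.getMapConfig("orders").setWanReplicationRef(ref);

        Hazelcast.newHazelcastInstance(config);
    }
}
```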

Niagara Falls

Waterfalls are a good way to think of the difference between throughput and latency.

For Niagara Falls, typically 2,000,000 liters of water (600,000 gallons) go over the edge every second. This is the throughput.

The drop from top to bottom is 50 meters (160 feet). Water takes 3 seconds to fall. This is latency.

If it rains, more water goes over. Increased throughput. It takes the same time to fall, unchanged latency.

This is how it is with networking. A 10 Gb/s network can send 10 times as much data as a 1 Gb/s network, but each data packet won’t be delivered any faster; that depends on the distance.

4. Sizing for Failure

In the olden days of mainframes, the goal was to stop failure. You would have uninterruptible power supplies, and all sorts, to make sure that one resource never failed.

On a distributed system such as Hazelcast, you use multiple hosts.

Decent hardware might have a mean time between failures of 1,000 days; the host machine is expected to fail once every 1,000 days. Let’s be generous and call it once every three years.

If you run a 9-node cluster, that’s 9 hosts (following the advice above), so on average 3 host failures a year.

In other words, a production system with 9 hosts will face a hardware fault around 9 times in 3 years.

So failure has to be something you expect and embrace, rather than something to circumvent. The more machines you have, the more often one of them will let you down.

What’s wrong with sizing?

Going for more expensive machines might reduce the failure frequency, say to once every 6 years instead of every 3. But if your cluster doubles in size, thanks to the increase in business you would welcome, you’re back to 3 failures a year.

What’s wrong with sizing is that it rarely makes an allowance for failure; failure is rarely considered, or is quickly dismissed.

You have “x” terabytes of data, which needs “y” machines, so you buy “y” machines. Then you’re stuck.

What’s the answer here?

The answer is easy. Figure out how many nodes you need to hit your SLA, then normally run 1 or 2 more than that.

You should expect a host to fail; it’s only hardware, after all. If one fails, and each Hazelcast node is on its own host, you’ve only lost one Hazelcast node. If your cluster was one node bigger than needed to hit your SLAs, and one host is lost, you’re now at exactly the level needed. SLAs are maintained; your boss is happy.

Hosts can fail, and at some point you will be bouncing nodes for maintenance. If you are particularly unlucky, a host will die at the same moment you’re bouncing a node, so your cluster is missing 2 nodes. If you were originally running with 2 nodes more than needed, you now have exactly the number needed to hit your SLAs; happy days. If you were originally running with 1 node more than needed, you’re now 1 below the number needed; time to polish that resume.
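​
As a back-of-the-envelope sketch of that sizing arithmetic (all the figures below are invented):

```java
public class SizingSketch {
    public static void main(String[] args) {
        // Invented figures purely for illustration.
        double dataGb = 600;           // primary data to hold
        int backupCopies = 1;          // default backup count
        double usableGbPerNode = 100;  // usable memory per node after headroom

        // Nodes needed just to hold primaries plus backups at the SLA level.
        int nodesForSla = (int) Math.ceil(dataGb * (1 + backupCopies) / usableGbPerNode);

        // Add 1 or 2 spares so a host failure (or maintenance bounce) keeps you at SLA.
        int spares = 2;
        System.out.println("Nodes to run: " + (nodesForSla + spares)); // 12 + 2 = 14
    }
}
```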

5. Not Planning for Security

Security is obviously important, but it’s a non-functional requirement. Non-functional requirements are normally dealt with towards the end of development.

After all, there’s no point in spending development time to secure access to a system that doesn’t work.

What’s wrong with adding security later?

Rework!

If you do something like “map.get(K)”, it ordinarily has a binary outcome. You try to retrieve a data record and either succeed (it exists!) or fail (it doesn’t exist).

When security is added, there is now a third outcome: “ACCESS DENIED”.

All code that has decision logic assuming two outcomes is now wrong.
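​
Here is a minimal sketch of client code that allows for all three outcomes. The map name and key are made up, and whether a denied operation surfaces as java.security.AccessControlException depends on your Hazelcast version and security configuration, so treat the exception type as an assumption to verify:

```java
import java.security.AccessControlException;

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class ThreeOutcomeGet {
    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, String> orders = client.getMap("orders"); // map name is just an example

        try {
            String value = orders.get("order-123");
            if (value != null) {
                System.out.println("Found: " + value);                  // outcome 1: it exists
            } else {
                System.out.println("Not found");                        // outcome 2: it doesn't exist
            }
        } catch (AccessControlException e) {
            System.out.println("Access denied: " + e.getMessage());     // outcome 3: security rejection
        } finally {
            client.shutdown();
        }
    }
}
```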

What’s the answer here?

This is not about when security is implemented, but when it is designed. It must be considered from the outset.

If your code copes with security rejections, it’ll still work when there is no security (e.g. in DEV). So it’s no worse off.

In most projects, security is added in the later lifecycle stages, such as in UAT prior to final testing. But if your code has to be redone, the total time for the project to complete will be longer.

Summary

So, there we are… 5 common anti-patterns, with an explanation of why people normally fall into them and why they’re bad choices.