Making Your Data Safe from Loss

Hazelcast is primarily a memory store, and memory is widely understood to be volatile. If the power goes off, data in memory is lost. Disk is viewed as the safe option; the data will still be there after a power cut. We shall look here at what “data safety” means, and why both statements are flawed.

As we’ll see below, disk is not as safe as you think, and memory can be safer than expected. Once you understand this, you should be confident to design a safe system, with knowledge of the errors that can be recovered from and those that can’t.

Data safety

The first thing to define is “data safety”, meaning data that won’t be accidentally lost. It is retained until you explicitly delete it, or until system housekeeping is allowed to delete it.

This doesn’t need to apply to all data.

  • If you “lose” a bank statement, you can regenerate it easily if you have the transactions.
  • If you “lose” the current price of Bitcoin, it changes so rapidly there will be a new current price from the outside world almost immediately.
  • If you “lose” a user’s session, they can just log in again.
  • If you “lose” reference data, you can reload it from the source.

The above are cases where you could tolerate losing data but might prefer not to. In general, most data doesn’t accommodate such escape clauses. You want to keep it, as recreating it may not be possible, or may take too long. Data safety for data in memory, data on disk, data on tape, etc. means data that is highly unlikely to be accidentally lost.

“Highly unlikely” is good enough

Imagine an offline backup in a fireproof safe. This data backup is reasonably protected, but it’s not perfect. For a start, a fireproof safe is not fireproof. It is “fireproof for a while” – perhaps an hour. If the fire isn’t extinguished within the hour, the contents are damaged. It would be reasonable to assume the emergency services can get to your datacenter within an hour, but perhaps they run into delays. If the fire is large and affects residential buildings, they’ll get priority over a datacenter.

Our data is safe from the unlikely small fire, but not from the very unlikely big fire. If we decide this isn’t good enough, we put another offline backup in another fireproof safe somewhere else. Now our data is safe from the very unlikely big fire in one location and simultaneously another very unlikely big fire in the other location. Duplication lowers the odds of loss, but can never eliminate all combinations, however unlikely.

One copy is good. Two is better. Three is better still. Four is better than that. But we can never get to 100%. If our statistical analysis is correct, a series of random, unconnected events would be so unlikely that we can consider the chance of total loss to be zero; we will always have at least one copy to fall back upon. This doesn’t cover planned attacks, or the possibility that our analysis of the odds is wrong!
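To put rough numbers on this: as a sketch, assuming each copy is lost independently with probability p over some period, the chance of losing all n copies in that period is

    P(total loss) = p^n

With p = 0.01 and n = 3 copies, that is 0.000001, one in a million. Independence is the key assumption here, and it is exactly what shared locations, shared power supplies, and planned attacks undermine.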

Fire is just an example. Other disasters are available.

Don’t forget the cost

In the section above, three copies are better than two from a data safety perspective. You are less likely to lose three than two. However, in other ways three copies are worse than two. Firstly, with three copies there is 50% more infrastructure, so financial costs may be 50% higher. Secondly, with three copies there is a higher performance cost for writing than with two. Writing takes longer, though that may not be significant for applications with a low proportion of writes compared to reads.
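For example, holding 1 TB of data as two copies needs 2 TB of raw capacity, while three copies need 3 TB, and every write must be applied three times rather than twice.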

Improving data safety makes other things worse.

Data safety recovery

Based on the above, data safety is simple. Always keep enough copies of your data that you’re not worried about losing them all. Pay attention to “always keep”.

Imagine you decided 2 copies provide data safety. Some catastrophic event may cause a copy to be lost. Data hasn’t been lost; you have another copy. But data safety has been lost: you now have fewer copies remaining (1) than you determined necessary (2). So to recover data safety, you need to duplicate the remaining copy to a third location. If it was a fire that destroyed the first location, that location can’t be used. Having two copies in the second location provides no protection from a fire there.

Disk safety

Disks are fairly permanent stores, based on magnetic technology, optical technology or similar. Old data may degrade and become unreadable after some years, but for our purposes we can treat disk as permanent. What we write to the disk we can read back, even if the power has gone off in between.

The fallacy of a disk write

Unfortunately, when our software writes to a disk file, the write often doesn’t go directly to the disk. Instead, it goes to the disk controller, an intermediate module that collates writes in a buffer and flushes them to the disk in a block to improve performance.

Our software may make 5 operating system calls to write lines to a file. But behind the scenes the disk controller may have written the first 4 in a block, and still be buffering the last, to write momentarily. If there is a crash at that point, only the 4 writes that have actually gone to disk are safe; the last is lost.

Naturally, we can turn this buffering off. Safety is improved, but still not perfect if the disk catches fire. And now performance is much worse. All we have done is swap one problem for another, which is nothing to be proud of.
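As a sketch of what “turning the buffering off” looks like in Java, each write can be forced through the operating system’s buffers with FileChannel.force(). The file name below is illustrative, and note that the disk controller’s own cache may still sit beyond the operating system’s reach:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ForcedWrite {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(
                    Paths.get("journal.log"),
                    StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE,
                    StandardOpenOption.APPEND)) {
                for (int i = 1; i <= 5; i++) {
                    channel.write(ByteBuffer.wrap(
                            ("line " + i + "\n").getBytes(StandardCharsets.UTF_8)));
                    // Push this write through the OS buffers to the device.
                    // Each force() waits on the hardware, which is why
                    // writes become much slower.
                    channel.force(true);
                }
            }
        }
    }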

How many copies?

Single and double copies are common configurations.

1 copy

“Local” disks are often used in the 1 copy configuration. There is 1 disk, and 1 copy of each data record. That disk may be physically inside the host machine. Its adjacency helps with data transfer time; it is a short cable. If a process crashes while writing 5 lines to a file on that disk, perhaps 4 or perhaps 5 lines were actually written. When we examine the file afterwards, we might see 4 lines or we might see 5. If the host machine catches fire, the disk may be destroyed. Our only copy of the data is lost, the entire file.

2 copies

Another configuration is a disk array, such as RAID. Typically 2 disks act as one. At least one will be distant, on a longer cable or across a network. When we write a line of 80 bytes, the same line is written to both disks. The file size might report as 80 bytes, but it occupies 160 bytes of storage. If a process crashes while writing 5 lines to a file, perhaps 4 lines had been written to one disk and 5 to the other.

If one disk catches fire, we still have the other; we still have one copy of that file.

Recovery from a disk crash

In the 2 copy scenario, a different number of lines may have been written to the 2 disks that pretend to be 1. One has had 5 lines written, the other has had 4. It’s pretty easy to reconcile: add the missing line and both copies are aligned.
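As an illustration only (a real disk array reconciles at the block level, not by reading lines back), here is a sketch of that reconciliation for two line-oriented replica files, assuming the shorter file is a strict prefix of the longer; the file names are hypothetical:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.List;

    public class ReplicaReconcile {
        public static void main(String[] args) throws IOException {
            Path a = Paths.get("replica-a.log");
            Path b = Paths.get("replica-b.log");
            List<String> linesA = Files.readAllLines(a);
            List<String> linesB = Files.readAllLines(b);

            // The replica with more lines is the more complete copy;
            // append its extra tail onto the shorter replica.
            if (linesA.size() > linesB.size()) {
                Files.write(b, linesA.subList(linesB.size(), linesA.size()),
                        StandardOpenOption.APPEND);
            } else if (linesB.size() > linesA.size()) {
                Files.write(a, linesB.subList(linesA.size(), linesB.size()),
                        StandardOpenOption.APPEND);
            }
        }
    }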

Disk safety recap

A process that crashes while writing to disk may not have saved all content to all disk copies. This may be recoverable depending on the number of disk copies. A disk that crashes may lose all data on it. This may be recoverable depending on the number of disk copies. Disks can therefore lose data. One alone does not provide data safety.

Memory safety

If you’ve followed the above, you will have realized that memory safety and disk safety are essentially the same problem.

Memory and disk fail in different ways, but the solution (multiple copies!) is the same.

Differences in failures

Memory and disk are exposed to overlapping failure scenarios. Fire would affect both. A power loss has different degrees of severity: for disk, the most recent content is lost; for memory, everything is lost. Memory is impacted more, but disk is still impacted, so the problem needs to be solved for both.

Memory safety recap

Memory is made safe by duplication. Data in the memory of one process is duplicated in the memory of other processes. The more independent these processes are, the higher the safety. So each host machine might be in a different location, use a different power supply, etc. For Hazelcast, it’s as simple as specifying a backup-count parameter. This defaults to 1, so all data has one backup.
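A minimal sketch of raising the backup count for one map through the Java configuration API; the map name “important” is illustrative:

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class BackupCountExample {
        public static void main(String[] args) {
            Config config = new Config();
            // The equivalent of backup-count in XML configuration.
            // 2 backups means each entry lives in the memory of 3 members.
            config.getMapConfig("important").setBackupCount(2);

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            hz.getMap("important").put("key", "value");
        }
    }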

Hybrid solutions

It’s worth noting at this point that safety is achieved with backups. A memory copy may have a disk backup. If these are in different places, safety is higher.
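In Hazelcast, one way to give a memory copy a disk backup is a MapStore. A configuration sketch, assuming a hypothetical MapStore implementation com.example.FileMapStore that persists entries to disk as they change:

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapStoreConfig;

    public class HybridExample {
        public static void main(String[] args) {
            MapStoreConfig storeConfig = new MapStoreConfig()
                    .setEnabled(true)
                    // Hypothetical class implementing com.hazelcast.map.MapStore.
                    .setClassName("com.example.FileMapStore");

            Config config = new Config();
            config.getMapConfig("important").setMapStoreConfig(storeConfig);
        }
    }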

Summary

Duplication of data in separate locations provides data safety. If you anticipate that X copies can be lost at once, you need to have (X + 1) copies in (2 * X + 1) locations. If you anticipate that 2 copies can be lost at once, you need to have 3 copies in 5 locations. If 2 locations are lost, the 1 remaining copy can be used to recreate the lost 2 copies in the 2 remaining locations.

The above is true of disk, the above is true of memory. Both are intrinsically unsafe, though this is less obvious for disk. Duplication makes either safer, until we are safe enough.

Increasing the duplication level helps with data safety and hurts in other ways.