What CAP Theorem Means to a Business Leader

Dan Ortega
February 28, 2019

The data we continuously generate and use operates on an incredibly vast scale (think of Google, Amazon, Facebook; that level of data). Because of the breadth and depth of infrastructure required to stream incoming data and execute against it (and to avoid single points of failure), the ingestion and processing of data is distributed across systems in nodes, which group together to form clusters. The clusters can form and reform (or partition) the nodes that comprise them, based on service demand, network latency, system availability, etc. This forming and reforming normally occurs automatically, and the intent is to provide a continuous and smooth experience for the end user, with minimal disruption.

Now, this is technology we’re talking about, and things can and will go wrong. Network connections can drop, nodes can fail, storage can be corrupted; all sorts of variables are at play. The design of distributed information systems, particularly at this scale, has to factor in that there is going to be instability in the system.

One reasonably common scenario is a node or nodes dropping out of a cluster, which can be caused by, for example, a failure in the network devices. When this happens, the cluster automatically reforms (or partitions) itself into smaller clusters, each made up of the nodes remaining on its side of the network split, or network partition as it is commonly known. The problem is that the nodes on each side of the split think they’re the only ones left (and they’re not). You then have two or more clusters serving the same data to users, which leads to data divergence during updates and potentially stale data being read.

From a business perspective, this comes down to what type of response the end user should receive when the system is under duress. Is accurate, up-to-date information more critical than available information?
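To make that split concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not how any real cluster is implemented: the names `MiniCluster` and `split`, and the `inventory` figure, are hypothetical and exist only for illustration. It shows two sides of a partition, each starting from the same data, each accepting updates on its own.

```python
class MiniCluster:
    """Hypothetical stand-in for one side of a network partition."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)  # each side keeps its own copy of the data

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data[key]


def split(data):
    """Simulate a network partition: two sides, each with a full copy."""
    return MiniCluster("side_a", data), MiniCluster("side_b", data)


shared = {"inventory": 10}
side_a, side_b = split(shared)

# Each side believes it is the only survivor and keeps accepting updates.
side_a.write("inventory", 9)  # one sale recorded on side A
side_b.write("inventory", 7)  # three sales recorded on side B

# The two sides now disagree about the same piece of data.
diverged = side_a.read("inventory") != side_b.read("inventory")
print(side_a.read("inventory"), side_b.read("inventory"), diverged)
```

Whether a user should be shown either of those numbers, or nothing at all until the two sides agree again, is the business question posed above.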
If I’m checking my bank balance, accuracy (or consistency) would be pretty significant; I’d prefer “no information available” to inaccurate information. On the other hand, if I’m checking Twitter feeds, I’d be more interested in availability because accuracy isn’t quite so critical (the numbers move around a lot anyway due to the design of distributed systems and how requests hit servers). To a great extent, businesses need to decide what defines the quality of their data and whether consistency or availability is more important, and that, of course, depends on your business.

There are ways to design around this, but like everything else in life, there are tradeoffs. How a distributed information system is designed affects your customers’ and end users’ experience. When failures happen within your network (which triggers a Partition), you have to choose between Consistency (is it accurate?) and Availability (is it available?). These three items (Consistency, Availability, and Partition tolerance, called Partitioning here for short) form what is referred to as the CAP Theorem, and it’s one of those developer-level details that touches everything we do, pretty much all the time. When we are talking about a distributed system under duress, Partitioning becomes a constant, and your systems architect is faced with the choice of AP (available when partitioned) or CP (consistent when partitioned).

As the British philosopher Sir Michael Philip Jagger once said, “You can’t always get what you want,” and this applies to how CAP Theorem has always worked. Designers are always forced to compromise in one direction, and there is always a tradeoff.

Until now. Hazelcast, which has had a long-standing presence in the in-memory market, has recently announced the availability of a solution that supports both AP and CP within the same system (an industry first).
This includes a Consistency (CP) Subsystem for sensitive concurrency structures, which favors consistency over availability, as well as a large set of data storage structures that prefer availability over consistency (AP). Having both subsystems available in-memory means customers can now fine-tune the in-memory data grid to suit the application’s requirements. This provides a much higher level of flexibility in terms of deployment (and is always a better option than either/or), which leads to more efficient resource utilization, lower operating costs, less disruption to end users and customers, etc.

To a non-technical user, the impact may be nominal. To the people who keep these systems running optimally, this is a huge deal. Bottom line? If this is a big deal to your technologists, it’s a big deal to you.
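For the technologists in the room, the CP and AP choices described above can be sketched in a few lines of Python. This is an illustration only and does not use or represent Hazelcast’s API: the `Replica` class, the `Unavailable` exception, the `check` helper, and the sample values are all hypothetical. A CP replica refuses to answer while it is cut off, and an AP replica answers with whatever it last knew.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class Replica:
    """Hypothetical node holding one value, configured as CP or AP."""

    def __init__(self, mode, value):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = value
        self.partitioned = False  # True while cut off from the other nodes

    def read(self):
        # CP: better to give no answer than a possibly wrong one.
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value")
        # AP (or a healthy network): answer, even if the value may be stale.
        return self.value


def check(replica):
    """Return what an end user would see for this replica."""
    try:
        return replica.read()
    except Unavailable:
        return "no information available"


bank_balance = Replica("CP", 100)  # accuracy matters most
feed_counter = Replica("AP", 42)   # availability matters most

# A network failure cuts both replicas off from the rest of the cluster.
bank_balance.partitioned = True
feed_counter.partitioned = True

print(check(bank_balance))  # the CP replica declines to answer
print(check(feed_counter))  # the AP replica serves its last known value
```

The design choice to notice is that the mode is set per data structure, not per system, which mirrors the idea of running consistency-first and availability-first structures side by side.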