Experiences with Change Data Capture
In a previous blog post, Designing an Evergreen Cache with Change Data Capture, Nicolas describes “one-way” change data capture (CDC): the database copy changes, and CDC pushes each change to Hazelcast to keep the two copies of the data aligned.
Let’s take this a stage further, with “two-way” CDC. Now data in Hazelcast can change as well, and we need to push those changes back to the database. In terms of coding this seems easy, but it brings some significant problems.
Let’s look at what they are, and how they can be solved.
The Main Problems
Based on integrating Hazelcast with Debezium CDC for Cassandra, MongoDB, and MySQL, we’ve found some main problem categories and possible solutions.
By CDC, we mean that a storage system such as a database somehow makes us aware of changes that have been applied to its data records. These changes can then be applied to another copy of the same data elsewhere, to keep the two copies matching.
CDC is essentially operating off a log of changes that have been applied to System A and passing the same to System B.
It’s a one-way process. System A is not aware of System B. “Two-way” CDC really just means two unconnected one-way CDC setups, to pass changes in both directions.
It is an asynchronous process. CDC is aware of the change to System A after it has occurred and then updates System B. Until this occurs the data deviates between System A and System B.
It is a subsequent process. CDC is aware of the change to System A after it has occurred, and so cannot abort the change.
CDC in Hazelcast
Hazelcast maps can have a MapStore. Changes made to Hazelcast data are passed, depending on configuration, either immediately or on a deferred basis, to a store(K, V) method that duplicates the change to an external storage system.
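As a sketch of this write-through idea, here is plain Java with an in-memory map standing in for the real database; actual code would implement Hazelcast’s MapStore interface and use a real connection, so treat the class and method names as illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of MapStore-style write-through: Hazelcast would invoke store(...)
// for each map change; here a plain HashMap stands in for the real database.
class WriteThroughStoreSketch {
    private final Map<String, String> database = new HashMap<>();

    // Mirrors the shape of the MapStore.store(K, V) callback.
    public void store(String key, String value) {
        database.put(key, value); // duplicate the Hazelcast change externally
    }

    // Mirrors the shape of the MapStore.delete(K) callback.
    public void delete(String key) {
        database.remove(key);
    }

    public String loadFromDatabase(String key) {
        return database.get(key);
    }
}
```

With write-through configuration the callback runs as part of the map operation; with write-behind it runs later, on a deferred basis.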
Nicolas’s post covers Debezium, which supports many of the common database technologies. Other CDC systems are similar, but the devil is in the detail.
For MySQL, Hazelcast uses Debezium libraries to read directly from the database remotely.
For Cassandra and MongoDB, an intermediate process is used. A Debezium process reads the CDC log and writes the records to a Kafka topic. Hazelcast has a Kafka connector, and so can read the changes in from the Kafka topic.
Data Provenance
Previously, when data changed, a few facets of the change were of interest.
Obviously “what” changed is important, plus perhaps “when” and by “whom”.
Now, we also need to consider “where”.
Harking back to Nicolas’ system, changes were made to MySQL and reflected in Hazelcast. So the provenance (“MySQL”) is implied, as that’s the only place where changes are made.
Now we allow changes to be made in Hazelcast as well, so we cannot make that deduction.
If we find some bad data in Hazelcast, we now need to consider whether the corruption occurred due to application logic manipulating the data in Hazelcast or application logic manipulating the data in MySQL.
If we knew which system was responsible for data corruption, the debugging effort would be halved.
The Meta-Data Solution
The idea here is simple. The CDC record may include some meta-data that allows us to reliably deduce where a change originated.
For example, for an SQL database, the CDC record might capture the JDBC driver’s version for the originator of the change.
If the driver version string used by Hazelcast’s connection to the database differed (or was made to differ) from the driver version string used by other SQL clients, we could deduce the provenance.
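As a minimal sketch of that deduction, assuming the CDC record carried a driver-version string and that we appended a distinguishing marker to the version used by Hazelcast’s connection (the “HZ-MARKER” suffix below is hypothetical, not anything a real driver provides):

```java
// Sketch: deduce provenance from a driver-version string captured in
// CDC meta-data. "HZ-MARKER" is an assumed suffix we would configure
// on Hazelcast's own database connection to make it distinguishable.
class MetaDataProvenance {
    static final String HAZELCAST_MARKER = "HZ-MARKER";

    // True if the change came through Hazelcast's connection.
    static boolean originatedFromHazelcast(String driverVersion) {
        return driverVersion != null
                && driverVersion.endsWith(HAZELCAST_MARKER);
    }
}
```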
Why the “meta-data solution” is a poor solution
Typically, the CDC log will not capture this sort of meta-data. Part of the objective of the CDC log is to be a source to rebuild a corrupted database. It may not contain meta-data unnecessary for this purpose, to help keep the CDC logs compact.
Furthermore, the CDC log format can be specific to the database version. It may change from release to release, and the vendor may remove meta-data fields that we happen to rely upon.
The Data Solution
Again, the idea is simple. We add fields to each data record to enable us to determine the source of the change.
JPA-style SQL systems often use these four auditing fields:
String createdBy; Long createdDate; String lastModifiedBy; Long lastModifiedDate;
This is an immutable “who” & “when” pair for data insert, and a mutable pair for data update.
If these fields exist, or we add them, the lastModifiedBy field should be enough to deduce “where” from “who” did the update.
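A minimal sketch of this, assuming a naming convention on “who” (the “hazelcast/” prefix below is an invented convention, not a standard, and the two-way split assumes only Hazelcast and MySQL write to this data):

```java
// Sketch: the four JPA-style audit fields, with provenance ("where")
// deduced from an assumed naming convention on "who".
record AuditedRecord(String createdBy, long createdDate,
                     String lastModifiedBy, long lastModifiedDate) {

    // "hazelcast/" prefix is a hypothetical convention; anything else
    // is assumed to have come via a direct database client.
    String lastModifiedWhere() {
        return lastModifiedBy.startsWith("hazelcast/") ? "Hazelcast" : "MySQL";
    }
}
```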
Why the “data solution” is a poor solution
If these fields don’t exist, adding them will be a one-off pain.
More importantly, these fields are unreliable. However good we are at setting them, someday they could be set wrong. Not by our code, never, but by someone else’s code.
Slow Eventual Consistency
CDC follows an eventual consistency model. One system is updated and then the other.
Often it can happen so fast as to be mistaken for immediate consistency.
Either way, we’d expect the change to be propagated to the downstream system within a second or so. Maybe a bit longer for surges of updates.
Generally, we would be right, but sometimes it can be hours, days, or longer.
Hours or Days for Alignment?
We have seen this.
CDC frequently features some sort of buffering. In some cases only full buffers are output. It would be better if incomplete buffers were flushed after a time period, but this is not always the case.
If there is no time threshold, you can’t predict how long it will take for a data change to be available for CDC.
With 1,000 changes a day, and a modest buffer that can hold 365,000 changes, you might wait a year.
For data that changes infrequently, an event-driven architecture would be normally viewed as better than polling.
However, if there is no time guarantee for when events will appear, this view reverses.
Mitigation is a poor solution
You can reduce the buffer size, but the minimum size may hold many change records. Some data doesn’t change frequently so you could still be waiting a long time.
Cheating is a poor solution
If you really must use CDC here, one way is to change records in multiple inconsequential ways. If you know how many changes fill a buffer, you can ensure that many changes are made, to ensure the buffer is flushed.
This is truly a terrible solution, but it is a solution.
Instead of saving a change to a data record once, you might save it 1,000 times. If the lastModifiedDate field was the only thing changed on the last 999 saves, this would be mostly inconsequential.
So for the earlier example of data that really changes 1,000 times a day, it is now actually updated 1,000,000 times a day. If the buffer can still hold 365,000 changes, we now might wait around eight hours rather than a year.
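The arithmetic behind both figures is just buffer capacity divided by change rate:

```java
// Sketch: worst-case wait for a size-only CDC buffer to fill and flush,
// assuming a steady change rate and no time-based flush threshold.
class BufferWait {
    static double daysToFill(long bufferCapacity, long changesPerDay) {
        return (double) bufferCapacity / changesPerDay;
    }
}
```

At 1,000 changes a day a 365,000-record buffer takes 365 days to fill; at 1,000,000 a day it takes 0.365 days, roughly 8.8 hours.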
Just to recap, this is a terrible solution. Like any bad solution, it just makes more problems appear elsewhere.
Database triggers will now fire multiple times for no reason. Any storage used for the CDC stream needs to be 1,000 times larger. Handling of these changes by the recipient needs to coalesce these 1,000 changes back into 1. Auditors will get confused. Anyone who looks at the code will laugh at you. Someone else may later make the buffer larger. Your boss will probably fire you.
Feedback Loops
Data that is changed in System A is fed by CDC to System B.
Data that is changed in System B is fed by CDC to System A.
So a change made to System A could be fed to System B, which feeds it to System A, and back to System B, and so on infinitely.
CDC needs filtering logic to stop such feedback loops.
Data Provenance Revisited
This is where data provenance is needed.
Any change originating from System A can be excluded by System B from being sent back to System A.
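A minimal sketch of that filter, assuming each change record carries a provenance string as discussed above (the class and parameter names are illustrative):

```java
// Sketch: drop CDC records whose provenance matches the local system,
// breaking the A -> B -> A feedback loop. Assumes provenance is carried
// on each change record, e.g. via the audit fields discussed earlier.
class FeedbackLoopFilter {
    private final String localSystem;

    FeedbackLoopFilter(String localSystem) {
        this.localSystem = localSystem;
    }

    // Apply the incoming change only if it did not originate here.
    boolean shouldApply(String changeProvenance) {
        return !localSystem.equals(changeProvenance);
    }
}
```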
Transactionality
Transactionality does not work with CDC.
A change may transactionally update some data records in System A, be committed, and only then will be visible in the CDC log.
System B cannot abort the transaction as it is already committed.
Constraints are an obvious example. System A may permit a field on a record to be null, and System B may have a non-null constraint on this field.
Accept or reject
Abort is not an option; the choice is to accept or reject. Neither is a great choice, so if transactionality is required, CDC may be a bad design choice.
Accept is a poor solution
Bypassing constraints and accepting data known to be invalid will likely cause problems for processing.
Code somewhere will be written on the basis that invalid data does not exist, because constraints exist. Not your code, someone else’s.
Reject is a poor solution
Reject is the lesser of two evils here.
If we recall the eventual consistency nature of CDC, we can tolerate not applying the change at this point, although here the intention would be to never apply it.
An alert produced by rejection could be the initiator to amend the System A copy of the data again, in a way tolerable to System B. This would allow the two systems to come into line again, although noting that data in System A had an update that System B never saw, a logical coalesce.
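A minimal sketch of the reject-with-alert idea, using a non-null constraint as the example from above (the class shape and alert format are illustrative, not any particular framework’s API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: System B rejects an incoming CDC change that violates a local
// constraint (here, a non-null field), raising an alert instead of
// applying it. The transaction on System A is already committed, so
// rejecting locally is the only alternative to accepting bad data.
class RejectingApplier {
    final List<String> alerts = new ArrayList<>();

    // Returns true if the change was applied, false if rejected.
    boolean apply(String recordKey, String requiredField) {
        if (requiredField == null) { // System B's non-null constraint
            alerts.add("Rejected change to " + recordKey
                    + ": null violates non-null constraint");
            return false;
        }
        // ... apply the change to the local copy here ...
        return true;
    }
}
```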
Concurrent Updates
The last and grandest of the problems is concurrent updates.
Combining the option to change the same data record in two systems with a slow propagation mechanism gives us the problem of concurrent changes.
A data record may be updated in System A and CDC sends it to System B. While this is occurring the same record is updated in System B in a different way. We have branching changes to merge.
Two One-Way CDCs
CDC is a one-way process. Two-way CDC is two independent one-way CDC setups.
The default behavior for CDC is the incoming record replaces the current copy.
Given feedback loop handling, the change from System A would be applied to System B, the change from System B would be applied to System A. So the change applied to System A is lost on System A but retained on System B. And the change applied to System B is lost on System B but retained on System A. Not ideal.
Sometimes this problem can’t occur. It may be that inserts are only done on System A and updates are only done on System B. So concurrent updates cannot occur on both systems at once.
But usually it can occur, and the solution is merge processing. Application code has to find a way to blend the incoming record and the existing data record into something logically valid. Both systems need the same application logic to arrive at the same outcome, even though they may be different technologies.
Which option or options are used depends on the application. Here are some considerations.
Data provenance yet again
We might wish a “first wins” approach, for booking a hotel room. Or a “last wins” approach, for some sort of online tag game.
Either way, data provenance and the lastModifiedDate field help here.
This is not really a merge, it is picking one version of the record to use when it has been updated twice on different systems.
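A minimal sketch of both policies, using the lastModifiedDate audit field (the class and record names are illustrative; ties here go to the first argument, a choice a real system would need to make deterministically on both sides):

```java
// Sketch: "first wins" / "last wins" selection between two versions of
// a record updated concurrently on different systems, driven by the
// lastModifiedDate audit field. Not a merge: one version is picked whole.
class VersionPicker {
    record Version(String value, long lastModifiedDate) {}

    static Version lastWins(Version a, Version b) {
        return a.lastModifiedDate() >= b.lastModifiedDate() ? a : b;
    }

    static Version firstWins(Version a, Version b) {
        return a.lastModifiedDate() <= b.lastModifiedDate() ? a : b;
    }
}
```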
It might be that we have a field counting customer logons. Perhaps they log on once on their cellphone app and this is recorded on System A, and at the same time log on via a browser and this is recorded on System B. So really it’s two logons: we see “+1” from each system but want the result to be “+2”.
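Here picking either version whole would lose a logon; the merge has to sum the deltas relative to the last value both systems agreed on. A minimal sketch, assuming that common base is known:

```java
// Sketch: merging a counter updated concurrently on two systems.
// Replacing one version with the other keeps only one "+1"; applying
// each system's delta relative to the last agreed value keeps both.
class LogonCounterMerge {
    static long merge(long commonBase, long systemACount, long systemBCount) {
        return commonBase
                + (systemACount - commonBase)  // System A's logons
                + (systemBCount - commonBase); // System B's logons
    }
}
```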
Handling the update to differing fields is fairly easy, mostly.
If one system updates a customer’s last name, and another system updates a customer’s first name, then it’s easy enough to combine.
The final version of the data record takes the changed field from one system and the changed field from the other system, making two changed fields in the combined copy.
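A minimal sketch of that field-level merge, assuming the last version both systems agreed on is available as a base for comparison (the record and field names are illustrative):

```java
// Sketch: field-level merge when two systems changed different fields
// of the same record. "base" is the last version both systems agreed
// on; each field is taken from whichever side changed it.
class FieldMerge {
    record Customer(String firstName, String lastName) {}

    static Customer merge(Customer base, Customer fromA, Customer fromB) {
        String first = !base.firstName().equals(fromA.firstName())
                ? fromA.firstName() : fromB.firstName();
        String last = !base.lastName().equals(fromA.lastName())
                ? fromA.lastName() : fromB.lastName();
        return new Customer(first, last);
    }
}
```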
Does the above make any sense logically?
It’s technically easy, but if both the customer’s first name and last name change, is it still the same person?
Perhaps the reject logic mentioned above in transactionality is a better approach. Reject a concurrent update where separate fields don’t make sense to change together.
Summary
In the real world, there will be many installations with the same data in Hazelcast and elsewhere, with updates happening in both places.
An understanding of the key problems outlined above will at least help you approach a solution forewarned of the difficulties ahead. Many of the solutions to these problems are not elegant.
CDC is an important tool in the architect’s toolbox to help with this problem, but it’s not a magic wand.