What Is Change Data Capture?

Change data capture (CDC) is a software process or technology that identifies and tracks changes to data stored in a database, such as inserts, updates, and deletes. While a database is useful for storing the latest state of data, CDC preserves the various states of data over time by providing an audit trail, and it can provide incremental changes to other repositories or applications.

As a very basic example, CDC lets you look up in December what your home address was in January of the same year, even if you moved in between and the database now holds only your current address.
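
The address example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `address_history` list, its dates and street names, and the `address_as_of` helper are all hypothetical, standing in for the change records a CDC feed would accumulate.

```python
from datetime import date

# Hypothetical change history for one person's address, oldest first.
# Each entry is (date the address took effect, address).
address_history = [
    (date(2023, 1, 1), "12 Oak Street"),
    (date(2023, 7, 15), "98 Pine Avenue"),
]


def address_as_of(history, as_of):
    """Return the address in effect on `as_of`, or None if none was recorded yet."""
    current = None
    for effective, address in history:
        if effective <= as_of:
            current = address  # this change had already happened by `as_of`
        else:
            break  # history is ordered, so later entries cannot apply
    return current


january = address_as_of(address_history, date(2023, 1, 31))
december = address_as_of(address_history, date(2023, 12, 1))
print(january)   # address before the move
print(december)  # address after the move
```

A table that stores only the current row could answer the December question but not the January one; the retained change history is what makes the earlier lookup possible.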

The change data capture process via the publisher/subscriber method. Multiple databases and applications can subscribe to the change data.

How Does Change Data Capture Work?

CDC captures the records changed by database operations such as inserts, updates, and deletes, and makes a record of each change available either within the database itself or to other applications that rely on the data. CDC tools typically rely on the database’s transaction log, which the database maintains internally to track record changes for system recovery. CDC tools use that information to deliver database changes to an external system.
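
To make the idea of "a record of a change" concrete, here is a minimal sketch of what a change event might look like and how an external system could apply it. The `ChangeEvent` class, its fields, and the `apply_change` function are assumptions made for illustration; real CDC tools define their own event formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    op: str              # "insert", "update", or "delete"
    key: int             # primary key of the affected row
    row: Optional[dict]  # new row contents; None for a delete


def apply_change(replica: dict, event: ChangeEvent) -> None:
    """Apply one change event to a downstream copy keyed by primary key."""
    if event.op in ("insert", "update"):
        replica[event.key] = event.row
    elif event.op == "delete":
        replica.pop(event.key, None)
    else:
        raise ValueError(f"unknown operation: {event.op}")


# Hypothetical sequence of changes delivered in order.
events = [
    ChangeEvent("insert", 1, {"name": "Ada"}),
    ChangeEvent("insert", 2, {"name": "Grace"}),
    ChangeEvent("update", 2, {"name": "Grace H."}),
    ChangeEvent("delete", 1, None),
]

replica = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

Because the events are applied in the order they occurred, the downstream copy ends in the same state as the source without ever reading the source tables directly.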

What Are Common Methods of Change Data Capture?

There are different approaches that a system can use to capture changes in data. The use of timestamps is one of the most popular methods of CDC, as most systems track when a row was created and most recently modified.
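
A timestamp-based approach can be sketched as a polling query that asks for rows modified since the last poll. The example below uses Python's built-in `sqlite3` module with an in-memory database; the `customers` table, its `updated_at` column (plain integers here for simplicity), and the `poll_changes` helper are hypothetical names chosen for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 105)],
)


def poll_changes(conn, last_seen):
    """Return rows modified after `last_seen` and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


# First poll picks up every existing row.
first_batch, watermark = poll_changes(conn, 0)

# One row is modified; the application is assumed to bump updated_at itself.
conn.execute(
    "UPDATE customers SET name = ?, updated_at = ? WHERE id = ?",
    ("Ada L.", 110, 1),
)

# Second poll returns only the row changed since the previous watermark.
second_batch, watermark = poll_changes(conn, watermark)
print(second_batch)
```

One limitation worth noting: a row that has been deleted no longer exists to be selected, so a purely timestamp-based poll like this one does not see deletes unless the application marks rows as deleted instead of removing them.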

Database transaction logs are also a resource for CDC. Log scanners can identify any changes in these transaction logs. As long as the log scanner can interpret the log, this can be an ideal solution for CDC because it has little impact on the underlying database, delivers changes with low latency, and ensures transaction integrity because every change is tracked in order.
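
The following sketch shows the general shape of a log scanner. The log format here is invented for illustration (real transaction logs are binary and database-specific), and the `scan_committed` function is a hypothetical name. It emits changes only for transactions that commit, in commit order, and discards changes from transactions that roll back.

```python
# Hypothetical, heavily simplified transaction log. Each entry has a log
# sequence number (lsn), a transaction id (tx), and an operation.
log = [
    {"lsn": 1, "tx": "t1", "op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"lsn": 2, "tx": "t2", "op": "insert", "key": 2, "row": {"name": "Bob"}},
    {"lsn": 3, "tx": "t1", "op": "commit"},
    {"lsn": 4, "tx": "t2", "op": "rollback"},
    {"lsn": 5, "tx": "t3", "op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"lsn": 6, "tx": "t3", "op": "commit"},
]


def scan_committed(log):
    """Return the data changes of committed transactions, in commit order."""
    pending = {}  # tx id -> changes seen so far for that open transaction
    emitted = []
    for entry in log:
        tx = entry["tx"]
        if entry["op"] == "commit":
            emitted.extend(pending.pop(tx, []))
        elif entry["op"] == "rollback":
            pending.pop(tx, None)  # discard changes that never took effect
        else:
            pending.setdefault(tx, []).append(entry)
    return emitted


changes = scan_committed(log)
print([(c["lsn"], c["op"], c["key"]) for c in changes])
```

The scanner only reads the log, which is why this style of CDC places little load on the tables themselves.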

As event streaming has gained popularity, so has the publish/subscribe model of CDC, in which database triggers log or publish change events to a table and those changes are shared with the CDC system. The series of updates that CDC delivers looks like a stream of data, making stream processing engines (like Hazelcast Jet) a suitable technology for consuming CDC data.
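
As a rough sketch of the trigger-based approach, the example below uses SQLite triggers (through Python's built-in `sqlite3` module) to log every insert, update, and delete into a change table, then fans the logged events out to two subscribers. The table names `customers` and `customer_changes`, the trigger names, and the in-memory `subscribers` lists are assumptions made for this illustration, not part of any particular CDC product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- Change table that the triggers write to; seq preserves the order of changes.
CREATE TABLE customer_changes (
    seq  INTEGER PRIMARY KEY AUTOINCREMENT,
    op   TEXT,
    id   INTEGER,
    name TEXT
);

CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customer_changes (op, id, name) VALUES ('insert', NEW.id, NEW.name);
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
    INSERT INTO customer_changes (op, id, name) VALUES ('update', NEW.id, NEW.name);
END;

CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
    INSERT INTO customer_changes (op, id, name) VALUES ('delete', OLD.id, OLD.name);
END;
""")

# Ordinary writes to the source table; the triggers record each one.
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

# "Publish" the logged events to every subscriber, in order.
subscribers = [[], []]
for event in conn.execute("SELECT op, id, name FROM customer_changes ORDER BY seq"):
    for inbox in subscribers:
        inbox.append(event)

print(subscribers[0])
```

Unlike timestamp polling, the triggers capture deletes as well, at the cost of extra write work inside the source database for every change.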

Other methods of CDC compare version numbers or status indicators stored on each row to detect which rows have changed.
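
A version-number check might look like the sketch below, assuming each row carries a `version` field that increases on every modification. The data, the `last_synced_versions` map, and the `changed_keys` helper are hypothetical.

```python
# Current rows in the source, each with a version incremented on every write.
source_rows = {
    1: {"version": 3, "name": "Ada L."},
    2: {"version": 1, "name": "Grace"},
    3: {"version": 1, "name": "Linus"},
}

# Versions the destination saw during its previous synchronization.
last_synced_versions = {1: 2, 2: 1}


def changed_keys(rows, synced):
    """Return keys whose version is newer than the last one synchronized."""
    return sorted(
        key for key, row in rows.items()
        if row["version"] > synced.get(key, 0)
    )


print(changed_keys(source_rows, last_synced_versions))
```

Row 1 is reported because its version advanced, row 3 because the destination has never seen it, and row 2 is skipped because its version is unchanged.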

Is ETL a Method of Change Data Capture?

ETL (the process of extracting, transforming, and loading data) can often bring new or updated data from a source system to a database or other application. However, ETL is not a CDC process: it is typically used to move data from one location to another, transforming it along the way. If an ETL process is used merely to make an exact, up-to-date copy of a data store in another location, CDC can be used instead. CDC can then reduce the resources the ETL process would otherwise consume, because it handles only the data that changed. So rather than pulling all data from a source system and recreating a database table from scratch, for example, a CDC process can identify only the new and changed data and propagate those additions and changes to the destination system.
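
The difference in work between a full reload and an incremental, CDC-style update can be sketched as follows. Both functions and the sample data are hypothetical; the "rows moved" counts stand in for the resources each approach would consume.

```python
def full_reload(source):
    """Recreate the destination from scratch; every source row is transferred."""
    destination = dict(source)
    return destination, len(source)


def incremental_sync(destination, changes):
    """Apply only the captured changes; one transfer per change."""
    for op, key, row in changes:
        if op == "delete":
            destination.pop(key, None)
        else:  # "insert" or "update"
            destination[key] = row
    return destination, len(changes)


# Initial state: 1,000 rows, copied once to the destination.
source = {i: {"value": i} for i in range(1000)}
destination, _ = full_reload(source)

# Three changes happen at the source, and CDC captures them.
source[5] = {"value": 500}
source[1000] = {"value": 1000}
del source[7]
changes = [
    ("update", 5, {"value": 500}),
    ("insert", 1000, {"value": 1000}),
    ("delete", 7, None),
]

reloaded, rows_full = full_reload(source)
synced, rows_incremental = incremental_sync(destination, changes)

print(reloaded == synced)           # both approaches reach the same state
print(rows_full, rows_incremental)  # rows moved by each approach
```

Both paths leave the destination identical to the source, but the incremental path moves only as many rows as there were changes.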


Related Topics

Streaming ETL

Data Pipeline


Microservices Architecture

Java Microservices

Stream Processing

Further Reading

Keeping Your Cache Hot with Real-Time, Push-based Synchronization