Change Data Capture: Tracking Database Changes in Real Time
Your application writes to a PostgreSQL database. Somewhere downstream, an analytics system needs to know what changed. A search index needs to update. A cache needs to invalidate. A data warehouse needs to load.
You could poll the database every few seconds, but that is expensive at scale and adds latency. You could modify your application to write to a message queue alongside the database, but that couples your application logic to your data pipeline. Change Data Capture (CDC) solves this by reading the database’s transaction log and streaming changes as they happen.
CDC captures inserts, updates, and deletes from database tables and publishes them as events to a message broker or streaming platform. The source database is untouched. The application is unmodified. Changes flow automatically.
How CDC Works
Most CDC implementations read the database’s write-ahead log (WAL) or transaction log. This log records every modification made to the database at the storage level. CDC tools tail this log and translate log entries into events.
```mermaid
flowchart LR
    App[Application] -->|writes| DB[(PostgreSQL)]
    DB -->|WAL| CDC[CDC Agent]
    CDC -->|events| Kafka[Kafka / Message Broker]
    Kafka --> Consumer1[Search Index]
    Kafka --> Consumer2[Cache]
    Kafka --> Consumer3[Data Warehouse]
```
The key advantage of log-based CDC is that it captures every change without touching the source tables. There is no polling, no added load on the source database, and no application code changes.
Debezium: The Open-Source CDC Platform
Debezium is the most widely used CDC platform in the JVM ecosystem. It reads transaction logs from MySQL, PostgreSQL, MongoDB, and other databases and converts changes into events published to Kafka.
```java
import io.debezium.config.Configuration;
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MySqlCdcExample {
    public static void main(String[] args) throws Exception {
        Configuration config = Configuration.create()
            .with("name", "mysql-cdc-connector")
            .with("connector.class", "io.debezium.connector.mysql.MySqlConnector")
            // The embedded engine must persist its own offsets between restarts
            .with("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore")
            .with("offset.storage.file.filename", "/tmp/mysql-cdc-offsets.dat")
            .with("database.hostname", "localhost")
            .with("database.port", 3306)
            .with("database.user", "debezium")
            .with("database.password", "password")
            .with("database.server.id", "184054")
            .with("topic.prefix", "mysql-cdc")
            .with("schema.history.internal.kafka.bootstrap.servers", "localhost:9092")
            .with("schema.history.internal.kafka.topic", "schema-changes.inventory")
            .build();

        // Json.class yields ChangeEvent<String, String>: key and value as JSON strings
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
            .using(config.asProperties())
            .notifying(record -> {
                System.out.println("Key: " + record.key());
                System.out.println("Value: " + record.value());
            })
            .build();

        // The engine is a Runnable that blocks until closed; run it on its own thread
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);
    }
}
```
Debezium handles the messy details: reading the correct position in the WAL, managing schema evolution, handling snapshots of existing data, and publishing events with consistent structure.
CDC Event Structure
A CDC event contains the before and after state of a changed row. The event envelope includes metadata:
```json
{
  "op": "u",
  "ts_ms": 1711523456789,
  "before": {
    "id": 12345,
    "email": "alice@example.com",
    "loyalty_tier": "silver"
  },
  "after": {
    "id": 12345,
    "email": "alice@example.com",
    "loyalty_tier": "gold"
  },
  "source": {
    "db": "orders_db",
    "table": "customers",
    "lsn": 12345678
  }
}
```
The `op` field indicates the operation type: `c` for create (insert), `u` for update, `d` for delete, and `r` for read (snapshot). The `before` and `after` fields capture the row state before and after the change.
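As a minimal sketch, a consumer might dispatch on the op code like this (the class and method names are illustrative, not part of the Debezium API):

```java
public class CdcOpDecoder {
    // Translate a Debezium op code into a readable operation name
    static String decode(String op) {
        switch (op) {
            case "c": return "insert";
            case "u": return "update";
            case "d": return "delete";
            case "r": return "snapshot read";
            default:  return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("u")); // update
    }
}
```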
Snapshotting: Initial Load Problem
CDC tools need to know where to start reading the log. For a brand-new database, starting from the current log position is fine. For an existing database with months of data, you need a snapshot of the current data before CDC can begin streaming incremental changes.
Most CDC tools support snapshot modes:
- `initial`: Snapshot the database on first run, then stream changes
- `schema_only`: Skip the snapshot and stream only changes from now on (existing data is never captured)
- `when_needed`: Snapshot when the offset is lost and cannot be recovered
Initial snapshots can be large. A table with 500 million rows takes time and resources to snapshot. Plan for this when setting up CDC for the first time on a large database.
Trigger-Based CDC
Before log-based CDC was practical, trigger-based CDC was the standard approach. You create database triggers on source tables that fire on INSERT, UPDATE, and DELETE operations, writing change records to a staging table.
```sql
CREATE TABLE customer_changes (
    change_id BIGSERIAL PRIMARY KEY,
    operation VARCHAR(1),
    customer_id BIGINT,
    email VARCHAR(255),
    loyalty_tier VARCHAR(20),
    changed_at TIMESTAMP DEFAULT NOW()
);

CREATE OR REPLACE FUNCTION capture_customer_changes()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO customer_changes (operation, customer_id, email, loyalty_tier)
        VALUES ('I', NEW.id, NEW.email, NEW.loyalty_tier);
        RETURN NEW;
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO customer_changes (operation, customer_id, email, loyalty_tier)
        VALUES ('U', NEW.id, NEW.email, NEW.loyalty_tier);
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO customer_changes (operation, customer_id, email, loyalty_tier)
        VALUES ('D', OLD.id, OLD.email, OLD.loyalty_tier);
        RETURN OLD;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_customer_cdc
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION capture_customer_changes();
```
Trigger-based CDC adds overhead to every write operation on the source tables. For write-heavy workloads, this overhead can be significant. Log-based CDC is the preferred approach for this reason.
Log-based CDC vs trigger-based vs polling trade-offs:
| Approach | Source load | Latency | Infrastructure | Exactly-once |
|---|---|---|---|---|
| Log-based CDC (Debezium) | Minimal — reads WAL | Seconds | Moderate — CDC agent + broker | At-least-once (idempotent consumers) |
| Trigger-based CDC | High — triggers fire on every write | Near real-time | Low — SQL only | At-least-once (changes lost if a trigger is dropped or disabled) |
| Polling | High — runs queries repeatedly | Minutes to hours | Low — simple scheduler | Depends on polling interval |
Ordering Guarantees
CDC events capture database changes in the order they were committed. Within a single table, order is preserved. Across tables in the same database, if they share a transaction, the changes are emitted together.
Cross-database CDC ordering depends on the database’s transaction log behavior. Most log-based CDC tools provide at-least-once delivery. Design your consumers to be idempotent.
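What idempotent consumption can look like, as a sketch: track the last applied log position per row key and skip anything at or below it. The class name is illustrative, and the in-memory maps stand in for a real target store.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentApplier {
    // Last applied log position per row key; a real consumer would persist this
    private final Map<Long, Long> appliedLsn = new HashMap<>();
    // Materialized row state; stands in for the real target store
    private final Map<Long, String> rows = new HashMap<>();

    // Apply a CDC event only if its LSN is newer than the last one seen for that key.
    // Returns true if applied, false if skipped as a duplicate or stale redelivery.
    public boolean apply(long key, long lsn, String op, String after) {
        Long last = appliedLsn.get(key);
        if (last != null && lsn <= last) {
            return false; // redelivered or out-of-order event: safe to ignore
        }
        if ("d".equals(op)) {
            rows.remove(key);
        } else {
            rows.put(key, after); // treat c, u, and r uniformly as upserts
        }
        appliedLsn.put(key, lsn);
        return true;
    }

    public String get(long key) {
        return rows.get(key);
    }
}
```

Because at-least-once delivery replays events rather than reordering commits within a key, this LSN check makes duplicates harmless.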
Capacity Estimation for CDC
CDC capacity planning comes down to event volume, broker throughput, and consumer processing speed.
Event volume estimation:
One write produces one CDC event, but each event carries the before and after row states plus envelope metadata, so the byte volume runs roughly 2-3x the raw write volume. A database doing 1,000 writes per second on 500-byte rows therefore emits about 1,000 events per second at roughly 1 KB each.
For a 500-byte average row with full before/after capture:
- 1,000 writes/sec × ~1 KB/event ≈ 1 MB/sec or ~86 GB/day
- A single Kafka broker handling 50 MB/sec absorbs this without breaking a sweat
- Partition count drives parallel consumer throughput — one partition per source table is a reasonable starting point
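The arithmetic above as a small sketch, using the assumed figures from the text (method and class names are illustrative):

```java
public class CdcCapacity {
    // Daily CDC volume in GB for a given write rate and average event size
    static double gbPerDay(long writesPerSec, long bytesPerEvent) {
        long bytesPerSec = writesPerSec * bytesPerEvent; // 1,000 x 1,000 B = ~1 MB/sec
        return bytesPerSec * 86_400 / 1e9;               // 86,400 seconds per day
    }

    public static void main(String[] args) {
        // 1,000 writes/sec at ~1 KB per event
        System.out.println(gbPerDay(1_000, 1_000) + " GB/day"); // 86.4 GB/day
    }
}
```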
Snapshot capacity:
An initial snapshot of a 500 million row table at 100,000 rows/sec runs a bit over 80 minutes. While the snapshot runs, CDC keeps capturing WAL changes, and the WAL position the snapshot started from must still be available when it completes. Set snapshot.lock.timeout.ms appropriately for large tables, and make sure PostgreSQL retains enough WAL to cover the snapshot window: the connector's replication slot normally pins it, with wal_keep_size (wal_keep_segments before PostgreSQL 13) as a fallback.
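The snapshot-duration estimate is simple division; a sketch, with the table size and scan rate assumed as in the text:

```java
public class SnapshotEstimate {
    // Minutes needed to snapshot a table at a given scan rate
    static double snapshotMinutes(long rows, long rowsPerSec) {
        return (double) rows / rowsPerSec / 60.0;
    }

    public static void main(String[] args) {
        // 500M rows at 100,000 rows/sec: 5,000 seconds, a bit over 80 minutes
        System.out.println(snapshotMinutes(500_000_000L, 100_000));
    }
}
```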
Sizing the CDC agent:
Debezium runs as a single-threaded connector per source database by default. For high-throughput sources, increase max.batch.size and max.queue.size to buffer bursts. Watch lag, queue utilization, and schema history progress.
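As a sketch, the buffer settings mentioned above might be raised from their defaults like this (the values are illustrative starting points, not recommendations — tune against observed lag):

```properties
# Debezium connector tuning (illustrative values, roughly double the defaults)
max.batch.size=4096
max.queue.size=16384
```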
When to Use CDC
Use CDC when your source system cannot emit change events directly and you need near-real-time movement without polling. Multiple independent consumers on the same change stream is another good signal — CDC decouples source databases from downstream systems cleanly.
Do not use CDC when your managed cloud database has limited WAL access. Do not use it for simple periodic batch loads where ETL is easier. Do not use it if write latency overhead is unacceptable for sensitive OLTP workloads. And if you need to capture application-level logic changes rather than physical database writes, triggers are the better tool.
Common CDC use cases:
- Data warehouse feeding: Capture changes from OLTP databases and feed them into a data warehouse without touching the source system.
- Search index updates: Keep Elasticsearch or OpenSearch indexes synchronized with database state.
- Microservices data sharing: Instead of sharing databases across services, each service consumes the CDC stream from a shared source.
- Cache invalidation: Update or invalidate caches when source data changes, without application-level cache management.
For related reading on how CDC fits into data pipelines, see Extract-Transform-Load and Pipeline Orchestration. For a broader view of how databases publish change events, see Event-Driven Architecture.
Common Pitfalls
CDC introduces complexity. The change stream is a new system to operate, monitor, and troubleshoot.
Schema changes on the source table require careful handling. Adding a column produces CDC events with the new column. If your consumers do not handle this gracefully, they break.
Lag monitoring is critical. If the CDC agent falls behind the WAL, you build up delay in your downstream systems. Monitor the consumer lag and set alerts.
Exactly-once delivery is hard to guarantee across the full CDC-to-consumer path. Most CDC tools provide at-least-once. Idempotent consumers are essential.
Observability for CDC
CDC runs in the background. When it breaks, you usually find out because downstream consumers start missing events — not because the agent pages you.
What to track:
- CDC agent heartbeat: confirm the agent is alive and publishing. A Debezium connector that goes silent is a disaster waiting to happen.
- WAL lag: how far behind the agent is reading from the WAL. Rising lag means the broker cannot keep up with source writes, or the agent is stuck.
- Snapshot progress: when running initial snapshots, track rows scanned vs total. Large tables take hours.
- Schema history age: for databases with DDL changes, the schema history topic grows over time. Debezium replays it on connector restart, so keep it intact (infinite retention, no compaction) and monitor its size and replay time.
- Consumer lag per topic partition: each Kafka partition representing a source table should have a low consumer lag.
Alerts to set:
- CDC agent is down for more than 60 seconds
- WAL lag exceeds 5 minutes of wall-clock time
- Consumer group has no active members
- Schema history topic has unprocessed messages older than 1 hour
Quick Recap
- Log-based CDC reads the WAL — every committed change captured, no source tables touched, no application changes needed.
- Debezium is the standard open-source CDC tool for Kafka. Snapshots, schema evolution, offset management — all handled.
- CDC events carry before/after row state. The op field tells you insert, update, or delete.
- Make consumers idempotent. CDC gives at-least-once, not exactly-once.
- Watch WAL lag, agent heartbeat, and snapshot progress. A silent CDC agent cascades into downstream failures.
Conclusion
Change Data Capture solves the problem of getting database changes to downstream systems without modifying applications or adding polling overhead. Log-based CDC reads the transaction log directly, capturing every committed change in real time.
Debezium is the standard open-source CDC tool for Kafka-based pipelines. It handles snapshots, schema evolution, and offset management out of the box. For teams already running Kafka, Debezium is a natural choice.
CDC is a powerful primitive. Once you have a change stream, you can feed multiple independent consumers from a single source without the source ever knowing. That is the foundation of the streaming-first architecture that modern data platforms are built on.