Change data capture in production: streaming database writes into reliable event pipelines

How log-based CDC turns PostgreSQL and other databases into event sources, when it beats the transactional outbox, and how to operate ordering, schema drift, and consumer correctness at scale.

Author: Matheus PalmaJune 15, 202611 min read

Software engineeringArchitectureBackendPostgreSQLMessagingReliability

Your analytics warehouse is six hours behind. Search indexes drift after bulk imports. A compliance audit asks for a complete history of who changed what in customers—and your application only publishes events when the code path that wrote the row remembered to call publish(). Meanwhile, a data engineer runs a one-off SQL fix in production, and every downstream projection stays wrong until someone notices.

Change data capture (CDC) reads the database's own mutation log—PostgreSQL's write-ahead log (WAL), MySQL binlog, SQL Server CDC tables—and turns row-level changes into a durable, ordered stream of events. The application does not have to cooperate. Any write that commits, including migrations, admin scripts, and legacy services you forgot still exist, becomes visible to consumers.

This article explains how log-based CDC works, how it compares to patterns you may already use (transactional outbox, triggers, LISTEN/NOTIFY), and what it takes to run CDC pipelines in production without turning schema migrations into midnight incidents. These are the trade-offs I walk teams through when they need cross-cutting data propagation—search, warehouses, audit trails, cache invalidation at scale—not just "notify subscribers when our Node service updates an order."

The problem CDC solves

Most event-driven designs start with application-published messages: the service commits business state, then publishes to Kafka, SNS, or a queue. That works when:

Every writer goes through your service layer.
You control the event shape and can version it deliberately.
You accept that some writes never become events if code forgets or crashes between commit and publish (the dual-write problem the transactional outbox pattern addresses).

CDC inverts the source of truth: the database commit log is the event source. Consumers see INSERT, UPDATE, and DELETE (or equivalent) as they land in storage, regardless of which process performed the write.

That is powerful for:

Analytics and warehouses — near-real-time replication to BigQuery, Snowflake, or ClickHouse without bespoke ETL per table.
Search and read models — keep Elasticsearch or Typesense in sync when multiple services touch the same tables.
Audit and compliance — immutable trail of row changes with before/after images.
Cache and projection invalidation — react to any mutation, including backfills and support scripts.

CDC does not replace domain events. A row change is a storage fact, not necessarily a business fact. orders.status changing from processing to shipped might be one business event or three if intermediate states exist. You still design consumers to interpret physical changes into meaningful domain semantics—or publish explicit domain events from application code when that distinction matters for product logic.

How log-based CDC works

Write-ahead log as the source

Relational databases already record every committed change to a append-only log before (or as) data files update. PostgreSQL's WAL, MySQL's binlog, and SQL Server's transaction log exist for crash recovery and replication. CDC connectors tail that log (or a logical decoding stream derived from it) and emit structured change records.

For PostgreSQL, the typical stack is:

Enable logical replication or logical decoding (wal_level = logical).
Create a replication slot so the server retains WAL until the connector acknowledges consumption (prevents silent data loss if the consumer falls behind).
Run a connector (Debezium, PeerDB, native logical replication subscriber, or a managed service) that decodes WAL into change events.
Publish to a message bus or sink (Kafka, Kinesis, Pub/Sub, S3, warehouse loader).

A simplified flow:

flowchart LR
  App[Application writers] --> DB[(PostgreSQL)]
  DB --> WAL[WAL / logical decoding]
  WAL --> Connector[CDC connector]
  Connector --> Bus[Event bus / stream]
  Bus --> Search[Search index]
  Bus --> WH[Warehouse]
  Bus --> Cache[Cache invalidator]

What a change event contains

Connectors usually emit JSON or Avro with fields like:

Field	Meaning
`op`	`c` create, `u` update, `d` delete, `r` read (snapshot)
`before`	Row image before change (null on insert)
`after`	Row image after change (null on delete)
`source`	Table, schema, database, LSN/offset, transaction id
`ts_ms`	Connector or source timestamp

Why both before and after matter: updates let consumers compute diffs, implement idempotent upserts, or ignore no-op writes. Deletes need the key columns in before so consumers can remove documents from search indexes even when soft-delete is not used.

Initial snapshot vs streaming

When you attach CDC to an existing table, the connector typically:

Snapshots current rows (consistent read, often chunked) and emits read events.
Switches to streaming from a log position captured during the snapshot.

Snapshot + stream is how you bootstrap a search index without a separate bulk export job. The trade-off is load on the source database during snapshot—plan off-peak windows, rate limits, and replica placement (see below).

CDC vs alternatives

Approach	Durability	Sees all writers	Event shape	Operational cost
Application publish	Depends on broker	No (only instrumented paths)	Rich domain events	Low if already on a bus
Transactional outbox	Yes (same DB txn)	No (only app writes via outbox)	Domain-shaped	Medium (relay worker)
DB triggers → queue table	Yes (in txn)	Yes for that DB	Custom	Medium; trigger sprawl
LISTEN/NOTIFY	No	Yes	Tiny hints only	Low; not for pipelines
Log-based CDC	Yes (log + slot)	Yes	Physical row changes	Higher; infra + schema care

Use outbox when the product needs explicit domain events and every writer is your service. Use CDC when you need complete coverage of table mutations across writers you do not fully control, or when analytics/search should not depend on application release cadence.

In practice, mature systems often combine both: outbox for workflow choreography (OrderPlaced), CDC for warehouse sync and search projections (orders row mirror).

Operating CDC in production

Replication slots and WAL retention

A PostgreSQL replication slot tells the server: "do not recycle WAL until I have read up to here." If the CDC consumer stops for hours, WAL accumulates on disk and can fill the volume. Monitor:

Slot lag (bytes or segments behind current WAL).
Oldest xmin / catalog bloat on long-running slots.
Consumer health with alerts before disk pressure becomes an outage.

Have a runbook: restart connector, temporarily pause non-critical consumers, or—in extremis—drop a slot knowing you must resnapshot from a known offset. Dropping a slot without a recovery plan loses unconsumed history.

Where to read from: primary vs replica

Running CDC against a read replica reduces load on the primary and isolates analytics backpressure. Requirements:

Replica must support logical decoding (version and configuration dependent).
Replication lag becomes event lag—acceptable for search and warehouses, often unacceptable for synchronous cache invalidation on the hot path.
Failover changes the replication identity; connectors must handle promotion and slot recreation.

For latency-sensitive invalidation, some teams CDC from the primary with tight resource limits; for warehouse sync, a lag-tolerant replica is standard.

Ordering and partitioning

Within a single PostgreSQL database, changes in one transaction appear in commit order. Across tables or shards, global order does not exist. Design consumers to:

Partition Kafka topics by primary key so all changes for one entity stay ordered.
Use idempotent upserts keyed by (table, pk, lsn) or equivalent offset.
Accept out-of-order updates across unrelated keys and resolve with last-write-wins only when the business allows it.

Schema evolution

Application code and CDC consumers share a contract on column names and types, but migrations change that contract independently.

Common strategies:

Additive changes first — new nullable column; consumers ignore unknown fields; deploy consumer, then writers, then tighten.
Schema registry (Avro/Protobuf with compatibility modes) for strongly typed pipelines.
Debezium schema change events — propagate ALTER TABLE so downstream jobs rebuild or remap columns.
Avoid destructive migrations in one step — rename via expand-contract: add column, backfill, dual-write, switch readers, drop old column.

The failure mode is silent corruption: a NOT NULL column appears, old consumer code drops events, and search is wrong for a week.

Deletes, tombstones, and soft deletes

Physical DELETE events must propagate tombstones to compacted topics (Kafka) or explicit delete operations in search. If you only use soft deletes (deleted_at), CDC emits UPDATE events; consumers must treat non-null deleted_at as removal. Mixing both without a documented rule confuses projection code.

PII and security

CDC streams often contain more data than any single API exposes—every column, including internal flags and notes. Apply:

Column masking or exclude lists in the connector.
Encryption in transit and at rest on the bus.
Separate topics per sensitivity tier with IAM boundaries.
Retention limits aligned with GDPR and internal policy.

Treating the CDC topic as "internal only" without access controls is a common audit finding.

Practical example: PostgreSQL → Kafka with a typed consumer

The following is a minimal but realistic slice: PostgreSQL logical replication enabled, Debezium-style JSON on Kafka, and a Node consumer that maintains a search document. It is illustrative—not a full Debezium deployment guide.

Source table:

CREATE TABLE products (
  id          uuid PRIMARY KEY,
  tenant_id   uuid NOT NULL,
  sku         text NOT NULL,
  title       text NOT NULL,
  price_cents integer NOT NULL,
  updated_at  timestamptz NOT NULL DEFAULT now()
);

-- Publication for logical replication (simplified; real setups use connector-managed publications)
-- CREATE PUBLICATION products_pub FOR TABLE products;

Example CDC message (Debezium envelope, abbreviated):

{
  "op": "u",
  "before": {
    "id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "tenant_id": "a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11",
    "sku": "WIDGET-01",
    "title": "Widget",
    "price_cents": 1299,
    "updated_at": "2026-06-14T10:00:00Z"
  },
  "after": {
    "id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "tenant_id": "a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11",
    "sku": "WIDGET-01",
    "title": "Widget Pro",
    "price_cents": 1499,
    "updated_at": "2026-06-15T09:12:00Z"
  },
  "source": {
    "schema": "public",
    "table": "products",
    "lsn": 26789012345,
    "txId": 981234
  }
}

Consumer: idempotent projection with offset deduplication

import type { EachMessagePayload } from "kafkajs";

type ProductRow = {
  id: string;
  tenant_id: string;
  sku: string;
  title: string;
  price_cents: number;
  updated_at: string;
};

type DebeziumEnvelope = {
  op: "c" | "u" | "d" | "r";
  before: ProductRow | null;
  after: ProductRow | null;
  source: { table: string; lsn: number };
};

/** Dedup store: Redis SET or DB table `(topic, partition, offset)` processed */
async function alreadyProcessed(key: string): Promise<boolean> {
  // implementation-specific
  return false;
}

async function markProcessed(key: string): Promise<void> {}

export async function handleProductChange({
  topic,
  partition,
  message,
}: EachMessagePayload): Promise<void> {
  if (!message.value) return;

  const offsetKey = `${topic}:${partition}:${message.offset}`;
  if (await alreadyProcessed(offsetKey)) return;

  const env = JSON.parse(message.value.toString()) as DebeziumEnvelope;
  if (env.source.table !== "products") return;

  const docId = env.after?.id ?? env.before?.id;
  if (!docId) return;

  if (env.op === "d") {
    await searchIndex.delete(docId);
    await markProcessed(offsetKey);
    return;
  }

  const row = env.after;
  if (!row) return;

  await searchIndex.upsert(docId, {
    tenantId: row.tenant_id,
    sku: row.sku,
    title: row.title,
    priceCents: row.price_cents,
    updatedAt: row.updated_at,
    /** Store LSN for debugging reorder issues */
    sourceLsn: env.source.lsn,
  });

  await markProcessed(offsetKey);
}

Key design choices in this handler:

Idempotency via offset keys (or (id, lsn) if you replay topics).
Delete path uses before when after is null.
Physical row → search document mapping stays dumb; business rules belong upstream or in a dedicated enrichment step.
LSN stored for support when someone asks "why did search update twice?"

For consulting engagements, I treat this consumer as one projection in a larger contract: topic naming, schema registry, dead-letter queues, and lag SLOs are defined before the first connector goes to production—not after the WAL disk alarm fires.

Common mistakes and pitfalls

Treating CDC events as domain events — downstream workflow engines that need ShipmentDispatched will become brittle if they infer intent from five column toggles. Map physical changes explicitly or keep outbox for workflows.
Ignoring replication slot lag — the consumer is down over a long weekend; WAL fills the disk; the primary stops accepting writes. Alert on lag bytes, not only "connector process up."
Running heavy snapshot on the primary during peak — bootstrap search from a replica or throttle snapshot parallelism.
Non-idempotent consumers — at-least-once delivery is the default; without dedup keys, every replay duplicates warehouse rows or search documents.
Breaking changes in one migration — rename column without dual-read period; consumer deserializer fails; events skip silently if errors are swallowed.
Assuming global ordering — concurrent updates to related tables (orders and order_lines) may arrive out of order; design compensating reads or single-table denormalization where order matters.
Shipping full PII to every subscriber — one compromised consumer credential exposes the entire table history, not just one API's view.
No dead-letter path — poison messages block partition processing; operators need a DLQ, quarantine, and replay tooling.

Conclusion

Change data capture turns the database commit log into a durable, comprehensive mutation stream—the right tool when analytics, search, audit, or cross-service projections must reflect every committed write, not only the paths your current microservices remember to instrument. It complements, rather than replaces, application-level patterns like the transactional outbox: physical CDC for breadth and timeliness of data movement; domain events for product semantics and choreography.

Operating CDC well means respecting replication slots, schema evolution, ordering limits, and consumer idempotency as first-class design constraints—not production surprises. Teams that get this right ship platforms where warehouses stay current, search recovers from backfills automatically, and compliance questions get answered from the log instead of from hope.

If you are standing up event-driven architecture and need help choosing between outbox, CDC, and broker topology—or hardening pipelines before they become the critical path—I work with teams on scalable, production-ready backends where data correctness and operability are non-negotiable.

Get an email when new articles are published. No spam — only new posts from this blog.