Change data capture in production: logical replication, event shape, and consumer design

How to turn database writes into reliable events with CDC: PostgreSQL logical replication, Debezium-style connectors, ordering guarantees, schema evolution, and consumer pitfalls.

Author: Matheus Palma · June 19, 2026 · 8 min read

Software engineering, Architecture, Backend, PostgreSQL, Messaging, Reliability

Your analytics warehouse is six hours behind production. Search indexes drift after a bad deploy. A new microservice needs every users row change, but the team that owns the monolith will not add publish calls to fifty code paths. You ask for a nightly dump; product wants near-real-time. The failure is not “events are hard”—it is that you are trying to derive change streams from application code that was never written to be a message bus.

Change data capture (CDC) reads the database’s own write log and emits structured change events: inserts, updates, deletes, with before/after values. When you need many downstream consumers of the same truth, or you cannot trust every service to publish consistently, CDC is often the most honest integration point. This article covers how CDC works under the hood, how it differs from the transactional outbox, and what breaks in production once consumers multiply.

CDC vs application-level publishing

Two families of event production compete for the same architectural slot:

Approach	Strength	Weakness
Application publish (outbox, direct broker calls)	Rich domain events, explicit intent, easy to leave out fields consumers should not see	Every write path must participate; refactors miss publishes
CDC	Captures all row changes, including ad-hoc SQL and legacy jobs	Events are storage-shaped, not domain-shaped; harder to express “OrderPlaced”

They are complementary, not exclusive. In consulting work on event-driven migrations, the teams that succeed treat CDC as infrastructure sync (search, cache, warehouse, read models) and the outbox as workflow signals (emails, payments, sagas). Mixing both without boundaries produces duplicate events and confused consumers.

What CDC actually reads

Relational databases keep a durable log of mutations: PostgreSQL's write-ahead log (WAL) exists for crash recovery, and MySQL keeps a separate binary log (binlog) for replication. Logical decoding (PostgreSQL) or row-based binlog reading (MySQL) exposes those changes as a stream of logical operations instead of physical page diffs. A connector (Debezium, Maxwell, native cloud CDC) tails that stream, serializes events, and publishes to Kafka, Kinesis, Pub/Sub, or similar.

The important implication: CDC events reflect committed transactions. Uncommitted writes never appear. Rolled-back transactions disappear. That is the same durability boundary as your database—useful when you need a faithful mirror, dangerous when you expected pre-commit side effects.

PostgreSQL logical replication: the mechanics that matter

On PostgreSQL, CDC typically uses logical replication slots and a publication that names which tables (or all tables) to replicate.

-- Simplified setup; run with appropriate roles and monitoring in production.

ALTER SYSTEM SET wal_level = logical;
-- restart required after wal_level change

CREATE PUBLICATION app_cdc FOR TABLE users, orders, order_items;

-- Connector creates a replication slot, e.g.:
-- SELECT * FROM pg_create_logical_replication_slot('debezium', 'pgoutput');

Operational details that decide whether CDC survives week two:

Replication slots and disk growth

A slot tells PostgreSQL not to recycle WAL until the consumer has acknowledged progress. If your connector stops for a weekend, WAL accumulates. Disk fills. The primary may halt writes. Monitor slot lag (pg_replication_slots, bytes behind) as a paging metric, not a dashboard curiosity.
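To make the paging idea concrete, here is a minimal sketch of how a monitoring job might classify slot lag. The query relies on the `pg_replication_slots` view and the `pg_wal_lsn_diff` and `pg_current_wal_lsn` functions (PostgreSQL 10+). The `SlotRow` shape, the `classifySlot` helper, and both thresholds are assumptions made for illustration, not recommended values.

```typescript
// Query a monitoring job could run on the primary. lag_bytes approximates
// the WAL the slot is still preventing PostgreSQL from recycling.
const SLOT_LAG_SQL = `
  SELECT slot_name,
         active,
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
  FROM pg_replication_slots
  WHERE slot_type = 'logical'`;

// Hypothetical row shape after mapping the query result. Many drivers return
// numeric columns as strings, so convert lag_bytes before building this.
type SlotRow = { slotName: string; active: boolean; lagBytes: number };
type Severity = "ok" | "warn" | "page";

// Hypothetical thresholds; size them against free disk on the WAL volume.
const WARN_BYTES = 1 * 1024 * 1024 * 1024; // 1 GiB
const PAGE_BYTES = 10 * 1024 * 1024 * 1024; // 10 GiB

function classifySlot(row: SlotRow): Severity {
  if (row.lagBytes >= PAGE_BYTES) return "page";
  // An inactive slot has no consumer draining it, so escalate earlier.
  if (!row.active && row.lagBytes >= WARN_BYTES) return "page";
  if (row.lagBytes >= WARN_BYTES || !row.active) return "warn";
  return "ok";
}
```

Treating an inactive slot as at least a warning, even at zero lag, reflects the weekend scenario above: the lag is small only until writes resume.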

Permissions and security

CDC readers need replication privileges. Treat connector credentials like backup infrastructure: narrow table scope, no superuser in application paths, rotate keys, and never expose raw change streams to untrusted tenants without row filters or downstream ACL enforcement.
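As a rough sketch of downstream enforcement, the transform below drops events that belong to other tenants and strips columns that are not explicitly allowlisted before anything reaches a tenant-facing topic. The `tenant_id` column, the allowlist contents, and the `filterForTenant` helper are all hypothetical.

```typescript
type Row = Record<string, unknown>;
type ChangeEvent = { table: string; before: Row | null; after: Row | null };

// Hypothetical allowlist of columns each table may expose downstream.
const EXPOSED_COLUMNS: Record<string, readonly string[]> = {
  users: ["id", "tenant_id", "status"],
};

// Copies only the listed columns that are present on the row.
function pick(row: Row | null, columns: readonly string[]): Row | null {
  if (row === null) return null;
  const out: Row = {};
  for (const column of columns) {
    if (column in row) out[column] = row[column];
  }
  return out;
}

// Returns the event restricted to allowed columns, or null when the event
// belongs to another tenant or to a table that is not allowlisted.
function filterForTenant(event: ChangeEvent, tenantId: string): ChangeEvent | null {
  const columns = EXPOSED_COLUMNS[event.table];
  if (!columns) return null;
  const owner = (event.after ?? event.before)?.["tenant_id"];
  if (owner !== tenantId) return null;
  return {
    table: event.table,
    before: pick(event.before, columns),
    after: pick(event.after, columns),
  };
}
```

Defaulting to "expose nothing" for unknown tables means a newly replicated table stays private until someone adds it to the allowlist. PostgreSQL 15 and later can also apply row filters and column lists on the publication itself, which keeps the data out of the stream entirely.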

Schema changes

ALTER TABLE is where CDC projects go to suffer. Adding a nullable column is usually fine. Renaming columns, changing types, or splitting tables breaks consumers that assumed stable shapes. Production teams adopt expand-contract migrations (see the dedicated article on zero-downtime database migrations) because CDC and read models cannot be redeployed atomically with the writer.
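On the consumer side, expand-contract usually means reading tolerantly while both shapes are in flight. The sketch below assumes a hypothetical rename of `users.email` to `users.email_address`; neither the rename nor the `readEmail` helper comes from a real schema.

```typescript
type Row = Record<string, unknown>;

// Hypothetical rename in progress: users.email -> users.email_address.
// During the expand phase rows may carry either column or both; after the
// contract phase only the new column remains.
function readEmail(row: Row): string | null {
  const value = row["email_address"] ?? row["email"];
  return typeof value === "string" ? value : null;
}
```

A consumer deployed with this reader works before, during, and after the migration, so the writer and the read model never need to ship at the same moment. The fallback can be deleted once the old column is gone from every event still retained on the topic.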

Event shape: from row diffs to useful messages

Raw Debezium-style payloads are verbose. A single UPDATE might include:

{
  "op": "u",
  "before": { "id": "ord_1", "status": "pending", "version": 4 },
  "after": { "id": "ord_1", "status": "paid", "version": 5 },
  "source": { "table": "orders", "lsn": 9843210 }
}

Consumers rarely want the full row on every change. Common transformations:

Keying — Partition by primary key (ord_1) so all changes for one entity stay ordered on one partition.
Diff projection — Emit only changed columns when before and after differ, reducing noise for wide tables.
Envelope metadata — Attach trace_id, tenant_id, or occurred_at if present on the row; do not invent domain names the table does not store.
Tombstones — Deletes must propagate; search indexes and caches need explicit removal, not silent staleness.

Why this matters: a warehouse loader and a Redis cache invalidator need different shapes. A stream processing layer (Kafka Streams, Flink, Materialize, or a small consumer fleet) is often cheaper than forcing every subscriber to parse Debezium JSON.
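Keying, diff projection, and tombstones can be sketched as one small transform. This assumes the `before` image is populated on updates, which on PostgreSQL depends on the table's `REPLICA IDENTITY` setting; when it is missing, the sketch falls back to the full `after` row. The `project` and `changedColumns` names and the `Projected` shape are illustrative, not part of any connector API.

```typescript
type Row = Record<string, unknown>;
type ChangeEvent = { op: "c" | "u" | "d"; before: Row | null; after: Row | null };
type Projected = { key: string; op: ChangeEvent["op"]; changed: Row | null };

// Returns only the columns whose values differ between before and after.
function changedColumns(before: Row, after: Row): Row {
  const diff: Row = {};
  for (const [column, value] of Object.entries(after)) {
    // JSON comparison is a simplification that works for scalars and plain JSON values.
    if (JSON.stringify(before[column]) !== JSON.stringify(value)) diff[column] = value;
  }
  return diff;
}

function project(event: ChangeEvent, pkColumn = "id"): Projected {
  const row = event.after ?? event.before;
  if (!row || row[pkColumn] === undefined) throw new Error("event has no primary key");
  const key = String(row[pkColumn]); // partition key: one entity, one partition

  // Delete becomes a tombstone: the key with no value.
  if (event.op === "d") return { key, op: "d", changed: null };

  if (event.op === "u" && event.before) {
    return { key, op: "u", changed: changedColumns(event.before, event.after ?? {}) };
  }
  // Creates, and updates without a before image, carry the full row.
  return { key, op: event.op, changed: event.after };
}
```

Applied to the `UPDATE` payload shown earlier, `project` yields key `ord_1` with only `status` and `version` in `changed`, since `id` did not move.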

Ordering, duplicates, and delivery semantics

CDC inherits at-least-once delivery from the pipeline: restarts replay from the last acknowledged offset. Consumers must be idempotent.

Per-key ordering

Kafka preserves order within a partition. If you key by user_id, all changes for that user arrive in commit order. There is no global total order across users—usually fine for entity-scoped projections, insufficient for cross-table invariants without application-level coordination.
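The per-key guarantee rests on a deterministic mapping from key to partition. The sketch below uses a 32-bit FNV-1a hash purely for illustration; Kafka's default partitioner uses murmur2, and `partitionFor` is a hypothetical helper, but the property is the same: equal keys always land on the same partition.

```typescript
// Deterministic 32-bit FNV-1a hash over UTF-16 code units (sufficient for
// ASCII keys). This is NOT the hash Kafka uses; it only illustrates the idea.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value unsigned
  }
  return hash;
}

// Maps a record key to a partition index in [0, partitionCount).
function partitionFor(key: string, partitionCount: number): number {
  return fnv1a(key) % partitionCount;
}
```

Because the mapping depends on `partitionCount`, adding partitions to a live topic can route new changes for an existing key to a different partition than its history, which breaks per-key ordering for consumers that are mid-stream.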

The update-delete-update trap

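Log compaction keeps only the latest record per key. If a raw CDC topic is compacted, rapid successive writes to the same row (update, delete, update) can collapse into the final state, and a consumer that reads after compaction never sees the intermediate delete. For CDC, prefer time-based retention on raw topics and let derived topics apply compaction policies explicitly.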
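A toy model shows what a late consumer loses. The `compact` function below simulates only the end state compaction converges to (latest record per key); it is not how the broker implements it, and the `KeyedRecord` shape is an assumption for illustration.

```typescript
type KeyedRecord = { key: string; value: string | null; offset: number };

// Simulates the end state of log compaction: only the latest record for each
// key survives, in offset order.
function compact(log: KeyedRecord[]): KeyedRecord[] {
  const latest = new Map<string, KeyedRecord>();
  for (const record of log) latest.set(record.key, record); // later offsets overwrite
  return Array.from(latest.values()).sort((a, b) => a.offset - b.offset);
}

// update -> delete (tombstone) -> update on the same row
const rawLog: KeyedRecord[] = [
  { key: "ord_1", value: "pending", offset: 0 },
  { key: "ord_1", value: null, offset: 1 },
  { key: "ord_1", value: "paid", offset: 2 },
];

// A consumer that starts after compaction sees a single "paid" record and has
// no way to know the row was deleted and recreated in between.
const compacted = compact(rawLog);
```

A consumer that must react to the delete itself (for example, to clean up dependent state) therefore needs the raw, time-retained topic.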

Dual writes still exist—on the consumer side

CDC eliminates “forgot to publish” on the writer, but consumers still dual-write when they update Elasticsearch and send a webhook in one handler. Apply the same idempotency patterns as webhook receivers: dedupe keys from (table, primary_key, lsn) or an explicit version column.
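One way to sketch this dedupe is a per-row "last applied position" check, where the position is either the LSN or the row's version column. The `applied` map below stands in for a durable store, and `isNewer` and `applyOnce` are hypothetical helpers; a production sink would more likely use a conditional write.

```typescript
// True when `incoming` is strictly newer than the last position applied for a row.
function isNewer(lastApplied: number | undefined, incoming: number): boolean {
  return lastApplied === undefined || incoming > lastApplied;
}

// `applied` stands in for a durable store keyed by table and primary key.
// Returns true when the side effect ran, false for duplicates and stale events.
async function applyOnce(
  applied: Map<string, number>,
  change: { table: string; primaryKey: string; position: number },
  sideEffect: () => Promise<void>,
): Promise<boolean> {
  const key = `${change.table}:${change.primaryKey}`;
  if (!isNewer(applied.get(key), change.position)) return false;
  await sideEffect(); // if this throws, nothing is recorded and redelivery retries
  applied.set(key, change.position);
  return true;
}
```

Recording the position only after the side effect succeeds trades a possible repeat for never losing a change, which is safe as long as the side effect itself is idempotent.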

When CDC is the wrong tool

CDC is a poor fit when:

Business intent must be captured (“customer upgraded plan with proration”) but the database only stores resulting rows.
PII minimization requires stripping fields that are still present in the WAL path. Masking must happen in a governed stream processor, not be left to hope.
Latency under tens of milliseconds is required. CDC adds a decode step and a broker hop; use synchronous reads or read-your-writes routing instead.
The dataset is tiny and stable; a periodic snapshot plus diff may be simpler than operating slots forever.

If every event is a carefully modeled domain command, an outbox remains clearer. If every downstream system needs a mirror of table state, CDC wins.

Practical example: PostgreSQL row to idempotent search indexer

The following TypeScript consumer sketches the happy path: parse a change event, build a deterministic idempotency key, upsert or delete in OpenSearch, and ack only after side effects succeed.

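import { z } from "zod";

// Minimal Debezium-style envelope; real payloads carry more source metadata.
const CdcEventSchema = z.object({
  op: z.enum(["c", "u", "d", "r"]), // create, update, delete, snapshot read
  before: z.record(z.string(), z.unknown()).nullable(),
  after: z.record(z.string(), z.unknown()).nullable(),
  source: z.object({
    table: z.string(),
    lsn: z.number(),
  }),
});

type SearchDoc = { id: string; email: string; status: string; indexedAt: string };

export async function handleUserChange(
  raw: unknown,
  deps: {
    upsertUser: (doc: SearchDoc) => Promise<void>;
    deleteUser: (id: string) => Promise<void>;
    // True if an earlier delivery of this key already completed.
    wasProcessed: (idempotencyKey: string) => Promise<boolean>;
    markProcessed: (idempotencyKey: string) => Promise<void>;
  },
): Promise<void> {
  const event = CdcEventSchema.parse(raw);
  if (event.source.table !== "users") return;

  // Deletes carry the key in `before`; creates, updates, and reads in `after`.
  const rawId = event.after?.id ?? event.before?.id;
  if (rawId === undefined || rawId === null) {
    throw new Error("users change event has no primary key");
  }
  const id = String(rawId);

  const idempotencyKey = `users:${id}:${event.source.lsn}`;
  if (await deps.wasProcessed(idempotencyKey)) return; // duplicate delivery

  if (event.op === "d") {
    await deps.deleteUser(id);
  } else {
    const row = event.after ?? {};
    const doc: SearchDoc = {
      id,
      email: String(row.email),
      status: String(row.status),
      indexedAt: new Date().toISOString(),
    };
    await deps.upsertUser(doc);
  }

  // Record the key only after the side effect succeeded, so a failure above
  // leaves the event eligible for redelivery. Upsert and delete are idempotent,
  // so a crash before this write only causes a harmless repeat.
  await deps.markProcessed(idempotencyKey);
}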

Production extensions you would add immediately:

Dead-letter queue after N failures with the raw payload for replay.
Schema version in the event envelope when columns evolve.
Tracing with W3C trace context propagated from the writer when available (see distributed tracing).
Backpressure: pause partition consumption when the search cluster is red.
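The dead-letter extension can be sketched as a thin wrapper around any handler. The `processWithDlq` name, the `DeadLetter` shape, and the default of three attempts are assumptions; a real deployment would also add backoff between attempts, which is omitted here for brevity.

```typescript
type DeadLetter = { payload: unknown; error: string; attempts: number };

// Runs the handler up to maxAttempts times. If every attempt fails, the raw
// payload goes to the dead-letter sink so it can be inspected and replayed.
async function processWithDlq(
  payload: unknown,
  handler: (payload: unknown) => Promise<void>,
  sendToDlq: (entry: DeadLetter) => Promise<void>,
  maxAttempts = 3,
): Promise<"ok" | "dead-lettered"> {
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await handler(payload);
      return "ok";
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  await sendToDlq({ payload, error: lastError, attempts: maxAttempts });
  return "dead-lettered";
}
```

Returning normally after dead-lettering lets the caller acknowledge the message and keep the partition moving, so one poison message does not block every change behind it. If `sendToDlq` itself throws, the error propagates and the message stays unacknowledged.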

In freelance migrations, this pattern ships faster than rewriting the monolith—provided stakeholders accept eventual consistency SLAs on the search index and you document lag alarms.

Operating CDC: metrics and runbooks

Treat the connector as a tier-1 dependency:

Signal	What it tells you
Replication slot lag (bytes/time)	Risk of disk pressure on primary
Consumer lag per partition	Downstream projections falling behind
Error rate by table	Schema drift or poison messages
End-to-end latency (commit → indexed)	Product-visible freshness

Runbooks should cover: connector restart (safe with idempotent consumers), slot recreation (may require snapshot + backfill), and coordinated schema migrations with consumer deployments.

Common mistakes and pitfalls

Treating CDC events as domain events — orders.status changed from pending to paid is not OrderPaid; finance workflows still belong in explicit publishers or process managers.
Ignoring delete propagation — Updates sync; deletes do not “obviously” remove derived state. Tombstone handling must be designed.
Unbounded WAL retention — A stopped connector without alerts is a timed disk outage.
Wide tables on hot paths — Replicating every column on every touch burns broker bandwidth; publications and column lists should be intentional.
Assuming global ordering — Cross-table sagas need orchestration; CDC alone does not guarantee “inventory decremented before shipment row exists.”
Skipping idempotency — At-least-once is the default; exactly-once end-to-end requires transactional sinks or dedupe stores, not wishful thinking.

Conclusion

Change data capture turns the database into an authoritative change stream without rewriting every application path. It excels at keeping search indexes, caches, warehouses, and read models aligned with committed state. It does not replace domain modeling, outbox workflows, or careful schema discipline.

Key takeaways:

Use CDC for storage-shaped sync; use outbox or explicit publish for intent-shaped workflows.
Operate replication slots, lag, and schema migrations as first-class production concerns.
Design consumers to be idempotent, key-ordered, and explicit about deletes.

Teams building scalable, event-driven backends usually need both patterns in the toolbox, not a single “events” checkbox. If you are untangling a monolith, standing up a new read model, or reviewing whether CDC belongs in your architecture, the contact page is the best place to start a conversation; background on production systems work is on the About page.
