PostgreSQL change data capture in production: when WAL beats the outbox—and when it does not
Compare CDC from PostgreSQL logical replication to transactional outbox relays: ordering, schema evolution, operational cost, and patterns that keep event-driven backends correct under failure.
Your team ships the transactional outbox, runs the relay worker, and still hits a ceiling: publishing lag grows with write volume, schema migrations require coordinated deploys, and every new microservice wants the same row-level stream. Meanwhile, another team points Debezium at the WAL and fans out orders changes to Kafka in minutes. Both approaches emit domain events—but they make different promises about ordering, schema safety, and who owns failure.
Change data capture (CDC) reads the database's write-ahead log (via logical replication) and turns committed row changes into a durable event stream. It is a powerful alternative—or complement—to application-level outbox tables. This article explains how CDC works on PostgreSQL, where it fits relative to the outbox, and what production teams get wrong when they treat the binlog as a product API without guardrails.
The problem CDC solves
Event-driven architectures need a trustworthy stream of facts: OrderPaid, InventoryReserved, UserDeactivated. Application code can write those facts explicitly (outbox) or the infrastructure can derive them from INSERT/UPDATE/DELETE on tables (CDC).
CDC shines when:
- Many consumers need the same table's changes (search indexers, caches, analytics, audit).
- You want to decouple event emission from application deploys—the stream exists even if the API service forgets to publish.
- Write volume makes polling an outbox table expensive or slow.
CDC is a poor default when:
- Events are rich domain objects that do not map 1:1 to rows (aggregates spanning tables, computed fields, privacy redaction).
- You need strict business semantics in the payload (
OrderCancelledBecausePaymentFailedvs a rawstatuscolumn flip). - Your organization is not ready to operate connectors, schema registries, and replay tooling.
Understanding that split keeps you from adopting CDC because it is fashionable, or rejecting it because "we already have an outbox."
How PostgreSQL logical replication enables CDC
PostgreSQL physical replication copies disk blocks; logical replication decodes the WAL into logical change records for subscribed tables. Tools like Debezium (with the PostgreSQL connector) or managed offerings (AWS DMS, GCP Datastream, etc.) sit on that stream and emit JSON/Avro events to Kafka, Pulsar, or webhooks.
At a high level:
flowchart LR
App[Application writes] --> PG[(PostgreSQL)]
PG --> WAL[WAL]
WAL --> Slot[Replication slot]
Slot --> Connector[CDC connector]
Connector --> Bus[Message bus]
Bus --> C1[Search indexer]
Bus --> C2[Cache projector]
Bus --> C3[Analytics]
Key mechanics:
- Publication — defines which tables (or columns) are replicated.
- Replication slot — tracks consumer progress; WAL is retained until the slot catches up (a common disk-pressure failure mode).
- Connector — reads changes, maps types, optionally routes by table/topic, handles heartbeats.
Committed transactions appear in commit order for a single database, which is valuable for projectors building read models—though not a substitute for cross-table saga ordering without additional design.
What a CDC event looks like
Debezium-style payloads typically include before and after row images, operation type (c/u/d), source metadata (LSN, transaction id, table), and timestamps. Consumers branch on op and apply idempotent upserts:
{
"op": "u",
"before": { "id": "ord_42", "status": "pending", "total_cents": 9900 },
"after": { "id": "ord_42", "status": "paid", "total_cents": 9900 },
"source": {
"table": "orders",
"lsn": 2847362519,
"txId": 812044
}
}
That is excellent for data integration; it is often awkward as a public integration contract for partner teams who should not see internal column names or intermediate states.
CDC versus transactional outbox
| Dimension | Transactional outbox | WAL / CDC |
|---|---|---|
| Atomicity with business write | Same transaction as domain row | Implicit—only committed rows appear |
| Event shape | You design JSON schemas | Row images; may need transforms |
| Coupling to schema | Loose—payload is your API | Tight—column renames break consumers |
| Operational owner | App team (table + worker) | Platform / data / SRE |
| Latency | Polling or NOTIFY-bound | Typically lower, streaming |
| Deletes & GDPR | Explicit tombstone events | Hard deletes vanish unless you model them |
| Multi-table aggregates | One outbox row per business event | Requires joins or enrichment downstream |
In consulting engagements, I usually recommend outbox first when the event is a product contract ("notify fulfillment when paid") and CDC when the problem is keeping derived stores in sync with OLTP tables. Mature systems often run both: outbox for partner-facing domain events, CDC for internal projections—provided you document which stream is authoritative for what.
Designing a production CDC pipeline
Publications and least privilege
Create a dedicated replication user with REPLICATION privilege and a publication scoped to required tables—never FOR ALL TABLES in shared databases unless you accept accidental data leakage when someone adds a secrets table.
CREATE PUBLICATION app_cdc FOR TABLE orders, order_lines, inventory_levels;
CREATE USER cdc_replicator WITH REPLICATION LOGIN PASSWORD '...';
GRANT SELECT ON orders, order_lines, inventory_levels TO cdc_replicator;
Column-level publications (PostgreSQL 15+) help when only a subset should leave the database boundary.
Replication slots and WAL retention
A stalled connector holds a slot; PostgreSQL retains WAL, disk fills, writes block. Runbooks need:
- Alerts on replication lag (bytes behind, time behind).
- Maximum slot age policies and on-call steps to drop or reset a poisoned slot (with explicit acceptance of re-snapshot cost).
- Heartbeats from the connector during idle periods so slots advance.
This is the CDC equivalent of "our outbox worker died three days ago"—except the blast radius is cluster-wide storage.
Schema evolution
Row-based events break when you RENAME COLUMN or change types without coordination. Mitigations:
- Expand-contract migrations — add new column, dual-write, migrate consumers, drop old column (same discipline as zero-downtime API changes).
- Schema registry — Avro/Protobuf with compatibility rules for serialized payloads after an enrich/transform step.
- Event enrichment — a stream processor maps
status = 'paid'toOrderPaidv2, hiding storage details.
Treat CDC topics as internal integration until a transform layer produces versioned domain events.
Ordering, duplicates, and idempotency
CDC delivers at-least-once. Connectors may restart from an earlier LSN; consumers must be idempotent (natural keys, UPSERT, or idempotency keys in a store). Per-partition ordering in Kafka helps if you key by aggregate_id, but cross-partition ordering is not guaranteed.
For read-model projectors, store source LSN or transaction id and ignore stale updates:
type OrderRow = { id: string; status: string; total_cents: number };
type DebeziumUpdate = {
op: "c" | "u" | "d";
before: OrderRow | null;
after: OrderRow | null;
source: { lsn: number; table: string };
};
export async function applyOrderChange(
db: { query: (sql: string, params: unknown[]) => Promise<unknown> },
event: DebeziumUpdate,
): Promise<void> {
if (event.op === "d") {
await db.query(
`DELETE FROM read_orders WHERE id = $1 AND last_lsn < $2`,
[event.before!.id, event.source.lsn],
);
return;
}
const row = event.after!;
await db.query(
`
INSERT INTO read_orders (id, status, total_cents, last_lsn)
VALUES ($1, $2, $3, $4)
ON CONFLICT (id) DO UPDATE
SET status = EXCLUDED.status,
total_cents = EXCLUDED.total_cents,
last_lsn = EXCLUDED.last_lsn
WHERE read_orders.last_lsn < EXCLUDED.last_lsn
`,
[row.id, row.status, row.total_cents, event.source.lsn],
);
}
The last_lsn guard prevents an older event from overwriting newer state after a replay or out-of-order edge case.
Initial snapshots
New consumers need a baseline. Debezium can snapshot tables before streaming; large tables require chunked reads, IO awareness, and sometimes off-peak windows. Plan snapshot duration in the same way you plan backfills for new read models—CDC does not remove that cost, it shifts it to connector configuration.
Practical example: hybrid outbox + CDC boundary
Consider checkout: the API must reliably tell the warehouse service an order is paid (business event), and a search service must index order rows for the admin UI (data sync).
Pattern:
- In one transaction: update
orders.status, insert intooutboxwith{ type: "OrderPaid", orderId, paidAt }. - Outbox relay publishes
OrderPaidtoorders.eventsfor fulfillment. - CDC streams
ordersrow changes toorders.cdcfor search/analytics. - Search projector consumes
orders.cdconly; it never drives side effects that charge money.
import type { Pool, PoolClient } from "pg";
type OutboxRow = {
aggregate_id: string;
event_type: string;
payload: Record<string, unknown>;
};
export async function markOrderPaid(
pool: Pool,
orderId: string,
paidAt: string,
): Promise<void> {
const client: PoolClient = await pool.connect();
try {
await client.query("BEGIN");
const updated = await client.query<{ id: string }>(
`
UPDATE orders
SET status = 'paid', paid_at = $2
WHERE id = $1 AND status = 'pending'
RETURNING id
`,
[orderId, paidAt],
);
if (updated.rowCount === 0) {
await client.query("ROLLBACK");
throw new Error("order not pending or not found");
}
const outbox: OutboxRow = {
aggregate_id: orderId,
event_type: "OrderPaid",
payload: { orderId, paidAt },
};
await client.query(
`
INSERT INTO outbox (aggregate_id, event_type, payload)
VALUES ($1, $2, $3::jsonb)
`,
[outbox.aggregate_id, outbox.event_type, JSON.stringify(outbox.payload)],
);
await client.query("COMMIT");
} catch (e) {
await client.query("ROLLBACK");
throw e;
} finally {
client.release();
}
}
Fulfillment subscribes to outbox-derived OrderPaid with a versioned schema. Search subscribes to CDC and rebuilds from snapshots when mappings change. If fulfillment accidentally consumed raw CDC, a bug that set status='paid' without inventory checks could still emit a row update—another reason not to mix money paths with unfiltered table streams.
Wire observability across both paths:
outbox_lag_seconds— time from insert to publish.cdc_connector_lag_bytes— slot lag.projector_apply_latency— consumer behind stream.
When I help teams harden event-driven platforms, these three metrics separate "database is slow" from "bus is slow" from "downstream can't keep up"—without that split, incidents turn into blind restarts.
Common mistakes and pitfalls
- Treating raw CDC as your public API — column leaks, unsafe renames, and internal states become breaking changes for every consumer.
- Ignoring replication slot disk risk — a dev connector left running against production can stall WAL recycling and take down the primary.
- Assuming exactly-once — without idempotent consumers and dedup keys, replays double-charge or double-ship.
- Publishing before expand-contract finishes — consumers see incompatible
afterimages mid-migration. - Using CDC for cross-service orchestration — saga compensation and timeouts belong in workflow code or explicit commands, not inferred from arbitrary
UPDATEs. - Hard deletes without tombstones — GDPR deletes are invisible on the stream; use soft deletes, outbox tombstones, or compacted topics with explicit delete events.
- Skipping snapshot planning — "we'll add search later" becomes a multi-hour snapshot during peak traffic.
- One topic for all tables — complicates ACLs, scaling, and consumer logic; partition by domain boundary.
Conclusion
PostgreSQL CDC turns committed row changes into a scalable, low-latency stream—ideal for projections, analytics, and internal sync when you accept tight schema coupling and real connector operations. The transactional outbox remains the better default for business-critical, semantically rich events that must commit atomically with domain state and evolve on your terms.
The durable lesson: pick the integration boundary deliberately. Use the outbox where the event is the product contract; use CDC where the table is the source of truth for derived data. Operate both with idempotent consumers, explicit lag metrics, and migration discipline—and you get event-driven systems that survive replays, connector restarts, and schema growth without silent data loss.
If you are standing up or untangling an event pipeline and want a second pair of eyes on slot sizing, outbox relay design, or consumer idempotency, get in touch—the goal is the same in every engagement: streams that fail loudly, recover predictably, and stay understandable six months after launch.
订阅邮件通讯
新文章发布时收到邮件。无垃圾信息 — 仅本博客的新文章通知。
由 Resend 发送,可在邮件中退订。