PostgreSQL change data capture in production: when WAL beats the outbox—and when it does not

Compare CDC from PostgreSQL logical replication to transactional outbox relays: ordering, schema evolution, operational cost, and patterns that keep event-driven backends correct under failure.

Autor: Matheus Palma9 min de leitura
Software engineeringArchitectureBackendPostgreSQLMessagingReliabilityEvent-driven architecture

Your team ships the transactional outbox, runs the relay worker, and still hits a ceiling: publishing lag grows with write volume, schema migrations require coordinated deploys, and every new microservice wants the same row-level stream. Meanwhile, another team points Debezium at the WAL and fans out orders changes to Kafka in minutes. Both approaches emit domain events—but they make different promises about ordering, schema safety, and who owns failure.

Change data capture (CDC) reads the database's write-ahead log (via logical replication) and turns committed row changes into a durable event stream. It is a powerful alternative—or complement—to application-level outbox tables. This article explains how CDC works on PostgreSQL, where it fits relative to the outbox, and what production teams get wrong when they treat the binlog as a product API without guardrails.

The problem CDC solves

Event-driven architectures need a trustworthy stream of facts: OrderPaid, InventoryReserved, UserDeactivated. Application code can write those facts explicitly (outbox) or the infrastructure can derive them from INSERT/UPDATE/DELETE on tables (CDC).

CDC shines when:

  • Many consumers need the same table's changes (search indexers, caches, analytics, audit).
  • You want to decouple event emission from application deploys—the stream exists even if the API service forgets to publish.
  • Write volume makes polling an outbox table expensive or slow.

CDC is a poor default when:

  • Events are rich domain objects that do not map 1:1 to rows (aggregates spanning tables, computed fields, privacy redaction).
  • You need strict business semantics in the payload (OrderCancelledBecausePaymentFailed vs a raw status column flip).
  • Your organization is not ready to operate connectors, schema registries, and replay tooling.

Understanding that split keeps you from adopting CDC because it is fashionable, or rejecting it because "we already have an outbox."

How PostgreSQL logical replication enables CDC

PostgreSQL physical replication copies disk blocks; logical replication decodes the WAL into logical change records for subscribed tables. Tools like Debezium (with the PostgreSQL connector) or managed offerings (AWS DMS, GCP Datastream, etc.) sit on that stream and emit JSON/Avro events to Kafka, Pulsar, or webhooks.

At a high level:

flowchart LR
  App[Application writes] --> PG[(PostgreSQL)]
  PG --> WAL[WAL]
  WAL --> Slot[Replication slot]
  Slot --> Connector[CDC connector]
  Connector --> Bus[Message bus]
  Bus --> C1[Search indexer]
  Bus --> C2[Cache projector]
  Bus --> C3[Analytics]

Key mechanics:

  1. Publication — defines which tables (or columns) are replicated.
  2. Replication slot — tracks consumer progress; WAL is retained until the slot catches up (a common disk-pressure failure mode).
  3. Connector — reads changes, maps types, optionally routes by table/topic, handles heartbeats.

Committed transactions appear in commit order for a single database, which is valuable for projectors building read models—though not a substitute for cross-table saga ordering without additional design.

What a CDC event looks like

Debezium-style payloads typically include before and after row images, operation type (c/u/d), source metadata (LSN, transaction id, table), and timestamps. Consumers branch on op and apply idempotent upserts:

{
  "op": "u",
  "before": { "id": "ord_42", "status": "pending", "total_cents": 9900 },
  "after": { "id": "ord_42", "status": "paid", "total_cents": 9900 },
  "source": {
    "table": "orders",
    "lsn": 2847362519,
    "txId": 812044
  }
}

That is excellent for data integration; it is often awkward as a public integration contract for partner teams who should not see internal column names or intermediate states.

CDC versus transactional outbox

DimensionTransactional outboxWAL / CDC
Atomicity with business writeSame transaction as domain rowImplicit—only committed rows appear
Event shapeYou design JSON schemasRow images; may need transforms
Coupling to schemaLoose—payload is your APITight—column renames break consumers
Operational ownerApp team (table + worker)Platform / data / SRE
LatencyPolling or NOTIFY-boundTypically lower, streaming
Deletes & GDPRExplicit tombstone eventsHard deletes vanish unless you model them
Multi-table aggregatesOne outbox row per business eventRequires joins or enrichment downstream

In consulting engagements, I usually recommend outbox first when the event is a product contract ("notify fulfillment when paid") and CDC when the problem is keeping derived stores in sync with OLTP tables. Mature systems often run both: outbox for partner-facing domain events, CDC for internal projections—provided you document which stream is authoritative for what.

Designing a production CDC pipeline

Publications and least privilege

Create a dedicated replication user with REPLICATION privilege and a publication scoped to required tables—never FOR ALL TABLES in shared databases unless you accept accidental data leakage when someone adds a secrets table.

CREATE PUBLICATION app_cdc FOR TABLE orders, order_lines, inventory_levels;

CREATE USER cdc_replicator WITH REPLICATION LOGIN PASSWORD '...';
GRANT SELECT ON orders, order_lines, inventory_levels TO cdc_replicator;

Column-level publications (PostgreSQL 15+) help when only a subset should leave the database boundary.

Replication slots and WAL retention

A stalled connector holds a slot; PostgreSQL retains WAL, disk fills, writes block. Runbooks need:

  • Alerts on replication lag (bytes behind, time behind).
  • Maximum slot age policies and on-call steps to drop or reset a poisoned slot (with explicit acceptance of re-snapshot cost).
  • Heartbeats from the connector during idle periods so slots advance.

This is the CDC equivalent of "our outbox worker died three days ago"—except the blast radius is cluster-wide storage.

Schema evolution

Row-based events break when you RENAME COLUMN or change types without coordination. Mitigations:

  1. Expand-contract migrations — add new column, dual-write, migrate consumers, drop old column (same discipline as zero-downtime API changes).
  2. Schema registry — Avro/Protobuf with compatibility rules for serialized payloads after an enrich/transform step.
  3. Event enrichment — a stream processor maps status = 'paid' to OrderPaid v2, hiding storage details.

Treat CDC topics as internal integration until a transform layer produces versioned domain events.

Ordering, duplicates, and idempotency

CDC delivers at-least-once. Connectors may restart from an earlier LSN; consumers must be idempotent (natural keys, UPSERT, or idempotency keys in a store). Per-partition ordering in Kafka helps if you key by aggregate_id, but cross-partition ordering is not guaranteed.

For read-model projectors, store source LSN or transaction id and ignore stale updates:

type OrderRow = { id: string; status: string; total_cents: number };

type DebeziumUpdate = {
  op: "c" | "u" | "d";
  before: OrderRow | null;
  after: OrderRow | null;
  source: { lsn: number; table: string };
};

export async function applyOrderChange(
  db: { query: (sql: string, params: unknown[]) => Promise<unknown> },
  event: DebeziumUpdate,
): Promise<void> {
  if (event.op === "d") {
    await db.query(
      `DELETE FROM read_orders WHERE id = $1 AND last_lsn < $2`,
      [event.before!.id, event.source.lsn],
    );
    return;
  }

  const row = event.after!;
  await db.query(
    `
    INSERT INTO read_orders (id, status, total_cents, last_lsn)
    VALUES ($1, $2, $3, $4)
    ON CONFLICT (id) DO UPDATE
      SET status = EXCLUDED.status,
          total_cents = EXCLUDED.total_cents,
          last_lsn = EXCLUDED.last_lsn
      WHERE read_orders.last_lsn < EXCLUDED.last_lsn
    `,
    [row.id, row.status, row.total_cents, event.source.lsn],
  );
}

The last_lsn guard prevents an older event from overwriting newer state after a replay or out-of-order edge case.

Initial snapshots

New consumers need a baseline. Debezium can snapshot tables before streaming; large tables require chunked reads, IO awareness, and sometimes off-peak windows. Plan snapshot duration in the same way you plan backfills for new read models—CDC does not remove that cost, it shifts it to connector configuration.

Practical example: hybrid outbox + CDC boundary

Consider checkout: the API must reliably tell the warehouse service an order is paid (business event), and a search service must index order rows for the admin UI (data sync).

Pattern:

  1. In one transaction: update orders.status, insert into outbox with { type: "OrderPaid", orderId, paidAt }.
  2. Outbox relay publishes OrderPaid to orders.events for fulfillment.
  3. CDC streams orders row changes to orders.cdc for search/analytics.
  4. Search projector consumes orders.cdc only; it never drives side effects that charge money.
import type { Pool, PoolClient } from "pg";

type OutboxRow = {
  aggregate_id: string;
  event_type: string;
  payload: Record<string, unknown>;
};

export async function markOrderPaid(
  pool: Pool,
  orderId: string,
  paidAt: string,
): Promise<void> {
  const client: PoolClient = await pool.connect();
  try {
    await client.query("BEGIN");

    const updated = await client.query<{ id: string }>(
      `
      UPDATE orders
      SET status = 'paid', paid_at = $2
      WHERE id = $1 AND status = 'pending'
      RETURNING id
      `,
      [orderId, paidAt],
    );

    if (updated.rowCount === 0) {
      await client.query("ROLLBACK");
      throw new Error("order not pending or not found");
    }

    const outbox: OutboxRow = {
      aggregate_id: orderId,
      event_type: "OrderPaid",
      payload: { orderId, paidAt },
    };

    await client.query(
      `
      INSERT INTO outbox (aggregate_id, event_type, payload)
      VALUES ($1, $2, $3::jsonb)
      `,
      [outbox.aggregate_id, outbox.event_type, JSON.stringify(outbox.payload)],
    );

    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  } finally {
    client.release();
  }
}

Fulfillment subscribes to outbox-derived OrderPaid with a versioned schema. Search subscribes to CDC and rebuilds from snapshots when mappings change. If fulfillment accidentally consumed raw CDC, a bug that set status='paid' without inventory checks could still emit a row update—another reason not to mix money paths with unfiltered table streams.

Wire observability across both paths:

  • outbox_lag_seconds — time from insert to publish.
  • cdc_connector_lag_bytes — slot lag.
  • projector_apply_latency — consumer behind stream.

When I help teams harden event-driven platforms, these three metrics separate "database is slow" from "bus is slow" from "downstream can't keep up"—without that split, incidents turn into blind restarts.

Common mistakes and pitfalls

  1. Treating raw CDC as your public API — column leaks, unsafe renames, and internal states become breaking changes for every consumer.
  2. Ignoring replication slot disk risk — a dev connector left running against production can stall WAL recycling and take down the primary.
  3. Assuming exactly-once — without idempotent consumers and dedup keys, replays double-charge or double-ship.
  4. Publishing before expand-contract finishes — consumers see incompatible after images mid-migration.
  5. Using CDC for cross-service orchestration — saga compensation and timeouts belong in workflow code or explicit commands, not inferred from arbitrary UPDATEs.
  6. Hard deletes without tombstones — GDPR deletes are invisible on the stream; use soft deletes, outbox tombstones, or compacted topics with explicit delete events.
  7. Skipping snapshot planning — "we'll add search later" becomes a multi-hour snapshot during peak traffic.
  8. One topic for all tables — complicates ACLs, scaling, and consumer logic; partition by domain boundary.

Conclusion

PostgreSQL CDC turns committed row changes into a scalable, low-latency stream—ideal for projections, analytics, and internal sync when you accept tight schema coupling and real connector operations. The transactional outbox remains the better default for business-critical, semantically rich events that must commit atomically with domain state and evolve on your terms.

The durable lesson: pick the integration boundary deliberately. Use the outbox where the event is the product contract; use CDC where the table is the source of truth for derived data. Operate both with idempotent consumers, explicit lag metrics, and migration discipline—and you get event-driven systems that survive replays, connector restarts, and schema growth without silent data loss.

If you are standing up or untangling an event pipeline and want a second pair of eyes on slot sizing, outbox relay design, or consumer idempotency, get in touch—the goal is the same in every engagement: streams that fail loudly, recover predictably, and stay understandable six months after launch.

Assine a newsletter

Receba um e-mail quando novos artigos forem publicados. Sem spam — apenas novos posts deste blog.

Via Resend. Você pode cancelar a inscrição em qualquer e-mail.