Change data capture for read models: PostgreSQL logical replication, outbox relays, and production trade-offs
How CDC streams database writes into search indexes and analytics without dual writes, when to choose log-based capture vs transactional outbox, and operational pitfalls in production.
A product team ships a new dashboard that queries Elasticsearch for faceted search across orders, customers, and line items. The write path still lives in PostgreSQL—the system of record. The first implementation updates the database, then calls the search cluster in the same request handler. Under load, partial failures appear: rows exist in Postgres that never indexed, or stale documents linger after deletes. Retries double-index; compensating jobs run nightly to reconcile drift. In consulting work, this is the moment teams realize they need a durable stream of changes rather than another synchronous hop in the hot path.
Change data capture (CDC) turns committed database mutations into an ordered event stream consumers can project into read models, caches, and warehouses. Done well, you stop dual-writing from application code; done poorly, you inherit connector lag, schema breakage, and ordering bugs that are harder to debug than a failed HTTP call. This article explains how CDC fits next to patterns you may already use (transactional outbox, polling workers), what PostgreSQL logical replication gives you, and how to operate the pipeline in production.
Why dual writes fail—and what CDC changes
The failure mode is familiar: two durable systems updated from application code without a single atomic commit.
- Database commits, downstream fails. Postgres has the truth; Elasticsearch, Redis, or a warehouse does not.
- Downstream succeeds, database rolls back. Consumers see ghosts.
- Timeouts obscure outcome. The client retries; without deduplication, side effects multiply.
The transactional outbox solves this by persisting outbound events in the same transaction as the business write, then relaying them asynchronously. CDC is a different entry point to the same goal: the database log becomes the source of outbound events, and application handlers stop calling brokers or search APIs directly.
CDC does not eliminate the need for idempotent consumers or schema discipline. It relocates responsibility: writers only touch Postgres; projectors own downstream consistency.
CDC vs transactional outbox: when to use which
Both patterns avoid dual writes. The trade-off is who produces the event stream and what operational surface you accept.
| Dimension | Transactional outbox | Log-based CDC |
|---|---|---|
| Event shape | Explicit domain events you design | Row-level change records (often debezium-style envelopes) |
| Coupling to schema | Loose—payload is your contract | Tight—table/column changes affect consumers |
| Application code | Insert outbox row in same TX | No publish code in handlers |
| Delivery guarantee | At-least-once via relay | At-least-once from replication slot |
| Ops complexity | Polling or NOTIFY relay you run | Connectors, slots, WAL retention, monitoring |
| Best fit | Rich domain events, bounded contexts | Many tables, analytics/search projections, legacy schemas |
In practice, mature systems often combine them: outbox for explicit domain events consumed by other services; CDC for read-model hydration where row mirrors are enough. I have seen teams start with outbox (full control, smaller blast radius) and add CDC when projection breadth outgrows hand-written relays—inventory, search, CRM sync, and BI all want the same commits.
Ordering and causality
Consumers usually need per-aggregate ordering (all changes for order_id = 42 processed in commit order), not global total order across all tables. CDC streams are ordered per replication origin; partition projectors by business key (aggregate_id, tenant_id, primary key hash) so one slow partition does not block unrelated work.
For cross-table invariants (order header + line items in one user-visible operation), prefer one transaction in Postgres so the log reflects a single commit boundary, or emit a single outbox event that carries the aggregate snapshot. Stitching unrelated row events after the fact is fragile.
PostgreSQL logical replication as a CDC source
PostgreSQL can stream logical change records to a replication slot subscriber. Managed offerings (RDS, Cloud SQL, Aurora) expose similar capabilities with vendor-specific constraints; self-hosted Postgres uses pgoutput or plugins like wal2json / Debezium's PostgreSQL connector.
What you receive
A logical decoding message typically includes:
- Operation —
INSERT,UPDATE,DELETE(and sometimesTRUNCATE) - Schema and table — identifiers for routing
- Before/after row images — depending on
REPLICA IDENTITYand publication settings - LSN — log sequence number for offset tracking
REPLICA IDENTITY FULL on tables where you need old column values on UPDATE/DELETE for correct projections; it increases WAL volume. Default DEFAULT uses primary key only—enough when consumers only need new state.
Publications and slots
You define a publication listing tables (or FOR ALL TABLES with care). A logical replication slot reserves WAL until the consumer acknowledges progress. If the connector stops and WAL accumulates, disk fills—one of the most common production incidents.
Operational checklist:
- Monitor replication lag (bytes behind, time behind)
- Alert on slot inactive beyond a threshold
- Cap connector downtime with runbooks to drop or advance slots only when you accept re-snapshot cost
- Separate slots per consumer class (search vs warehouse) so one slow BI job does not starve search
Initial snapshot + streaming
Connectors usually perform a consistent snapshot (often via EXPORT SNAPSHOT semantics), then switch to streaming. Large tables need chunked snapshot configuration and enough I/O headroom; scheduling initial loads off peak avoids saturating the primary.
Projector design: from change records to read models
A projector is a consumer that applies CDC events to a derived store. Treat it like any other at-least-once worker:
- Deserialize envelope → typed change
- Route by table/tenant to the correct handler
- Apply idempotently to the read model
- Commit consumer offset after successful apply (or use external offset store with the same ordering guarantees your stack provides)
Idempotency strategies
- Natural keys — upsert search documents by
order_id - Version columns — ignore events where
event_lsn≤ storedsource_lsn - Delete tombstones — propagate deletes; soft-delete in source requires explicit handling
Denormalization
Search and analytics models often embed related entities (customer name on an order document). CDC gives you row-level facts; projectors join in flight or maintain side tables in the read store updated from multiple source tables. Document which embeds are eventually consistent with what SLA—dashboard users tolerate seconds; fraud checks may not.
Backfill and replay
Keep the ability to rebuild a read model from snapshot + stream (or snapshot-only for disposable indexes). Store enough in the envelope to reprocess without reading Postgres again when possible. Version your projector code; breaking mapping changes may require a blue/green index (new index alias cutover) rather than in-place mutation.
Practical example: Node.js projector sketch
The following is a simplified consumer loop—not a production connector, but the control flow you want: parse, idempotent upsert, offset commit, structured errors. A real deployment would use Debezium/Kafka, logical replication via pg_logical, or a managed CDC service; the projector logic stays similar.
import type { Pool } from "pg";
type ChangeOp = "c" | "u" | "d"; // create, update, delete (Debezium-style)
interface OrderRow {
id: string;
tenant_id: string;
status: string;
total_cents: number;
updated_at: string;
}
interface CdcEnvelope {
op: ChangeOp;
source: { table: string; lsn: number };
before: OrderRow | null;
after: OrderRow | null;
}
interface SearchClient {
upsertOrder(doc: { id: string; tenantId: string; status: string; totalCents: number; sourceLsn: number }): Promise<void>;
deleteOrder(id: string, tenantId: string): Promise<void>;
}
export async function projectOrderChange(
env: CdcEnvelope,
search: SearchClient,
stateDb: Pool,
): Promise<void> {
if (env.source.table !== "orders") return;
const row = env.after ?? env.before;
if (!row) return;
const client = await stateDb.connect();
try {
await client.query("BEGIN");
const { rows } = await client.query<{ last_lsn: string | null }>(
`SELECT last_lsn FROM projector_offsets
WHERE projector = $1 AND entity_id = $2 AND tenant_id = $3
FOR UPDATE`,
["orders-search", row.id, row.tenant_id],
);
const last = rows[0]?.last_lsn ? Number(rows[0].last_lsn) : -1;
if (env.source.lsn <= last) {
await client.query("COMMIT");
return; // already applied
}
if (env.op === "d") {
await search.deleteOrder(row.id, row.tenant_id);
} else if (env.after) {
await search.upsertOrder({
id: env.after.id,
tenantId: env.after.tenant_id,
status: env.after.status,
totalCents: env.after.total_cents,
sourceLsn: env.source.lsn,
});
}
await client.query(
`INSERT INTO projector_offsets (projector, entity_id, tenant_id, last_lsn)
VALUES ($1, $2, $3, $4)
ON CONFLICT (projector, entity_id, tenant_id)
DO UPDATE SET last_lsn = EXCLUDED.last_lsn`,
["orders-search", row.id, row.tenant_id, String(env.source.lsn)],
);
await client.query("COMMIT");
} catch (err) {
await client.query("ROLLBACK");
throw err;
} finally {
client.release();
}
}
Note the per-entity offset stored in SQL, not only the connector's global LSN. Global offsets are fine for append-only warehouses; fine-grained idempotency helps when you replay a range or run multiple projectors with different speeds.
Wire metrics: cdc_lag_seconds, projector_apply_duration_ms, projector_skipped_stale, projector_failures_total labeled by table and tenant. Alert on lag growth, not on every transient retry.
Schema changes under CDC
Table migrations are the long-term tax on CDC-heavy systems.
- Additive columns — safe if consumers ignore unknown fields
- Renames — often appear as drop + add at the physical level; use expand-contract migrations and dual-write periods, or emit compatibility shims in the connector layer
- Type changes — may require full re-snapshot
- Splitting tables — plan a new publication and parallel projector until cutover
Coordinate with the same expand/contract discipline described for zero-downtime database migrations: the log reflects physical storage, not your domain model's intent.
Security and multi-tenancy
CDC streams contain raw row data. Restrict access to the bus (Kafka ACLs, private networking), encrypt at rest, and avoid landing PII in analytics without purpose limitation. For multi-tenant SaaS, partition topics by tenant_id where isolation requirements demand it, and enforce tenant filters in projectors even if the bus is shared—defense in depth beats trusting one misconfigured subscription.
Limitations you should accept upfront
Eventually consistent read models. CDC introduces lag—milliseconds to seconds under normal load, more during incidents. APIs that need read-your-writes immediately after a mutation must route those reads to the primary or use routing tokens, not the search index.
Not a replacement for service-to-service domain events. Row changes expose storage shape, not necessarily meaningful business events ("OrderShipped" may require joining shipment tables). Use outbox or application events when bounded contexts need stable contracts.
Operational ownership. Slots, WAL, connector restarts, and schema upgrades are on-call surfaces. If the team cannot own that, a managed CDC pipeline or staying with outbox-only may be cheaper than incident load.
Common mistakes and pitfalls
Treating CDC as exactly-once. Delivery is at-least-once unless you build expensive exactly-once semantics end-to-end. Idempotent projectors are mandatory, not optional.
One global consumer for all tables. A poison message or slow handler blocks everything. Isolate by domain table set and use dead-letter queues with replay tooling.
Ignoring REPLICA IDENTITY. Updates that need prior values for partial document updates will silently wrong-shape data if only the primary key is replicated.
Letting replication slots go stale. Disk pressure on the primary can take down writes for the entire product. Automate alerts; practice slot recovery in staging.
Projecting directly from request handlers anyway. Teams add CDC but keep synchronous Elasticsearch calls "for speed," reintroducing dual writes. Pick a consistency model per read path and stick to it.
Skipping rebuild drills. Without a tested full reindex path, the first corrupted index becomes a multi-day outage instead of a runbooked alias swap.
Conclusion
Change data capture moves downstream synchronization out of the request path and onto a durable log derived from the database itself. Paired with idempotent projectors, per-entity versioning, and honest SLAs on read models, it scales search, analytics, and cache hydration without multiplying failure modes in user-facing handlers.
Key takeaways:
- Prefer outbox when you need explicit domain events; prefer CDC when many tables feed read models and you can own connector operations
- Design projectors for at-least-once delivery with LSN or per-entity deduplication
- Treat replication slot lag and schema migrations as first-class operational concerns, not afterthoughts
- Keep a rebuild path for every derived index or warehouse table
For teams building scalable, production-ready data planes—whether migrating off dual writes or standing up search alongside Postgres—the choice between outbox, CDC, or both should be driven by contract clarity and who will carry the pager for the stream. If you want help mapping that boundary for your stack, the contact page is the right place to start.
Suscríbete al boletín
Recibe un correo cuando se publiquen artículos nuevos. Sin spam — solo entradas nuevas de este blog.
Con Resend. Puedes darte de baja en cualquier correo.