The transactional outbox: publishing events without losing data or doubling work

Why dual writes fail between databases and message brokers, how the outbox pattern fixes it with atomic commits, and how to operate workers and polling in production, trade-offs included.

Author: Matheus Palma · 7 min read
Software engineering · Architecture · Backend · Reliability · Messaging · PostgreSQL

A user completes checkout. Your service updates orders to paid, then publishes OrderPaid to a broker so inventory, analytics, and email can react. The database commit succeeds; the publish call fails—or the process crashes before the publish. Now the ledger says “paid” but downstream systems never heard about it. Alternatively, you publish first and the DB rolls back: subscribers see an event for an order that does not exist. In freelance and product work, any workflow that combines durable state with asynchronous notification hits this tension: two resources (database and queue) cannot be updated in one atomic step unless you design for it.

The transactional outbox pattern stores outbound messages in the same database transaction as the business write, then relies on a separate process to deliver them to the broker. This article explains why naive “write then publish” fails, how the outbox works end to end, and what to watch for when you operate it in production.

The problem: dual writes and partial failure

Two common approaches are both wrong under failure:

  1. Database first, then publish. If the transaction commits and the publish fails, you have committed state without a matching message. Retrying publish from the request handler risks duplicates if the first publish actually succeeded but the client timed out.

  2. Publish first, then database. If the DB fails after a successful publish, consumers see events that do not match persisted truth.

Distributed transactions across a DB and Kafka (two-phase commit, XA) exist but are heavy, operationally painful, and a poor fit for many cloud-native stacks. The outbox avoids a cross-system atomic protocol by collapsing the message into the same atomic unit as the business data: one commit to one system.

Core idea: outbox as part of the transaction

Within a single database transaction you:

  1. Apply the business mutation (for example, insert or update rows).
  2. Insert one or more rows into an outbox table representing events to deliver.

Either both succeed or neither does. No cross-system atomicity is required at commit time.

A relay or dispatcher process (separate thread, worker, or sidecar) reads new outbox rows, publishes to the message broker, then marks them processed (or deletes them). If the worker crashes after publish but before the mark, at-least-once delivery to the broker is acceptable if consumers are idempotent—which you typically need anyway.

Schema and event shape

A minimal outbox table often includes:

  • id: surrogate primary key (often bigserial/UUID).
  • aggregate_id: business key for ordering or debugging (optional but useful).
  • event_type: string discriminator for deserialization.
  • payload: JSON or binary serialized event body.
  • created_at: for monitoring lag and TTL policies.
  • published_at: null until successfully sent; supports retry of stuck rows.

Some teams store only the payload and derive type from a wrapper; others normalize for strict versioning. The important part is that the payload is immutable once written: corrections are new events, not edits to outbox rows.
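As a concrete starting point, a minimal PostgreSQL sketch of this table might look like the following; column types and the index name are illustrative choices, not prescriptions:

```sql
CREATE TABLE outbox (
  id            bigserial   PRIMARY KEY,
  aggregate_id  text        NOT NULL,
  event_type    text        NOT NULL,
  payload       jsonb       NOT NULL,
  created_at    timestamptz NOT NULL DEFAULT now(),
  published_at  timestamptz
);

-- A partial index keeps the relay's "unpublished" scan cheap as the table grows.
CREATE INDEX outbox_unpublished_idx ON outbox (id) WHERE published_at IS NULL;
```

The partial index matters once published rows accumulate: the relay only ever queries the unpublished slice, so the index stays small even if the table does not.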

Ordering and partitioning

If consumers must see events for the same aggregate in order, the relay should publish in aggregate_id order (or a single partition key derived from it) so the broker preserves per-key ordering. Global ordering across all aggregates is usually unnecessary and expensive.
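One way to derive a stable partition from aggregate_id is a small dependency-free hash; the sketch below uses FNV-1a, but any stable hash works, and with Kafka you can usually just pass aggregate_id as the message key and let the default partitioner do this for you:

```typescript
// Derive a stable partition for an aggregate id so that all events for the
// same aggregate land on the same broker partition. FNV-1a (32-bit) is one
// simple, dependency-free choice.
function partitionFor(aggregateId: string, partitionCount: number): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < aggregateId.length; i++) {
    hash ^= aggregateId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime
  }
  return (hash >>> 0) % partitionCount; // >>> 0 forces an unsigned 32-bit value
}
```

The only property that matters is determinism: the same aggregate_id must always map to the same partition, so per-key ordering survives relay restarts and rebalances.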

Relay implementation strategies

Polling

The simplest approach: on a timer, SELECT ... WHERE published_at IS NULL ORDER BY id LIMIT N FOR UPDATE SKIP LOCKED (on PostgreSQL), publish each row, then update published_at. SKIP LOCKED lets multiple workers split the queue without double work.

Trade-off: Latency is bounded below by the poll interval unless you add a wake signal.
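A wake signal can be as simple as a promise the poll loop races against its timer; the write path calls wake() right after commit. The names below are illustrative, not from any library:

```typescript
// Minimal wake signal: the poll loop waits for either the interval to elapse
// or an explicit wake() call made right after a transaction inserts outbox rows.
function createWakeSignal() {
  let notify: (() => void) | null = null;
  return {
    // Resolves with "woken" if wake() fires first, "timeout" otherwise.
    wait(timeoutMs: number): Promise<"woken" | "timeout"> {
      return new Promise((resolve) => {
        const timer = setTimeout(() => {
          notify = null;
          resolve("timeout");
        }, timeoutMs);
        notify = () => {
          clearTimeout(timer);
          notify = null;
          resolve("woken");
        };
      });
    },
    // Returns true if a waiter was actually woken.
    wake(): boolean {
      if (notify === null) return false;
      notify();
      return true;
    },
  };
}
```

With this in place, the relay loop does `await signal.wait(pollIntervalMs)` between drains: latency drops to near zero under load while idle polling stays cheap.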

Log-based (CDC)

Tools such as Debezium read the database write-ahead log and emit change events. The outbox table becomes a changelog; an external connector pushes to Kafka. This reduces load on the database from polling and can lower latency.

Trade-off: Operating CDC is its own specialty—schemas, connectors, and failure modes must be owned explicitly.
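For orientation, a Debezium Postgres connector using its outbox event router is configured roughly like this; property names follow Debezium's documented configuration but vary between versions, and the hostnames and credentials are placeholders:

```json
{
  "name": "orders-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "orders",
    "topic.prefix": "orders",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter"
  }
}
```

The event router reads rows from the outbox table and re-emits them as domain events, so consumers never see raw change-capture envelopes.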

In-process dispatcher

For smaller services, a background loop in the same process can drain the outbox. Delivery pauses while the process is down, but correctness is preserved: unprocessed rows remain durable in the database and are drained when the process comes back.

Trade-off: Couples message delivery to application uptime; fine for many monoliths, weaker if the app is scaled down to zero.

Practical example: PostgreSQL transaction plus TypeScript

The following sketch shows the atomic business write + outbox insert. It uses a generic query helper and omits connection pooling details; production code would add structured logging, metrics, and retries around the relay only.

import type { PoolClient } from "pg";

type OrderPaidEvent = {
  eventType: "OrderPaid";
  orderId: string;
  amountCents: number;
  occurredAt: string;
};

export async function markOrderPaid(
  client: PoolClient,
  orderId: string,
  amountCents: number
): Promise<void> {
  await client.query("BEGIN");

  try {
    await client.query(
      `UPDATE orders SET status = 'paid', updated_at = now() WHERE id = $1`,
      [orderId]
    );

    const payload: OrderPaidEvent = {
      eventType: "OrderPaid",
      orderId,
      amountCents,
      occurredAt: new Date().toISOString(),
    };

    await client.query(
      `INSERT INTO outbox (aggregate_id, event_type, payload)
       VALUES ($1, $2, $3::jsonb)`,
      [orderId, payload.eventType, JSON.stringify(payload)]
    );

    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  }
}

A separate relay (pseudocode) might look like:

async function drainOutboxOnce(
  client: PoolClient,
  publish: (topic: string, key: string, body: Buffer) => Promise<void>
) {
  // FOR UPDATE locks only last until the end of the transaction, so the
  // batch must be claimed and marked inside one.
  await client.query("BEGIN");

  try {
    const { rows } = await client.query<{
      id: string;
      aggregate_id: string;
      event_type: string;
      payload: unknown;
    }>(
      `SELECT id, aggregate_id, event_type, payload
       FROM outbox
       WHERE published_at IS NULL
       ORDER BY id
       LIMIT 50
       FOR UPDATE SKIP LOCKED`
    );

    for (const row of rows) {
      const key = row.aggregate_id;
      const topic = `domain.${row.event_type.toLowerCase()}`;
      await publish(topic, key, Buffer.from(JSON.stringify(row.payload)));
      await client.query(
        `UPDATE outbox SET published_at = now() WHERE id = $1`,
        [row.id]
      );
    }

    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  }
}

If publish succeeds and the process dies before the published_at update commits, the next drain may republish the same logical event. Hence subscribers must tolerate duplicates (idempotent handlers, natural keys, or deduplication stores).
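On the consumer side, the simplest form of idempotence is to key on a logical event id (for example the outbox row id, carried in message headers) and skip anything already seen. A minimal in-memory sketch; a production system would back the seen-set with a durable store scoped to the consumer group:

```typescript
// Wraps a handler so that each logical event id is processed at most once
// per process. `seen` is in-memory here for illustration only.
function makeIdempotent<E>(
  eventId: (event: E) => string,
  handle: (event: E) => void
): (event: E) => boolean {
  const seen = new Set<string>();
  return (event) => {
    const id = eventId(event);
    if (seen.has(id)) return false; // duplicate delivery: skip
    handle(event);
    seen.add(id); // record only after the handler succeeds
    return true;
  };
}
```

Note that the id is recorded after the handler runs: if the handler throws, the event is retried, preserving at-least-once semantics within the consumer itself.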

Trade-offs and limitations

Operational surface area. You maintain the outbox table, relay health, lag alerts (oldest unpublished row age), and dead-letter handling for poison messages.

Backpressure. If publishing is slower than business writes, the outbox grows. You need capacity planning and possibly rate limits on the write side during incidents.

Not a replacement for sagas. The outbox ensures delivery of intent; cross-service compensations and long-running workflows still need orchestration or choreography patterns.

Broker-specific features. Exactly-once broker semantics do not remove the need for idempotent consumers when the relay retries.

Common mistakes and pitfalls

Updating the same outbox row. Treat rows as append-only for the relay’s lifecycle. Edits race with the dispatcher and confuse whether an event was emitted.

Omitting SKIP LOCKED or equivalent under concurrency. Multiple relay instances can otherwise block each other or duplicate effort depending on isolation level.

Huge payloads in the outbox. Store references (for example, S3 keys) for large blobs; keep the outbox row small for replication and observability.

Ignoring ordering guarantees. Publishing in primary-key order without aligning to aggregate_id can reorder events that belong to the same entity if inserts interleave.

No monitoring on lag. The outbox silently queues if the relay fails; alert on age of unpublished rows.
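The lag check itself can be a single query, assuming the schema described earlier; alert when the age exceeds your delivery SLO:

```sql
-- Age of the oldest unpublished outbox row; null means the outbox is drained.
SELECT now() - min(created_at) AS oldest_unpublished_age
FROM outbox
WHERE published_at IS NULL;
```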

Conclusion

The transactional outbox solves dual-write failure between your database and your message bus by making outbound events first-class transactional data. One commit gives you a consistent story: either the business change and its notifications are recorded together, or neither is. The relay adds complexity, but that complexity is localized, testable, and far easier to reason about than cross-system two-phase commits or hope-based “publish after commit” calls.

Key takeaways:

  • Combine business writes and outbox inserts in a single database transaction
  • Run a dedicated relay with at-least-once delivery and idempotent consumers
  • Choose polling versus CDC based on scale and operational maturity
  • Monitor outbox lag and design for backpressure

Teams building scalable, production-ready services use this pattern widely for domain events, integration events, and reliable side effects. If you are evaluating messaging guarantees or end-to-end consistency in a system you are designing, the contact page is a straightforward way to reach out.
