Outbound webhook delivery at scale: signing payloads, retry budgets, and dead-letter operations

Ship customer-facing webhooks with at-least-once delivery, HMAC signing, bounded retries, idempotency-friendly event IDs, and operator workflows when endpoints stay broken.

Author: Matheus Palma | June 11, 2026 | 9 min read

Tags: Software engineering, Backend, API design, Webhooks, PostgreSQL, TypeScript, Site reliability engineering

Your product finally exposes webhooks. A customer wires POST /hooks/acme into their ERP, you enqueue order.created after checkout, and the first week feels like a win. Then finance opens a ticket: the same invoice was booked three times. Another customer swears they never received subscription.canceled even though your dashboard shows “delivered.” Meanwhile, on-call is paging because one broken endpoint—returning 500 for twelve hours—has consumed half your worker pool with retries. Outbound webhooks are a distributed system you operate, not a fetch() you fire from a controller and forget.

This article covers how to design at-least-once delivery with verifiable payloads, bounded retries, and operator-grade dead-letter handling. The patterns mirror what you need when integrating payment processors or shipping webhooks inbound—except now you are the sender, and your customers’ endpoints are the unreliable dependency.

Delivery semantics: what you promise (and what you do not)

Most SaaS webhooks are at-least-once: an event may arrive more than once, but it should not silently disappear if your side is healthy. That contract pushes complexity to both sides:

Your platform must persist delivery attempts, retry on transient failures, and surface permanent failures.
Your customer must treat handlers as idempotent (dedupe on event_id, use idempotency keys on side effects).
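On the receiving side, deduplication can be as small as remembering which event ids were already processed. The following is a minimal sketch, not a prescribed implementation: `SeenStore` and `handleOnce` are hypothetical names, and the in-memory set stands in for a durable store that a real handler would need.

```typescript
// Hypothetical receiver-side dedupe. The in-memory Set stands in for a durable
// store, for example a table with a unique index on event_id.
type WebhookEvent = { id: string; type: string; data: unknown };

class SeenStore {
  private seen = new Set<string>();

  // Records the id and returns true only the first time it is seen.
  markIfNew(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }

  // Forgets an id so a later redelivery can be processed again.
  forget(eventId: string): void {
    this.seen.delete(eventId);
  }
}

// Runs the side effect at most once per event id. Duplicates are skipped, and
// the caller should still answer 2xx so the sender stops retrying.
function handleOnce(
  store: SeenStore,
  event: WebhookEvent,
  sideEffect: (e: WebhookEvent) => void,
): "processed" | "duplicate" {
  if (!store.markIfNew(event.id)) return "duplicate";
  try {
    sideEffect(event);
  } catch (err) {
    // Let the sender's retry have another go if the side effect failed.
    store.forget(event.id);
    throw err;
  }
  return "processed";
}
```

With a real database, recording the id and applying the side effect in one transaction avoids the window between the two steps.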

At-most-once (fire-and-forget) is simpler but unacceptable for billing, inventory, or compliance events. Exactly-once end-to-end is a myth across HTTP; do not market it unless you control both endpoints and the storage layer.

Document the contract in your developer docs: HTTP methods, headers, retry schedule, timeout, signature algorithm, and which status codes count as success vs retry vs permanent failure.

Event model: stable IDs, versions, and envelopes

Every outbound message should be a versioned envelope with fields your customers can rely on for years:

Field	Purpose
`id`	Globally unique event id (UUIDv7 or ULID) for idempotent consumption
`type`	Dot-separated name (`order.created`) with a published schema per type
`api_version`	Breaking payload changes bump this; old types may coexist during migration
`created_at`	ISO-8601 timestamp of when the event was recorded, not first delivery attempt
`data`	Type-specific payload; avoid nesting critical ids only inside `data`

Generate id once when the business fact is committed, store it durably, and reuse it across retries. If you mint a new id per attempt, customers cannot dedupe—and you will eventually double-charge someone’s workflow.
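One way to keep ids stable is to build the envelope only from fields that were persisted with the event. A minimal sketch, assuming a hypothetical `StoredEvent` row shape (the field names are illustrative, not a required schema):

```typescript
// Hypothetical shape of a persisted event row; field names are assumptions.
type StoredEvent = {
  id: string; // minted once, when the business fact is committed
  type: string;
  apiVersion: string;
  createdAt: string; // ISO-8601, recorded at commit time
  data: Record<string, unknown>;
};

type Envelope = {
  id: string;
  type: string;
  api_version: string;
  created_at: string;
  data: Record<string, unknown>;
};

// Builds the envelope purely from stored fields, so every retry of the same
// event carries the same id and created_at.
function toEnvelope(event: StoredEvent): Envelope {
  return {
    id: event.id,
    type: event.type,
    api_version: event.apiVersion,
    created_at: event.createdAt,
    data: event.data,
  };
}
```

Because nothing is generated at send time, two attempts for the same stored event serialize to the same body.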

Ordering: usually “none,” sometimes “per resource”

HTTP webhooks rarely guarantee global ordering. If order.updated can arrive before order.created, customers suffer. Mitigations:

Per-resource sequence numbers (order_id + monotonic sequence) so consumers can buffer or reject gaps.
Timestamps are not ordering—clock skew and retries make created_at a weak sort key.
Document that unrelated resources are unordered; do not imply FIFO unless you implement it (partitioned outbox per aggregate_id).
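To make the per-resource sequence idea concrete, here is a sketch of a consumer-side gate that buffers events arriving early and releases them once the gap is filled. `SequenceGate` is a hypothetical helper, and it assumes sequences start at 1 for each resource.

```typescript
// Hypothetical consumer-side gate: releases events for one resource strictly in
// sequence order and buffers anything that arrives early.
type SequencedEvent = { resourceId: string; sequence: number; type: string };

class SequenceGate {
  private nextExpected = new Map<string, number>();
  private buffered = new Map<string, Map<number, SequencedEvent>>();

  // Returns the events that are now safe to apply, in order.
  // Assumes sequences start at 1 for each resource.
  accept(event: SequencedEvent): SequencedEvent[] {
    const expected = this.nextExpected.get(event.resourceId) ?? 1;
    if (event.sequence < expected) return []; // already applied: drop duplicate

    const pending =
      this.buffered.get(event.resourceId) ?? new Map<number, SequencedEvent>();
    pending.set(event.sequence, event);
    this.buffered.set(event.resourceId, pending);

    // Drain every contiguous event starting at the expected sequence.
    const ready: SequencedEvent[] = [];
    let cursor = expected;
    let next = pending.get(cursor);
    while (next !== undefined) {
      ready.push(next);
      pending.delete(cursor);
      cursor += 1;
      next = pending.get(cursor);
    }
    this.nextExpected.set(event.resourceId, cursor);
    return ready;
  }
}
```

A production consumer would also bound the buffer and decide what to do when a gap never closes.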

In consulting engagements, the expensive incidents almost always trace back to implicit ordering assumptions in the customer’s handler, not to TLS or JSON.

Signing payloads: HMAC over the raw body

Customers must verify that events originated from you and were not tampered with in transit. The industry-standard pattern is HMAC-SHA256 over the exact bytes of the request body, with a per-endpoint secret.

Typical headers:

X-Webhook-Id — same as envelope id
X-Webhook-Timestamp — Unix seconds when the delivery attempt was signed (used for replay windows on the receiver)
X-Webhook-Signature — v1=<hex> or t=<ts>,v1=<hex> style

Signing steps:

Read the serialized JSON body as a Buffer or Uint8Array—no pretty-print drift between sign and send.
Construct the signed string (example): ${timestamp}.${bodyUtf8}.
HMAC-SHA256(secret, signed_string) → hex digest.
Send the same bytes you signed.
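The receiving side mirrors these steps. The sketch below assumes the `t=<ts>,v1=<hex>` header style and a signed string of `${timestamp}.${body}`; the 300-second tolerance and the function names are illustrative assumptions, not part of any published contract.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Parses a "t=<ts>,v1=<hex>" header; returns null when it is malformed.
function parseSignatureHeader(
  header: string,
): { timestamp: number; digest: string } | null {
  const parts = new Map<string, string>();
  for (const piece of header.split(",")) {
    const idx = piece.indexOf("=");
    if (idx > 0) parts.set(piece.slice(0, idx).trim(), piece.slice(idx + 1).trim());
  }
  const timestamp = Number(parts.get("t"));
  const digest = parts.get("v1");
  if (!Number.isInteger(timestamp) || !digest) return null;
  return { timestamp, digest };
}

// Verifies the signature over the raw body exactly as received.
// toleranceSec is an assumed replay window, not a standard value.
function verifySignature(
  secret: string,
  rawBody: string,
  header: string,
  nowSec: number,
  toleranceSec = 300,
): boolean {
  const parsed = parseSignatureHeader(header);
  if (!parsed) return false;
  if (Math.abs(nowSec - parsed.timestamp) > toleranceSec) return false;

  const expected = createHmac("sha256", secret)
    .update(`${parsed.timestamp}.${rawBody}`)
    .digest();
  const received = Buffer.from(parsed.digest, "hex");
  // Constant-time comparison; lengths must match before calling it.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The receiver has to verify against the raw request bytes, before any JSON parsing, for the same reason the sender signs the bytes it sends.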

Rotate secrets with dual-active verification windows: issue whsec_new, accept signatures from either secret for 7–14 days, then revoke the old secret. Never log secrets or include them in error payloads.
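During a rotation window, verification simply tries every secret that is still active. A sketch under stated assumptions: `ActiveSecret` and its `expiresAtSec` field are hypothetical, and how secrets are stored and distributed is out of scope here.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical record for a secret in a rotation window.
// expiresAtSec === null marks the current secret with no scheduled revocation.
type ActiveSecret = { value: string; expiresAtSec: number | null };

// Returns true if the received hex digest matches the signed string under any
// secret that has not yet expired.
function matchesAnySecret(
  secrets: ActiveSecret[],
  signedString: string,
  receivedHex: string,
  nowSec: number,
): boolean {
  const received = Buffer.from(receivedHex, "hex");
  let matched = false;
  for (const secret of secrets) {
    if (secret.expiresAtSec !== null && nowSec >= secret.expiresAtSec) continue;
    const expected = createHmac("sha256", secret.value).update(signedString).digest();
    if (expected.length === received.length && timingSafeEqual(expected, received)) {
      matched = true; // keep looping so timing does not reveal which secret hit
    }
  }
  return matched;
}
```

Once the old secret's expiry passes, signatures made with it stop verifying without any code change.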

Retry policy: backoff, budgets, and classification

Retries turn a blip into a storm unless bounded. A practical default for B2B webhooks:

Outcome	HTTP status / condition	Action
Success	`2xx` within timeout	Mark delivered; stop
Retryable	`408`, `429`, `5xx`, connect timeout, reset	Schedule retry with backoff + jitter
Non-retryable	`400`, `401`, `403`, `404`, `410`, malformed URL	Dead-letter immediately
Ambiguous	`2xx` but body says error (avoid this)	Treat as success only if you document it
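The table can be expressed as a single classification function. This is a sketch: `classifyAttempt` is a hypothetical name, and treating every status not listed as retryable (including 3xx and other 4xx codes) as a permanent failure is an assumption you may want to change.

```typescript
type Outcome = "delivered" | "retry" | "dead_letter";

// Maps one attempt's result to an action.
// status === null models a transport failure (connect timeout, reset).
function classifyAttempt(status: number | null): Outcome {
  if (status === null) return "retry";
  if (status >= 200 && status < 300) return "delivered";
  if (status === 408 || status === 429 || status >= 500) return "retry";
  // Assumption: anything else (400, 401, 403, 404, 410, ...) is permanent.
  return "dead_letter";
}
```

Keeping the rule in one place makes it easier to document and to keep the worker and the developer docs in agreement.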

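Use exponential backoff with full jitter: delay = random(0, min(cap, base * 2^attempt)). With base=60s, cap=24h, and max_attempts=12, the retry window is bounded at roughly two and a half days in the worst case (counting the first retry as attempt=1) and averages about half that, because full jitter draws each delay uniformly below its ceiling. That is enough to ride out maintenance windows without generating infinite load.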
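To sanity-check a schedule before publishing it, it helps to compute the ceilings directly. The sketch below assumes the first retry is scheduled with attempt=1 and that the last attempt schedules no further retry; the function names are illustrative.

```typescript
// Upper bound of the delay for a given attempt number.
function backoffCeilingSec(attempt: number, baseSec = 60, capSec = 86_400): number {
  return Math.min(capSec, baseSec * 2 ** attempt);
}

// Full jitter: a uniform draw in [0, ceiling). The random source is injectable
// so the function can be tested deterministically.
function fullJitterDelaySec(
  attempt: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * backoffCeilingSec(attempt));
}

// Sum of ceilings for retries scheduled after attempts 1..maxAttempts-1.
// Assumes the final attempt does not schedule another retry.
function worstCaseTotalSec(maxAttempts: number): number {
  let total = 0;
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    total += backoffCeilingSec(attempt);
  }
  return total;
}
```

Under these assumptions, `worstCaseTotalSec(12)` is 209,160 seconds, about 58 hours; the expected total with full jitter is about half of that.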

Per-endpoint circuit breaking

If an endpoint fails continuously, pause new delivery attempts after N consecutive failures or M attempts in an hour. Surface a “disabled due to failures” state in the customer dashboard and email the admin. Without this, one bad URL becomes a denial-of-service against your own workers.
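A breaker for the consecutive-failure rule needs very little state. This sketch covers only the "N consecutive failures" trigger, not the hourly attempt budget; `EndpointBreaker` is a hypothetical in-memory helper, and a real system would persist the disabled flag next to the endpoint record.

```typescript
// Hypothetical in-memory breaker keyed by endpoint id.
class EndpointBreaker {
  private readonly threshold: number;
  private consecutiveFailures = new Map<string, number>();
  private disabled = new Set<string>();

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Records one delivery outcome. Returns true only when this call trips the
  // breaker, which is the moment to notify the customer's admin.
  record(endpointId: string, succeeded: boolean): boolean {
    if (succeeded) {
      this.consecutiveFailures.set(endpointId, 0);
      return false;
    }
    const failures = (this.consecutiveFailures.get(endpointId) ?? 0) + 1;
    this.consecutiveFailures.set(endpointId, failures);
    if (failures >= this.threshold && !this.disabled.has(endpointId)) {
      this.disabled.add(endpointId);
      return true;
    }
    return false;
  }

  isDisabled(endpointId: string): boolean {
    return this.disabled.has(endpointId);
  }

  // Called when the customer re-enables the endpoint from the dashboard.
  reenable(endpointId: string): void {
    this.disabled.delete(endpointId);
    this.consecutiveFailures.set(endpointId, 0);
  }
}
```

Workers would check `isDisabled` before claiming work for an endpoint and leave its deliveries pending while the breaker is open.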

Concurrency and fairness

Shard work by endpoint_id so one slow customer cannot starve others. Cap in-flight deliveries per endpoint (for example, 5) and global worker concurrency separately.
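The two caps can be tracked with a pair of counters. The sketch below is a single-process illustration with a hypothetical `InFlightLimiter`; a fleet of workers would need the counts in shared storage or enforced in the claim query instead.

```typescript
// Hypothetical single-process limiter with separate per-endpoint and global caps.
class InFlightLimiter {
  private readonly perEndpointCap: number;
  private readonly globalCap: number;
  private perEndpoint = new Map<string, number>();
  private globalInFlight = 0;

  constructor(perEndpointCap: number, globalCap: number) {
    this.perEndpointCap = perEndpointCap;
    this.globalCap = globalCap;
  }

  // Reserves a slot if both caps allow it; returns false otherwise.
  tryAcquire(endpointId: string): boolean {
    const current = this.perEndpoint.get(endpointId) ?? 0;
    if (current >= this.perEndpointCap || this.globalInFlight >= this.globalCap) {
      return false;
    }
    this.perEndpoint.set(endpointId, current + 1);
    this.globalInFlight += 1;
    return true;
  }

  // Frees a slot after the delivery attempt finishes (success or failure).
  release(endpointId: string): void {
    const current = this.perEndpoint.get(endpointId) ?? 0;
    if (current === 0) return;
    this.perEndpoint.set(endpointId, current - 1);
    this.globalInFlight -= 1;
  }
}
```

A worker would call `tryAcquire` before sending and `release` in a `finally` block so a thrown error cannot leak a slot.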

Persistence: outbox, delivery log, and dead letters

The durable spine is usually three tables (names vary):

Events — immutable facts (id, type, payload, created_at).
Deliveries — one row per (event_id, endpoint_id) with state machine: pending → delivering → delivered | failed | dead_lettered.
Attempts — append-only log of each HTTP try (status, duration, error snippet, next_retry_at).

This mirrors the transactional outbox: write the business row and the event in one DB transaction; a relay process fans out to subscribed endpoints. Use FOR UPDATE SKIP LOCKED when claiming pending deliveries so multiple workers scale horizontally without two of them claiming, and double-sending, the same delivery row. PostgreSQL’s row locks make this straightforward.
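Writing the delivery state machine down as data makes illegal transitions easy to reject in code review and at runtime. This sketch encodes one reading of the states listed above; the article does not spell out how `failed` differs from `dead_lettered`, so treating both as states an operator can replay from is an assumption.

```typescript
type DeliveryState =
  | "pending"
  | "delivering"
  | "delivered"
  | "failed"
  | "dead_lettered";

// Allowed transitions. delivering -> pending models a scheduled retry;
// failed/dead_lettered -> pending models an operator replay (assumption).
const ALLOWED_TRANSITIONS: Record<DeliveryState, readonly DeliveryState[]> = {
  pending: ["delivering"],
  delivering: ["delivered", "pending", "failed", "dead_lettered"],
  delivered: [],
  failed: ["pending"],
  dead_lettered: ["pending"],
};

function canTransition(from: DeliveryState, to: DeliveryState): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

The same table can back a database check or a guard in the worker before it issues an UPDATE.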

Dead-letter when retries are exhausted or the failure is permanent. Operators need:

Search by endpoint_id, event_type, time range
Manual replay (single event or batch) after the customer fixes their handler
Payload redaction in UI for PII-heavy events
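Redaction for the operator UI can be a pure function over the stored payload. A minimal sketch, assuming redaction is driven by a set of sensitive key names; `redactValue` and the `[REDACTED]` marker are illustrative choices.

```typescript
// Returns a deep copy of the value with every property whose key is in
// sensitiveKeys replaced by a marker. The input is never mutated.
function redactValue(value: unknown, sensitiveKeys: ReadonlySet<string>): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactValue(item, sensitiveKeys));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = sensitiveKeys.has(key)
        ? "[REDACTED]"
        : redactValue(inner, sensitiveKeys);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Redacting at render time keeps the stored payload intact, so a replay still sends the original bytes.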

Observability: metrics that prevent surprise pages

Instrument at minimum:

webhook_delivery_attempts_total{endpoint, outcome}
webhook_delivery_latency_seconds{endpoint} histogram
webhook_dead_lettered_total{endpoint, reason}
webhook_queue_depth gauge

Alert on dead-letter rate and oldest undelivered event age, not on single 500s. Trace each attempt with the same event_id in logs and spans so support can answer “what happened to evt_…?” without raw SQL.
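The two alert conditions reduce to a small pure function over windowed counts. This is a sketch: the `DeliveryStats` shape, the alert names, and any thresholds you pass in are assumptions to tune against your own traffic.

```typescript
// Hypothetical windowed stats, for example over the last 15 minutes.
type DeliveryStats = {
  deliveriesTotal: number;
  deadLetteredTotal: number;
  oldestPendingCreatedAtMs: number | null; // null when nothing is pending
};

type AlertThresholds = {
  maxDeadLetterRate: number; // fraction between 0 and 1
  maxOldestPendingAgeSec: number;
};

// Returns the names of the alerts that should fire for this window.
function evaluateAlerts(
  stats: DeliveryStats,
  thresholds: AlertThresholds,
  nowMs: number,
): string[] {
  const alerts: string[] = [];
  const rate =
    stats.deliveriesTotal === 0 ? 0 : stats.deadLetteredTotal / stats.deliveriesTotal;
  if (rate > thresholds.maxDeadLetterRate) alerts.push("dead_letter_rate");

  if (stats.oldestPendingCreatedAtMs !== null) {
    const ageSec = (nowMs - stats.oldestPendingCreatedAtMs) / 1000;
    if (ageSec > thresholds.maxOldestPendingAgeSec) alerts.push("oldest_pending_age");
  }
  return alerts;
}
```

Neither condition reacts to a single failed request, which is the point: pages follow sustained delivery problems rather than individual 500s.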

Practical example: delivery worker with signing and bounded retries

The following TypeScript sketch shows a claim → sign → POST → record attempt loop. It intentionally omits subscription management and UI; the focus is the delivery state machine customers feel in production.

import { createHmac, randomUUID } from "node:crypto";

type DeliveryRow = {
  deliveryId: string;
  eventId: string;
  endpointId: string;
  url: string;
  secret: string;
  attempt: number;
  payload: Record<string, unknown>;
};

type Sql = {
  query: <T>(text: string, params?: unknown[]) => Promise<{ rows: T[] }>;
};

const MAX_ATTEMPTS = 12;
const BASE_DELAY_SEC = 60;
const CAP_DELAY_SEC = 86_400;
const REQUEST_TIMEOUT_MS = 10_000;

// Builds the envelope from stored fields only, so id and created_at stay
// identical across every retry of the same event.
function envelope(
  eventId: string,
  type: string,
  createdAt: string,
  data: Record<string, unknown>,
) {
  return {
    id: eventId,
    type,
    api_version: "2026-06-01",
    created_at: createdAt,
    data,
  };
}

// Signs "<timestamp>.<body>" with HMAC-SHA256 and formats the header value.
function signBody(secret: string, timestamp: number, body: string): string {
  const signed = `${timestamp}.${body}`;
  const digest = createHmac("sha256", secret).update(signed).digest("hex");
  return `t=${timestamp},v1=${digest}`;
}

// Full jitter: a uniform draw below the capped exponential ceiling.
function retryDelaySeconds(attempt: number): number {
  const exp = Math.min(CAP_DELAY_SEC, BASE_DELAY_SEC * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || status >= 500;
}

// Claims one due delivery; SKIP LOCKED keeps concurrent workers off the same row.
export async function claimNextDelivery(db: Sql): Promise<DeliveryRow | null> {
  const { rows } = await db.query<DeliveryRow>(
    `UPDATE webhook_deliveries d
        SET state = 'delivering', locked_at = now(), lock_token = $1
      WHERE d.id = (
        SELECT id FROM webhook_deliveries
         WHERE state = 'pending'
           AND (next_retry_at IS NULL OR next_retry_at <= now())
         ORDER BY created_at
         FOR UPDATE SKIP LOCKED
         LIMIT 1
      )
      RETURNING d.id AS "deliveryId", d.event_id AS "eventId",
                d.endpoint_id AS "endpointId", d.url, d.secret,
                d.attempt, d.payload`,
    [randomUUID()],
  );
  return rows[0] ?? null;
}

export async function deliverOne(db: Sql, row: DeliveryRow): Promise<void> {
  // Assumption: the stored payload carries the event's original created_at next
  // to type and data. Regenerating it per attempt would change the body on retry.
  const bodyObj = envelope(
    row.eventId,
    String(row.payload.type),
    String(row.payload.created_at),
    row.payload.data as Record<string, unknown>,
  );
  const body = JSON.stringify(bodyObj);
  const timestamp = Math.floor(Date.now() / 1000);
  const signature = signBody(row.secret, timestamp, body);

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), REQUEST_TIMEOUT_MS);

  let httpStatus = 0; // 0 means no HTTP response was received
  let errorMessage: string | null = null;
  const startedAt = Date.now();

  try {
    const res = await fetch(row.url, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "user-agent": "Acme-Webhooks/1.0",
        "x-webhook-id": row.eventId,
        "x-webhook-timestamp": String(timestamp),
        "x-webhook-signature": signature,
      },
      body,
      signal: controller.signal,
    });
    httpStatus = res.status;
    if (!res.ok) errorMessage = `HTTP ${res.status}`;
  } catch (e) {
    errorMessage = e instanceof Error ? e.message : "network error";
  } finally {
    clearTimeout(timer);
  }
  const durationMs = Date.now() - startedAt;

  const success = httpStatus >= 200 && httpStatus < 300;
  const retryable = !success && (httpStatus === 0 || isRetryableStatus(httpStatus));
  const nextAttempt = row.attempt + 1;

  // Append-only attempt log, including how long the request took.
  await db.query(
    `INSERT INTO webhook_delivery_attempts
       (delivery_id, attempt, http_status, error, duration_ms)
     VALUES ($1, $2, $3, $4, $5)`,
    [row.deliveryId, nextAttempt, httpStatus || null, errorMessage, durationMs],
  );

  if (success) {
    await db.query(
      `UPDATE webhook_deliveries
          SET state = 'delivered', delivered_at = now(), attempt = $2
        WHERE id = $1`,
      [row.deliveryId, nextAttempt],
    );
    return;
  }

  if (!retryable || nextAttempt >= MAX_ATTEMPTS) {
    await db.query(
      `UPDATE webhook_deliveries
          SET state = 'dead_lettered', attempt = $2, last_error = $3
        WHERE id = $1`,
      [row.deliveryId, nextAttempt, errorMessage],
    );
    return;
  }

  const delaySec = retryDelaySeconds(nextAttempt);
  await db.query(
    `UPDATE webhook_deliveries
        SET state = 'pending',
            attempt = $2,
            next_retry_at = now() + ($3 || ' seconds')::interval,
            last_error = $4
      WHERE id = $1`,
    [row.deliveryId, nextAttempt, String(delaySec), errorMessage],
  );
}

Wire this behind a small worker fleet (or a cron-driven loop with concurrency limits). Expose replay by resetting state to pending, clearing next_retry_at, and optionally bumping a replay_count for audit.
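Replay can be sketched as a single guarded UPDATE. The example below only builds the query so it stays independent of any database client. Resetting `attempt` to 0 is an assumption (without it, a delivery that exhausted its budget would dead-letter again after one failure), and `replay_count` is the optional audit column mentioned above.

```typescript
type Query = { text: string; params: unknown[] };

// Builds the replay statement for a batch of dead-lettered deliveries.
// Only rows still in 'dead_lettered' are touched, so a double click is harmless.
function buildReplayQuery(deliveryIds: string[]): Query {
  return {
    text: `UPDATE webhook_deliveries
              SET state = 'pending',
                  next_retry_at = NULL,
                  attempt = 0,
                  replay_count = replay_count + 1
            WHERE id = ANY($1)
              AND state = 'dead_lettered'`,
    params: [deliveryIds],
  };
}
```

An operator tool would pass the result to the same `db.query(text, params)` interface the worker uses. If attempt numbers in the append-only log must stay unique per delivery, keep the counter and raise the budget for replayed rows instead of resetting it.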

Common mistakes and pitfalls

New event id per retry — breaks customer idempotency; one business fact must map to one id.
Signing parsed-then-reserialized JSON — whitespace and key order change the digest; sign the bytes you send.
Retrying 400 forever — wastes capacity; dead-letter and notify the customer.
No per-endpoint circuit breaker — one bad URL degrades delivery for everyone.
Treating 2xx with an error JSON body as failure — pick one rule; mixing confuses retries.
Unbounded worker concurrency — duplicates effort under load; use SKIP LOCKED claims and per-endpoint caps.
Missing operator replay — without it, support resets rows by hand in production.
Omitting 410 Gone handling — if a customer deletes an endpoint, stop retrying immediately.

Conclusion

Outbound webhooks reward the same discipline as payment APIs: durable events, cryptographic authenticity, honest retry semantics, and visibility when delivery fails for good. At-least-once delivery is the right default if you pair it with stable event ids and clear documentation; bounded backoff and circuit breaking keep one broken endpoint from becoming your incident.

Teams shipping integrations for the first time often underestimate the operations surface—dead letters, replays, and signing rotations are where production maturity shows up. Getting this layer right is high-leverage work when you are building platforms that other engineering teams depend on, whether in-house or as part of helping a client harden a B2B API for real customer endpoints.
