Outbound webhook delivery: signing, retries, and operational guarantees

How to ship webhook notifications customers can trust: HMAC signing, delivery IDs, exponential backoff with jitter, DLQs, and dashboards that make at-least-once delivery survivable.

Author: Matheus Palma · June 9, 2026 · 9 min read

Tags: Software engineering, Backend, API design, Reliability, Webhooks, TypeScript

Your SaaS exposes a webhook so customers can react to order.created in their own systems. A merchant’s endpoint is down for twenty minutes during a deploy. When it comes back, they expect every event they missed—or at least a clear way to reconcile. Instead they get a burst of duplicates, a few permanently lost notifications, and a support ticket claiming your platform “doesn’t deliver webhooks reliably.” In consulting engagements, outbound webhook delivery is where API product promises meet queue semantics: you are building a miniature message broker with per-tenant endpoints, uneven SLAs, and zero control over the receiver’s quality.

This article covers the sender side: how to sign payloads, choose retry and response semantics, deduplicate on your side, and operate delivery with metrics that reflect customer reality. It pairs naturally with receiver-side verification and idempotency; together they form an end-to-end contract both teams can implement without guessing.

Why outbound webhooks are harder than “POST on state change”

A naïve implementation fires fetch(customerUrl, { body: event }) inside the request handler that committed business state. That fails in predictable ways:

Coupling latency to customer uptime. A slow or hung subscriber blocks your checkout or signup path, or you silently drop events when you move the HTTP call to a fire-and-forget task without durability.
Partial failure ambiguity. Your process crashes after the customer returns 200 but before you record success—or after you record success but before they finish processing. Retries and support both become guesswork.
Thundering retries. Fixed-interval retries align across tenants and can DDoS a recovering endpoint, triggering more failures and more retries.
Security asymmetry. You sign nothing; customers cannot distinguish your traffic from an attacker with a guessed URL.

Production webhook systems therefore separate event capture (durable, in your transaction boundary) from delivery (async workers with explicit retry policy) and treat HTTP status codes as signals, not proof of business processing.

Architecture: capture, queue, deliver, reconcile

A durable pipeline has four stages:

Stage	Responsibility	Failure mode to design for
Capture	Persist “this event must be delivered” atomically with business write	Lost events if not transactional
Enqueue	Hand off to a delivery subsystem (outbox relay, queue, or jobs table)	Stuck rows, duplicate enqueue
Deliver	HTTP POST with signing, timeouts, and response classification	Timeouts, 5xx, rate limits
Reconcile	Dashboards, manual replay, optional event backfill API	Customer disputes, poison endpoints

The transactional outbox pattern fits capture well: insert into webhook_deliveries (or a generic outbox) in the same database transaction as the domain mutation, then let workers drain the table. If you already publish domain events to Kafka, a dedicated webhook projector can fan out per subscription—same delivery semantics, different ingress.
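As a minimal sketch of the capture step, assume a PostgreSQL-style client that exposes query(sql, params) on an already-open transaction. The Tx interface, the column names, and captureEvent are hypothetical names chosen for illustration, not part of any specific library.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical minimal transaction handle; adapt to your driver or ORM.
export interface Tx {
  query(sql: string, params: unknown[]): Promise<void>;
}

export interface SubscriptionRef {
  id: string;
  eventTypes: string[]; // event filters configured on the endpoint
}

// Inserts one delivery row per matching subscription. The caller is expected to
// run this inside the same transaction as the domain write, so both commit or
// roll back together.
export async function captureEvent(
  tx: Tx,
  event: { id: string; type: string; payload: Record<string, unknown>; createdAt: Date },
  subscriptions: SubscriptionRef[],
): Promise<string[]> {
  const deliveryIds: string[] = [];
  for (const sub of subscriptions) {
    if (!sub.eventTypes.includes(event.type)) continue; // respect event filters
    const deliveryId = randomUUID();
    await tx.query(
      `INSERT INTO webhook_deliveries
         (id, subscription_id, event_id, event_type, payload, created_at, attempt, status, next_attempt_at)
       VALUES ($1, $2, $3, $4, $5, $6, 0, 'pending', $6)`,
      [deliveryId, sub.id, event.id, event.type, JSON.stringify(event.payload), event.createdAt],
    );
    deliveryIds.push(deliveryId);
  }
  return deliveryIds;
}
```

The important property is not the SQL itself but the boundary: if the order insert rolls back, so do the delivery rows, and if it commits, the workers are guaranteed to see them.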

Per-subscription configuration

Each customer endpoint needs metadata you enforce at delivery time:

URL (HTTPS only in production; reject private IP ranges unless you operate a secure tunnel product).
Signing secret (generated per endpoint; rotatable without downtime if you support overlapping secrets).
Event filters (order.created but not order.updated).
Retry policy (max attempts, backoff ceiling, whether to disable after repeated failure).
Custom headers (some enterprises require static API keys in addition to signatures—document that signatures are the authenticity guarantee).

Store secrets encrypted at rest. Never log the raw secret or full signed payload in production logs.
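The URL rule is the one most worth enforcing in code when a subscription is created or updated. The sketch below is one possible check, with assumptions stated plainly: rejectTargetUrl is a hypothetical helper, it only inspects literal IP hosts, and the list of ranges is not exhaustive (for example, IPv4-mapped IPv6 addresses are not handled). Hostnames still need to be resolved and re-checked at connect time.

```typescript
import { isIP } from "node:net";

// Returns a reason string when the URL should be rejected, or null when it passes.
export function rejectTargetUrl(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return "not a valid URL";
  }
  if (url.protocol !== "https:") return "HTTPS is required";
  if (url.username || url.password) return "credentials in URL are not allowed";

  const host = url.hostname.replace(/^\[|\]$/g, "").toLowerCase(); // strip IPv6 brackets
  if (host === "localhost") return "loopback host";

  const version = isIP(host);
  if (version === 4) {
    const [a, b] = host.split(".").map(Number);
    if (a === 0 || a === 10 || a === 127) return "loopback, private, or unspecified IPv4";
    if (a === 172 && b >= 16 && b <= 31) return "RFC 1918 range";
    if (a === 192 && b === 168) return "RFC 1918 range";
    if (a === 169 && b === 254) return "link-local range";
  }
  if (version === 6) {
    if (host === "::" || host === "::1") return "loopback or unspecified IPv6";
    const first = parseInt(host.split(":")[0] || "0", 16); // first 16-bit group
    if ((first & 0xfe00) === 0xfc00) return "unique local IPv6 (fc00::/7)";
    if ((first & 0xffc0) === 0xfe80) return "link-local IPv6 (fe80::/10)";
  }
  return null;
}
```

Running the same check again inside the delivery worker, against the address the connection actually resolves to, closes the gap where a public hostname is later repointed at an internal address.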

Signing: give subscribers a verification story

Mirror what mature providers do: sign the raw body with HMAC-SHA256 using a per-endpoint secret, and include a timestamp so subscribers can bound replay windows.

Recommended headers (names are illustrative; document yours and version them):

X-Webhook-Id: unique delivery id (UUID), stable across retry attempts of the same delivery so subscribers can dedupe on it.
X-Webhook-Timestamp: Unix seconds when the signature was computed.
X-Webhook-Signature: v1=<hex>, the HMAC over timestamp + "." + rawBody.

import crypto from "node:crypto";

export function signWebhookPayload(
  secret: string,
  rawBody: string,
  timestampSec: number,
): { signature: string; timestamp: string } {
  const timestamp = String(timestampSec);
  const signedPayload = `${timestamp}.${rawBody}`;
  const digest = crypto.createHmac("sha256", secret).update(signedPayload, "utf8").digest("hex");
  return { signature: `v1=${digest}`, timestamp };
}

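Why the timestamp is part of the signed string: an attacker who replays a captured body and signature cannot change the timestamp without invalidating the signature, so subscribers can reject anything outside their tolerance window. Recommend a ±5 minute skew window and document clock sync expectations.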

Publish a verification guide with copy-paste examples in Node, Python, and Go. Receiver implementations that parse JSON before verifying are the most common integration bug on the customer side; your docs should scream “verify the raw bytes first.”
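A Node example for that guide could look like the sketch below. It assumes the header format described above (v1=<hex> over timestamp + "." + rawBody) and that rawBody is the request body exactly as received, decoded as UTF-8 and not re-serialized. verifyWebhookSignature is a hypothetical name.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

export function verifyWebhookSignature(
  secret: string,
  rawBody: string, // body exactly as received, before any JSON parsing
  timestampHeader: string,
  signatureHeader: string,
  nowSec: number = Math.floor(Date.now() / 1000),
  toleranceSec = 300, // the ±5 minute window
): boolean {
  // Reject missing, malformed, or stale timestamps before doing any crypto.
  const ts = Number(timestampHeader);
  if (!Number.isInteger(ts) || Math.abs(nowSec - ts) > toleranceSec) return false;
  if (!signatureHeader.startsWith("v1=")) return false;

  const expected = createHmac("sha256", secret)
    .update(`${timestampHeader}.${rawBody}`, "utf8")
    .digest();
  const received = Buffer.from(signatureHeader.slice(3), "hex");

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Only after this returns true should the handler call JSON.parse on the body. The constant-time comparison avoids leaking how many leading bytes of a forged signature were correct.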

Delivery worker: timeouts, classification, and idempotency

Every attempt for a given delivery row carries the same X-Webhook-Id, so workers can treat a retry as an idempotent repeat of the same request. Subscribers dedupe on that header (or on a body field you also include, such as event_id).

Classify HTTP outcomes explicitly

Response	Worker action	Rationale
`2xx` within timeout	Mark delivered; stop retries	Subscriber accepted responsibility to process
`410 Gone`	Disable subscription; alert customer	Endpoint permanently removed
`429`	Retry respecting `Retry-After` if present	Subscriber asked for backoff
`408`, `5xx`, network error	Retry with backoff	Transient or unknown
`4xx` (except 408/429)	Retry limited times, then dead-letter	Likely misconfiguration; don’t retry forever

Use short client timeouts (for example 5–10 s connect + response). A hung subscriber must not hold worker threads. This is independent of how long the customer takes to process the event after returning 200—that is their problem unless you offer a callback protocol.
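Honoring Retry-After on a 429 means handling both forms the header can take: a number of seconds or an HTTP date. The helper below is a sketch; retryAfterMs and the one-hour clamp are assumptions for illustration, not a prescribed policy.

```typescript
// Converts a Retry-After header into a delay in milliseconds.
// Returns null when the header is absent or unparseable, so the caller can
// fall back to its own backoff schedule.
export function retryAfterMs(
  header: string | null,
  nowMs: number,
  maxMs = 3_600_000, // never let a subscriber push a retry out indefinitely
): number | null {
  if (header === null) return null;
  const value = header.trim();

  let delayMs: number;
  if (/^\d+$/.test(value)) {
    delayMs = Number(value) * 1000; // delta-seconds form, e.g. "120"
  } else {
    const at = Date.parse(value); // HTTP-date form
    if (Number.isNaN(at)) return null;
    delayMs = at - nowMs;
  }
  // Dates in the past become 0; very large values are clamped.
  return Math.min(Math.max(delayMs, 0), maxMs);
}
```

A worker might take the larger of this value and its own backoff delay, so a subscriber can ask for more time but cannot force tighter retries than your policy allows.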

Exponential backoff with jitter

Retries without jitter synchronize across your fleet and hammer a recovering site. A practical schedule:

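Delay: min(cap, base * 2^attempt) + random(0, base)
Cap: the ceiling on any single delay, not on the whole schedule
Total retry window: 24–72 hours for SaaS products, depending on contract
Max attempts: 8–12 over that window; tune base and cap so the delays actually add up to it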

export function nextRetryDelayMs(attempt: number, baseMs = 60_000, capMs = 3_600_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  const jitter = Math.floor(Math.random() * baseMs);
  return exp + jitter;
}

Persist next_attempt_at on each delivery row so workers can scan due work with an index-friendly query instead of sleeping in memory.
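It is worth checking that a chosen base, cap, and attempt count actually add up to the retry window you promise. The sketch below sums the un-jittered delays; retryWindowMs is a hypothetical helper, and it assumes attempt numbering starts at 0 for the first retry.

```typescript
// Sum of un-jittered delays for the first `retries` retries (attempt = 0..retries-1).
// Jitter adds, on average, another base/2 per retry on top of this.
export function retryWindowMs(retries: number, baseMs = 60_000, capMs = 3_600_000): number {
  let total = 0;
  for (let attempt = 0; attempt < retries; attempt++) {
    total += Math.min(capMs, baseMs * 2 ** attempt);
  }
  return total;
}

const HOUR_MS = 3_600_000;
// 12 retries with a 1 minute base and a 1 hour cap:
console.log((retryWindowMs(12) / HOUR_MS).toFixed(2)); // 7.05
// Same schedule with an 8 hour cap:
console.log((retryWindowMs(12, 60_000, 8 * HOUR_MS) / HOUR_MS).toFixed(2)); // 32.52
```

With a one-hour cap, twelve retries cover roughly seven hours, well short of a 24-hour window. Raising the cap (or the base) is what stretches the schedule; adding attempts alone only adds one cap-sized step each.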

At-least-once is the honest guarantee

You will deliver at least once. Document that clearly. Subscribers must implement idempotent handlers keyed by event_id or X-Webhook-Id. Your job is to avoid unnecessary duplicates (don’t retry 2xx) while accepting that crash windows may produce a second POST after an unrecorded 200.

If customers need more than push delivery can promise, add an events backfill API (GET /v1/events?since=cursor) so they can reconcile without relying solely on webhooks.
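The cursor contract for such an endpoint can be small. The sketch below uses an in-memory array as a stand-in for a query along the lines of WHERE seq > cursor ORDER BY seq LIMIT n; the seq column, the response shape, and listEventsSince are all assumptions made for illustration.

```typescript
export type StoredEvent = { seq: number; id: string; type: string };

export function listEventsSince(
  events: StoredEvent[], // assumed sorted by ascending seq
  cursor: number, // last seq the customer has processed; 0 to start from the beginning
  limit = 100,
): { data: StoredEvent[]; next_cursor: number; has_more: boolean } {
  const remaining = events.filter((e) => e.seq > cursor);
  const data = remaining.slice(0, limit);
  // When the page is empty, hand back the same cursor so polling is safe to repeat.
  const next_cursor = data.length > 0 ? data[data.length - 1].seq : cursor;
  return { data, next_cursor, has_more: remaining.length > data.length };
}
```

A monotonically increasing sequence is easier to reason about than a timestamp cursor, because two events can share a timestamp but not a sequence number.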

Practical example: outbox row to signed POST

The following sketch ties capture, signing, and delivery classification into one worker loop. It omits framework wiring and uses a single webhook_deliveries table; adapt types to your ORM.

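// Reuses signWebhookPayload from the signing section above.

type DeliveryRow = {
  id: string; // delivery id, sent as X-Webhook-Id and stable across retries
  subscription_id: string;
  event_id: string;
  event_type: string;
  payload: Record<string, unknown>;
  created_at: string; // ISO 8601 time the domain event was captured
  attempt: number;
  target_url: string;
  signing_secret: string;
};

type DeliveryResult = "delivered" | "retry" | "dead_letter" | "disabled";

// Serialize once per attempt; sign and send exactly these bytes.
function buildBody(row: DeliveryRow): string {
  return JSON.stringify({
    id: row.event_id,
    type: row.event_type,
    data: row.payload,
    // Taken from the row, not the clock, so retries and replays carry the same body.
    created_at: row.created_at,
  });
}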

export async function attemptDelivery(row: DeliveryRow): Promise<DeliveryResult> {
  const rawBody = buildBody(row);
  const ts = Math.floor(Date.now() / 1000);
  const { signature, timestamp } = signWebhookPayload(row.signing_secret, rawBody, ts);

  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 10_000);

  try {
    const res = await fetch(row.target_url, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "user-agent": "YourProduct-Webhooks/1.0",
        "x-webhook-id": row.id,
        "x-webhook-timestamp": timestamp,
        "x-webhook-signature": signature,
      },
      body: rawBody,
      signal: controller.signal,
    });

    if (res.status >= 200 && res.status < 300) return "delivered";
    if (res.status === 410) return "disabled";
    if (res.status === 429 || res.status >= 500 || res.status === 408) return "retry";
    if (res.status >= 400) return row.attempt >= 3 ? "dead_letter" : "retry";
    return "retry";
  } catch {
    return "retry";
  } finally {
    clearTimeout(timeout);
  }
}

// Worker: SELECT ... WHERE status = 'pending' AND next_attempt_at <= now() FOR UPDATE SKIP LOCKED

Use FOR UPDATE SKIP LOCKED (or equivalent lease columns) so horizontally scaled workers do not deliver the same row concurrently. If subscribers want to detect out-of-order retries, also include a fencing-style attempt counter in the payload or headers.
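Once a worker has claimed a row and attemptDelivery has returned, it still has to write the outcome back. The sketch below shows one way to express that transition as a pure function. applyResult, the status values, and the maxAttempts default are assumptions; the delay formula is an inlined copy of the backoff schedule described earlier.

```typescript
type DeliveryResult = "delivered" | "retry" | "dead_letter" | "disabled";

type DeliveryState = {
  status: "pending" | "delivered" | "dead_letter" | "disabled";
  attempt: number;
  next_attempt_at: Date | null;
};

export function applyResult(
  current: { attempt: number },
  result: DeliveryResult,
  now: Date,
  opts = { maxAttempts: 10, baseMs: 60_000, capMs: 3_600_000 },
  random: () => number = Math.random, // injectable so tests are deterministic
): DeliveryState {
  const attempt = current.attempt + 1; // this attempt has now been made

  // Terminal outcomes: nothing further to schedule.
  if (result !== "retry") return { status: result, attempt, next_attempt_at: null };

  // Retry budget exhausted: park the row for manual inspection or replay.
  if (attempt >= opts.maxAttempts) {
    return { status: "dead_letter", attempt, next_attempt_at: null };
  }

  const delayMs =
    Math.min(opts.capMs, opts.baseMs * 2 ** current.attempt) +
    Math.floor(random() * opts.baseMs);
  return { status: "pending", attempt, next_attempt_at: new Date(now.getTime() + delayMs) };
}
```

Writing the returned state in the same transaction that holds the row lock keeps the claim, the attempt counter, and next_attempt_at consistent even if the worker crashes right afterwards.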

Operations: what to measure and show customers

Internal metrics that correlate with support tickets:

Delivery lag — now - event.created_at at success; p50/p95 per event type.
Attempt histogram — how many deliveries succeed on first try vs retry.
Terminal failure rate — rows reaching dead-letter per subscription.
Disable rate — subscriptions auto-disabled after policy thresholds.
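Delivery lag is the metric most often computed loosely, so here is a small sketch of one way to do it. It uses the nearest-rank percentile definition, which is an assumption; your metrics backend may interpolate differently. deliveryLagMs and percentile are hypothetical helpers.

```typescript
// Lag for one successful delivery: time from event creation to the 2xx response.
export function deliveryLagMs(eventCreatedAt: string, deliveredAt: Date): number {
  return deliveredAt.getTime() - Date.parse(eventCreatedAt);
}

// Nearest-rank percentile over a set of samples (p in 0..100).
export function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

Grouping the samples by event type before calling percentile gives the per-type p50 and p95 described above, and keeps one slow event type from hiding behind a healthy global average.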

Customer-facing dashboards should show recent deliveries with status, HTTP code, attempt count, and a manual replay button that enqueues a new delivery row (new X-Webhook-Id, same event_id) without re-emitting domain side effects. Replays are essential when a customer fixes a bug in their handler and needs historical events.
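A replay, in code, is just a new delivery row that points at the same event. The sketch below assumes a row shape similar to the webhook_deliveries table used earlier; the replay_of column and buildReplay are hypothetical additions for traceability.

```typescript
import { randomUUID } from "node:crypto";

type Delivery = {
  id: string;
  subscription_id: string;
  event_id: string;
  event_type: string;
  payload: Record<string, unknown>;
  created_at: string; // when the domain event happened; unchanged by replay
  attempt: number;
  status: string;
  next_attempt_at: Date;
  replay_of: string | null;
};

// Builds the row a "Replay" button would insert. No domain code runs:
// the event payload is copied, not regenerated.
export function buildReplay(original: Delivery, now: Date, newId: string = randomUUID()): Delivery {
  return {
    ...original,
    id: newId, // new X-Webhook-Id
    attempt: 0,
    status: "pending",
    next_attempt_at: now, // eligible for pickup immediately
    replay_of: original.id, // lets the dashboard link replay to original
  };
}
```

Because event_id is unchanged, a subscriber that dedupes on it will treat the replay as a duplicate. Customers who replay to reprocess events after a bug fix need to know which key their handler uses.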

Alert on:

Sudden spike in 4xx for a single subscription (misdeploy on their side).
Global spike in 5xx to many URLs (your egress or DNS problem).
Outbox depth growing monotonically (worker outage or DB lock contention).

Common mistakes and pitfalls

Synchronous delivery in the request path. It couples your SLA to every customer’s worst endpoint. Always decouple with durable capture.
Retrying forever on 4xx. A bad URL or auth misconfiguration will never succeed; you waste resources and erode trust. Cap attempts and surface failures visibly.
No delivery id separate from business event id. Retries should be distinguishable for debugging; business id stays stable for subscriber idempotency.
Logging full payloads containing PII. Webhook bodies often include emails and addresses. Log ids and types; offer secure replay in the dashboard instead.
Allowing http:// or internal URLs without SSRF controls. An attacker can register http://169.254.169.254/ as an endpoint and turn your workers into a probe of your internal network. Enforce HTTPS, block loopback, link-local, and RFC 1918 targets (checking the resolved address, not just the hostname), or route through a controlled egress proxy.
Treating 200 with an error body as success. Some frameworks always return 200. Document that only status codes matter unless you define a structured error contract—and even then, prefer proper HTTP semantics.
Secret rotation without overlap. Support two active signing secrets during rotation so subscribers can update verifiers without missing events.
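For the rotation pitfall, one possible sender-side approach is to emit a signature for every currently active secret and let the receiver accept a match on any of them. The comma-separated header format below is an assumption, not something defined earlier in this article, and signWithActiveSecrets is a hypothetical helper.

```typescript
import { createHmac } from "node:crypto";

// Produces a header value such as "v1=<hex for new secret>,v1=<hex for old secret>".
export function signWithActiveSecrets(
  secrets: string[], // newest first; usually one entry, two during rotation
  rawBody: string,
  timestampSec: number,
): string {
  if (secrets.length === 0) throw new Error("at least one active secret is required");
  const signedPayload = `${timestampSec}.${rawBody}`;
  return secrets
    .map((secret) => {
      const digest = createHmac("sha256", secret).update(signedPayload, "utf8").digest("hex");
      return `v1=${digest}`;
    })
    .join(",");
}
```

During the overlap window a receiver still configured with the old secret keeps verifying, and one already updated to the new secret verifies too. Once the window closes, the old secret is dropped from the list and the header shrinks back to a single value.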

Conclusion

Outbound webhooks are a product surface, not a side effect. Durable capture, signed payloads, disciplined retry classification, and operator-visible delivery state turn “we POST JSON when something happens” into something enterprises can build on. Assume at-least-once delivery, make duplicates harmless via stable event ids, and give customers reconciliation tools when push fails.

The patterns here (outbox capture, HMAC signing, backoff with jitter, and dead-letter handling) show up across billing platforms, CI systems, and integration marketplaces. Getting them right early avoids painful retrofits when your first large customer asks for a delivery log and a replay button.
