Outbound webhook delivery at scale: signing payloads, retry budgets, and dead-letter operations
Ship customer-facing webhooks with at-least-once delivery, HMAC signing, bounded retries, idempotency-friendly event IDs, and operator workflows when endpoints stay broken.
Your product finally exposes webhooks. A customer wires POST /hooks/acme into their ERP, you enqueue order.created after checkout, and the first week feels like a win. Then finance opens a ticket: the same invoice was booked three times. Another customer swears they never received subscription.canceled even though your dashboard shows “delivered.” Meanwhile, on-call is paging because one broken endpoint—returning 500 for twelve hours—has consumed half your worker pool with retries. Outbound webhooks are a distributed system you operate, not a fetch() you fire from a controller and forget.
This article covers how to design at-least-once delivery with verifiable payloads, bounded retries, and operator-grade dead-letter handling. The patterns mirror what you need when consuming inbound webhooks from payment processors or shipping carriers, except now you are the sender, and your customers’ endpoints are the unreliable dependency.
Delivery semantics: what you promise (and what you do not)
Most SaaS webhooks are at-least-once: an event may arrive more than once, but it should not silently disappear if your side is healthy. That contract pushes complexity to both sides:
- Your platform must persist delivery attempts, retry on transient failures, and surface permanent failures.
- Your customer must treat handlers as idempotent (dedupe on event_id, use idempotency keys on side effects).
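On the consumer side, deduplicating on the event id can be quite small. The sketch below is illustrative only: `handleOnce` and `seenEventIds` are hypothetical names, and the in-memory Set stands in for a durable store with a unique constraint.

```typescript
// Hypothetical consumer-side dedupe: run the side effect at most once per event id.
// Assumption: an in-memory Set stands in for a durable store with a unique key.
type IncomingEvent = { id: string; type: string; data: unknown };

const seenEventIds = new Set<string>();

function handleOnce(
  event: IncomingEvent,
  sideEffect: (e: IncomingEvent) => void,
): boolean {
  if (seenEventIds.has(event.id)) {
    return false; // duplicate delivery: acknowledge it, but skip the side effect
  }
  sideEffect(event);
  seenEventIds.add(event.id); // record only after the side effect succeeded
  return true;
}
```

Recording the id after the side effect means a crash in between leads to a reprocess rather than a lost event, which is the trade-off at-least-once delivery already implies.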
At-most-once (fire-and-forget) is simpler but unacceptable for billing, inventory, or compliance events. Exactly-once end-to-end is a myth across HTTP; do not market it unless you control both endpoints and the storage layer.
Document the contract in your developer docs: HTTP methods, headers, retry schedule, timeout, signature algorithm, and which status codes count as success vs retry vs permanent failure.
Event model: stable IDs, versions, and envelopes
Every outbound message should be a versioned envelope with fields your customers can rely on for years:
| Field | Purpose |
|---|---|
| id | Globally unique event id (UUIDv7 or ULID) for idempotent consumption |
| type | Dot-separated name (order.created) with a published schema per type |
| api_version | Breaking payload changes bump this; old types may coexist during migration |
| created_at | ISO-8601 timestamp of when the event was recorded, not first delivery attempt |
| data | Type-specific payload; avoid nesting critical ids only inside data |
Generate id once when the business fact is committed, store it durably, and reuse it across retries. If you mint a new id per attempt, customers cannot dedupe—and you will eventually double-charge someone’s workflow.
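A minimal sketch of "mint once, reuse on every attempt" might look like the following. The helper names `recordEvent` and `bodyForAttempt` are hypothetical, the `evt_` prefix is a convention assumed for illustration, and `randomUUID()` (a UUIDv4) stands in for a UUIDv7 or ULID generator.

```typescript
import { randomUUID } from "node:crypto";

type WebhookEnvelope<T> = {
  id: string;
  type: string;
  api_version: string;
  created_at: string;
  data: T;
};

// Mint the id and timestamp once, when the business fact is committed.
// Assumption: randomUUID() (UUIDv4) stands in for a UUIDv7/ULID generator.
function recordEvent<T>(
  type: string,
  data: T,
  apiVersion = "2026-06-01",
): WebhookEnvelope<T> {
  return {
    id: `evt_${randomUUID()}`,
    type,
    api_version: apiVersion,
    created_at: new Date().toISOString(),
    data,
  };
}

// Every delivery attempt serializes the stored envelope; nothing is regenerated.
function bodyForAttempt<T>(stored: WebhookEnvelope<T>): string {
  return JSON.stringify(stored);
}
```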
Ordering: usually “none,” sometimes “per resource”
HTTP webhooks rarely guarantee global ordering. If order.updated can arrive before order.created, customers suffer. Mitigations:
- Per-resource sequence numbers (order_id + monotonic sequence) so consumers can buffer or reject gaps.
- Timestamps are not ordering: clock skew and retries make created_at a weak sort key.
- Document that unrelated resources are unordered; do not imply FIFO unless you implement it (partitioned outbox per aggregate_id).
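To make the per-resource sequence idea concrete, here is one way a consumer could buffer gaps. This is a sketch under assumptions: the field names `resourceId` and `sequence` are hypothetical, sequences are assumed gapless and to start at 1, and the buffer is held in memory.

```typescript
// Hypothetical consumer-side reordering buffer keyed by resource id.
// Assumption: each event carries `resourceId` and a gapless `sequence` starting at 1.
type SequencedEvent = { resourceId: string; sequence: number; type: string };

class SequenceBuffer {
  private nextExpected = new Map<string, number>();
  private held = new Map<string, Map<number, SequencedEvent>>();

  // Returns the events that are now safe to apply, in sequence order.
  accept(event: SequencedEvent): SequencedEvent[] {
    const expected = this.nextExpected.get(event.resourceId) ?? 1;
    if (event.sequence < expected) return []; // duplicate or stale: drop it

    const pending =
      this.held.get(event.resourceId) ?? new Map<number, SequencedEvent>();
    pending.set(event.sequence, event);
    this.held.set(event.resourceId, pending);

    // Flush the contiguous run that starts at the expected sequence number.
    const ready: SequencedEvent[] = [];
    let next = expected;
    for (let e = pending.get(next); e !== undefined; e = pending.get(next)) {
      ready.push(e);
      pending.delete(next);
      next += 1;
    }
    this.nextExpected.set(event.resourceId, next);
    return ready;
  }
}
```

If order.updated (sequence 2) arrives first, it is held; once order.created (sequence 1) arrives, both are released in order.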
In consulting engagements, the expensive incidents almost always trace back to implicit ordering assumptions in the customer’s handler, not to TLS or JSON.
Signing payloads: HMAC over the raw body
Customers must verify that events originated from you and were not tampered with in transit. The industry-standard pattern is HMAC-SHA256 over the exact bytes of the request body, with a per-endpoint secret.
Typical headers:
- X-Webhook-Id: same as the envelope id
- X-Webhook-Timestamp: Unix seconds when the delivery attempt was signed (used for replay windows on the receiver)
- X-Webhook-Signature: v1=<hex> or t=<ts>,v1=<hex> style
Signing steps:
- Read the serialized JSON body as a Buffer or Uint8Array, so there is no pretty-print drift between sign and send.
- Construct the signed string (example): ${timestamp}.${bodyUtf8}.
- HMAC-SHA256(secret, signed_string) → hex digest.
- Send the same bytes you signed.
Rotate secrets with dual-active verification windows: issue whsec_new, accept signatures from either secret for 7–14 days, then revoke the old secret. Never log secrets or include them in error payloads.
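For the receiving side, verification with a replay window and a dual-secret rotation period could be sketched as follows. It assumes the t=<ts>,v1=<hex> header style and the ${timestamp}.${body} signed string from the steps above; the function name `verifySignature` and the 300-second tolerance are illustrative choices, not a prescribed contract.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SEC = 300; // assumed replay window; pick and document your own

// Verifies a `t=<ts>,v1=<hex>` header against the raw request body.
// `secrets` holds every currently active secret (old and new during rotation).
function verifySignature(
  header: string,
  rawBody: Buffer,
  secrets: string[],
  nowSec: number = Math.floor(Date.now() / 1000),
): boolean {
  const parts = new Map<string, string>();
  for (const piece of header.split(",")) {
    const idx = piece.indexOf("=");
    if (idx > 0) parts.set(piece.slice(0, idx).trim(), piece.slice(idx + 1).trim());
  }
  const timestamp = Number(parts.get("t"));
  const provided = parts.get("v1");
  if (!Number.isInteger(timestamp) || provided === undefined) return false;
  if (Math.abs(nowSec - timestamp) > TOLERANCE_SEC) return false; // outside replay window

  // Sign the exact received bytes; never re-serialize parsed JSON.
  const signed = Buffer.concat([Buffer.from(`${timestamp}.`, "utf8"), rawBody]);
  const providedBytes = Buffer.from(provided, "hex");
  return secrets.some((secret) => {
    const expected = createHmac("sha256", secret).update(signed).digest();
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return expected.length === providedBytes.length && timingSafeEqual(expected, providedBytes);
  });
}
```

During rotation the receiver passes both secrets; after the window closes, it passes only the new one.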
Retry policy: backoff, budgets, and classification
Retries turn a blip into a storm unless bounded. A practical default for B2B webhooks:
| Outcome | HTTP status / condition | Action |
|---|---|---|
| Success | 2xx within timeout | Mark delivered; stop |
| Retryable | 408, 429, 5xx, connect timeout, reset | Schedule retry with backoff + jitter |
| Non-retryable | 400, 401, 403, 404, 410, malformed URL | Dead-letter immediately |
| Ambiguous | 2xx but body says error (avoid this) | Treat as success only if you document it |
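The table can be expressed as a small classifier. This is a sketch: `classifyAttempt` is a hypothetical name, `null` represents a network-level failure with no HTTP response, and treating every status not listed as retryable (including 3xx) as permanent is an assumption you may want to revisit.

```typescript
type Outcome = "delivered" | "retry" | "dead_letter";

// Maps an HTTP status (or null for a network-level failure) to the table's action.
function classifyAttempt(status: number | null): Outcome {
  if (status === null) return "retry"; // connect timeout, reset, DNS failure
  if (status >= 200 && status < 300) return "delivered";
  if (status === 408 || status === 429 || status >= 500) return "retry";
  // Assumption: everything else is permanent, including 400, 401, 403, 404, 410.
  return "dead_letter";
}
```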
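Use exponential backoff with full jitter: delay = random(0, min(cap, base * 2^attempt)). An example schedule with base=60s, cap=24h, max_attempts=12 keeps the total retry window to at most about two and a half days, and typically less because jitter shortens most delays. That is enough for maintenance windows without infinite load.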
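The arithmetic behind that schedule can be checked with a few lines. The sketch below assumes retries are numbered 1 through max_attempts - 1 (starting at 0 instead gives a shorter window); the function names are illustrative.

```typescript
// Upper bound of the full-jitter delay before retry number `attempt`.
function maxDelaySeconds(attempt: number, baseSec: number, capSec: number): number {
  return Math.min(capSec, baseSec * 2 ** attempt);
}

// Full jitter: uniform in [0, upper bound). `random` is injectable for tests.
function jitteredDelaySeconds(
  attempt: number,
  baseSec: number,
  capSec: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * maxDelaySeconds(attempt, baseSec, capSec));
}

// Worst-case total wait, assuming retries are numbered 1..maxAttempts-1
// (an indexing assumption; starting at 0 gives a shorter window).
function worstCaseWindowSeconds(baseSec: number, capSec: number, maxAttempts: number): number {
  let total = 0;
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    total += maxDelaySeconds(attempt, baseSec, capSec);
  }
  return total;
}
```

With base=60, cap=86_400, and max_attempts=12, `worstCaseWindowSeconds` returns 209,160 seconds, about 58 hours. Because each delay is drawn uniformly below its bound, the expected window is roughly half of that.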
Per-endpoint circuit breaking
If an endpoint fails continuously, pause new delivery attempts after N consecutive failures or M attempts in an hour. Surface a “disabled due to failures” state in the customer dashboard and email the admin. Without this, one bad URL becomes a denial-of-service against your own workers.
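A consecutive-failure breaker is one simple way to implement the "N consecutive failures" rule. The class below is a hypothetical in-memory sketch; a production version would persist the state per endpoint and hook the trip event into the dashboard and email notification.

```typescript
// Hypothetical per-endpoint breaker: opens after N consecutive failures.
// Assumption: state lives in memory; a real system would persist it per endpoint.
class EndpointBreaker {
  private consecutiveFailures = new Map<string, number>();
  private disabled = new Set<string>();
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  recordSuccess(endpointId: string): void {
    this.consecutiveFailures.set(endpointId, 0);
  }

  // Returns true when this failure trips the breaker.
  recordFailure(endpointId: string): boolean {
    const count = (this.consecutiveFailures.get(endpointId) ?? 0) + 1;
    this.consecutiveFailures.set(endpointId, count);
    if (count >= this.threshold && !this.disabled.has(endpointId)) {
      this.disabled.add(endpointId); // caller should notify the customer admin here
      return true;
    }
    return false;
  }

  canDeliver(endpointId: string): boolean {
    return !this.disabled.has(endpointId);
  }

  // Called when the customer re-enables the endpoint from the dashboard.
  reenable(endpointId: string): void {
    this.disabled.delete(endpointId);
    this.consecutiveFailures.set(endpointId, 0);
  }
}
```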
Concurrency and fairness
Shard work by endpoint_id so one slow customer cannot starve others. Cap in-flight deliveries per endpoint (for example, 5) and global worker concurrency separately.
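The two caps can be tracked independently, as in this sketch. `InFlightLimiter` is a hypothetical, single-process helper; across a worker fleet the counters would need to live in shared storage.

```typescript
// Hypothetical in-flight limiter: a per-endpoint cap plus a separate global cap.
class InFlightLimiter {
  private perEndpoint = new Map<string, number>();
  private total = 0;
  private readonly perEndpointCap: number;
  private readonly globalCap: number;

  constructor(perEndpointCap: number, globalCap: number) {
    this.perEndpointCap = perEndpointCap;
    this.globalCap = globalCap;
  }

  // Returns true if the caller may start a delivery for this endpoint.
  tryAcquire(endpointId: string): boolean {
    const current = this.perEndpoint.get(endpointId) ?? 0;
    if (current >= this.perEndpointCap || this.total >= this.globalCap) return false;
    this.perEndpoint.set(endpointId, current + 1);
    this.total += 1;
    return true;
  }

  // Must be called once per successful tryAcquire, when the attempt finishes.
  release(endpointId: string): void {
    const current = this.perEndpoint.get(endpointId) ?? 0;
    if (current === 0) return; // nothing held: ignore an unbalanced release
    this.perEndpoint.set(endpointId, current - 1);
    this.total -= 1;
  }
}
```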
Persistence: outbox, delivery log, and dead letters
The durable spine is usually three tables (names vary):
- Events: immutable facts (id, type, payload, created_at).
- Deliveries: one row per (event_id, endpoint_id) with a state machine: pending → delivering → delivered | failed | dead_lettered.
- Attempts: append-only log of each HTTP try (status, duration, error snippet, next_retry_at).
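Encoding the Deliveries state machine as data makes illegal transitions easy to reject. In this sketch, the edges back to pending (a scheduled retry from delivering, and an operator replay from failed or dead_lettered) are assumptions added to the arrows listed above.

```typescript
type DeliveryState = "pending" | "delivering" | "delivered" | "failed" | "dead_lettered";

// Allowed transitions. Assumptions beyond the list above: a scheduled retry moves
// delivering -> pending, and an operator replay moves failed/dead_lettered -> pending.
const ALLOWED: Record<DeliveryState, readonly DeliveryState[]> = {
  pending: ["delivering"],
  delivering: ["delivered", "failed", "dead_lettered", "pending"],
  delivered: [],
  failed: ["pending"],
  dead_lettered: ["pending"],
};

function canTransition(from: DeliveryState, to: DeliveryState): boolean {
  return ALLOWED[from].includes(to);
}
```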
This mirrors the transactional outbox: write the business row and the event in one DB transaction; a relay process fans out to subscribed endpoints. Use FOR UPDATE SKIP LOCKED when claiming pending deliveries so multiple workers scale horizontally without two of them claiming the same delivery row. PostgreSQL’s row locks make this straightforward.
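The "one transaction" part of the outbox might be sketched like this. Everything named here is an assumption for illustration: the `Tx` interface is a minimal stand-in for a PostgreSQL client bound to a single connection, and `orders` and `webhook_events` are hypothetical table names.

```typescript
// Minimal query interface (assumed shape), bound to a single DB connection.
type Tx = { query: (text: string, params?: unknown[]) => Promise<unknown> };

// Hypothetical helper: `orders` and `webhook_events` are assumed table names.
async function createOrderWithEvent(
  tx: Tx,
  order: { id: string; totalCents: number },
  eventId: string,
): Promise<void> {
  await tx.query("BEGIN");
  try {
    await tx.query("INSERT INTO orders (id, total_cents) VALUES ($1, $2)", [
      order.id,
      order.totalCents,
    ]);
    await tx.query(
      "INSERT INTO webhook_events (id, type, payload) VALUES ($1, $2, $3)",
      [eventId, "order.created", JSON.stringify({ order_id: order.id })],
    );
    await tx.query("COMMIT"); // both rows become visible together
  } catch (err) {
    await tx.query("ROLLBACK"); // neither the order nor the event is kept
    throw err;
  }
}
```

Because the event row commits with the order row, the relay can never see an event for an order that was rolled back, and never miss one for an order that committed.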
Dead-letter when retries are exhausted or the failure is permanent. Operators need:
- Search by endpoint_id, event_type, time range
- Manual replay (single event or batch) after the customer fixes their handler
- Payload redaction in UI for PII-heavy events
Observability: metrics that prevent surprise pages
Instrument at minimum:
- webhook_delivery_attempts_total{endpoint, outcome}
- webhook_delivery_latency_seconds{endpoint} histogram
- webhook_dead_lettered_total{endpoint, reason}
- webhook_queue_depth gauge
Alert on dead-letter rate and oldest undelivered event age, not on single 500s. Trace each attempt with the same event_id in logs and spans so support can answer “what happened to evt_…?” without raw SQL.
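The "oldest undelivered event age" signal is a simple computation over pending deliveries. The sketch below uses hypothetical names, and the 15-minute paging threshold is only a placeholder.

```typescript
type PendingDelivery = { eventId: string; eventCreatedAt: Date };

// Age in seconds of the oldest undelivered event; 0 when nothing is pending.
function oldestUndeliveredAgeSeconds(pending: PendingDelivery[], now: Date): number {
  if (pending.length === 0) return 0;
  const oldestMs = Math.min(...pending.map((d) => d.eventCreatedAt.getTime()));
  return Math.max(0, Math.floor((now.getTime() - oldestMs) / 1000));
}

// Hypothetical alert rule: page when the backlog is old, not on single failures.
function shouldPage(ageSeconds: number, thresholdSeconds = 900): boolean {
  return ageSeconds > thresholdSeconds;
}
```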
Practical example: delivery worker with signing and bounded retries
The following TypeScript sketch shows a claim → sign → POST → record attempt loop. It intentionally omits subscription management and UI; the focus is the delivery state machine customers feel in production.
import { createHmac, randomUUID } from "node:crypto";

type DeliveryRow = {
  deliveryId: string;
  eventId: string;
  endpointId: string;
  url: string;
  secret: string;
  attempt: number;
  payload: Record<string, unknown>;
};

type Sql = {
  query: <T>(text: string, params?: unknown[]) => Promise<{ rows: T[] }>;
};

const MAX_ATTEMPTS = 12;
const BASE_DELAY_SEC = 60;
const CAP_DELAY_SEC = 86_400;
const REQUEST_TIMEOUT_MS = 10_000;

// Builds the versioned envelope. createdAt must be the stored time the event was
// recorded, so the body stays identical across retries.
function envelope(
  eventId: string,
  type: string,
  createdAt: string,
  data: Record<string, unknown>,
) {
  return {
    id: eventId,
    type,
    api_version: "2026-06-01",
    created_at: createdAt,
    data,
  };
}

// Signs "<timestamp>.<body>" with HMAC-SHA256 and returns the header value.
function signBody(secret: string, timestamp: number, body: string): string {
  const signed = `${timestamp}.${body}`;
  const digest = createHmac("sha256", secret).update(signed).digest("hex");
  return `t=${timestamp},v1=${digest}`;
}

// Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)).
function retryDelaySeconds(attempt: number): number {
  const exp = Math.min(CAP_DELAY_SEC, BASE_DELAY_SEC * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || status >= 500;
}

// Claims one due delivery; SKIP LOCKED lets concurrent workers pick different rows.
export async function claimNextDelivery(db: Sql): Promise<DeliveryRow | null> {
  const { rows } = await db.query<DeliveryRow>(
    `UPDATE webhook_deliveries d
     SET state = 'delivering', locked_at = now(), lock_token = $1
     WHERE d.id = (
       SELECT id FROM webhook_deliveries
       WHERE state = 'pending'
         AND (next_retry_at IS NULL OR next_retry_at <= now())
       ORDER BY created_at
       FOR UPDATE SKIP LOCKED
       LIMIT 1
     )
     RETURNING d.id AS "deliveryId", d.event_id AS "eventId",
               d.endpoint_id AS "endpointId", d.url, d.secret,
               d.attempt, d.payload`,
    [randomUUID()],
  );
  return rows[0] ?? null;
}

export async function deliverOne(db: Sql, row: DeliveryRow): Promise<void> {
  // Assumes the stored payload carries the event's type, created_at, and data.
  const bodyObj = envelope(
    row.eventId,
    String(row.payload.type),
    String(row.payload.created_at),
    row.payload.data as Record<string, unknown>,
  );
  // Serialize once; the signed string and the request body must be identical.
  const body = JSON.stringify(bodyObj);
  const timestamp = Math.floor(Date.now() / 1000);
  const signature = signBody(row.secret, timestamp, body);

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), REQUEST_TIMEOUT_MS);
  const startedAt = Date.now();
  let httpStatus = 0; // 0 means no HTTP response (timeout, reset, DNS failure)
  let errorMessage: string | null = null;

  try {
    const res = await fetch(row.url, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "user-agent": "Acme-Webhooks/1.0",
        "x-webhook-id": row.eventId,
        "x-webhook-timestamp": String(timestamp),
        "x-webhook-signature": signature,
      },
      body,
      signal: controller.signal,
    });
    httpStatus = res.status;
    if (!res.ok) errorMessage = `HTTP ${res.status}`;
  } catch (e) {
    errorMessage = e instanceof Error ? e.message : "network error";
  } finally {
    clearTimeout(timer);
  }
  const durationMs = Date.now() - startedAt;

  const success = httpStatus >= 200 && httpStatus < 300;
  const retryable = !success && (httpStatus === 0 || isRetryableStatus(httpStatus));
  const nextAttempt = row.attempt + 1;

  // Append-only attempt log: one row per HTTP try, whatever the outcome.
  await db.query(
    `INSERT INTO webhook_delivery_attempts
       (delivery_id, attempt, http_status, error, duration_ms)
     VALUES ($1, $2, $3, $4, $5)`,
    [row.deliveryId, nextAttempt, httpStatus || null, errorMessage, durationMs],
  );

  if (success) {
    await db.query(
      `UPDATE webhook_deliveries
       SET state = 'delivered', delivered_at = now(), attempt = $2
       WHERE id = $1`,
      [row.deliveryId, nextAttempt],
    );
    return;
  }

  // Permanent failure or exhausted retry budget: park the delivery for operators.
  if (!retryable || nextAttempt >= MAX_ATTEMPTS) {
    await db.query(
      `UPDATE webhook_deliveries
       SET state = 'dead_lettered', attempt = $2, last_error = $3
       WHERE id = $1`,
      [row.deliveryId, nextAttempt, errorMessage],
    );
    return;
  }

  // Transient failure: return the row to pending with a jittered retry time.
  const delaySec = retryDelaySeconds(nextAttempt);
  await db.query(
    `UPDATE webhook_deliveries
     SET state = 'pending',
         attempt = $2,
         next_retry_at = now() + ($3 || ' seconds')::interval,
         last_error = $4
     WHERE id = $1`,
    [row.deliveryId, nextAttempt, String(delaySec), errorMessage],
  );
}
Wire this behind a small worker fleet (or a cron-driven loop with concurrency limits). Expose replay by resetting state to pending, clearing next_retry_at, and optionally bumping a replay_count for audit.
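The replay statement could be built along these lines. This is a sketch with labeled assumptions: `replay_count` is taken to be an integer column on webhook_deliveries, and resetting attempt to 0 (so the replay gets a fresh retry budget) is a choice added here, not something the worker requires.

```typescript
type ReplayQuery = { text: string; params: unknown[] };

// Builds the replay statement described above. Assumptions: `replay_count` is an
// integer column, and resetting `attempt` gives the replay a fresh retry budget.
function buildReplayQuery(deliveryId: string): ReplayQuery {
  return {
    text: `UPDATE webhook_deliveries
           SET state = 'pending',
               next_retry_at = NULL,
               attempt = 0,
               replay_count = replay_count + 1
           WHERE id = $1
             AND state IN ('dead_lettered', 'failed')`,
    params: [deliveryId],
  };
}
```

Restricting the WHERE clause to terminal failure states keeps an operator from accidentally re-sending a delivery that is already in flight or delivered.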
Common mistakes and pitfalls
- New event id per retry: breaks customer idempotency; one business fact must map to one id.
- Signing parsed-then-reserialized JSON: whitespace and key order change the digest; sign the bytes you send.
- Retrying 400 forever: wastes capacity; dead-letter and notify the customer.
- No per-endpoint circuit breaker: one bad URL degrades delivery for everyone.
- Treating 2xx with an error JSON body as failure: pick one rule; mixing confuses retries.
- Unbounded worker concurrency: duplicates effort under load; use SKIP LOCKED claims and per-endpoint caps.
- Missing operator replay: without it, support resets rows by hand in production.
- Omitting 410 Gone handling: if a customer deletes an endpoint, stop retrying immediately.
Conclusion
Outbound webhooks reward the same discipline as payment APIs: durable events, cryptographic authenticity, honest retry semantics, and visibility when delivery fails for good. At-least-once delivery is the right default if you pair it with stable event ids and clear documentation; bounded backoff and circuit breaking keep one broken endpoint from becoming your incident.
Teams shipping integrations for the first time often underestimate the operations surface—dead letters, replays, and signing rotations are where production maturity shows up. Getting this layer right is high-leverage work when you are building platforms that other engineering teams depend on, whether in-house or as part of helping a client harden a B2B API for real customer endpoints.