Reliable webhook emitters: delivery guarantees, signing, and observability

How to build webhook senders that partners can trust: outbox-backed delivery, HMAC signing, exponential backoff, endpoint health, and metrics that surface silent failures.

Author: Matheus Palma · 9 min read
Tags: Software engineering, Architecture, Backend, API design, Reliability, Webhooks

Your SaaS product exposes webhooks so customers can react to order.created and subscription.cancelled in their own systems. A partner’s endpoint returns 503 for six hours during a deploy. Your worker keeps retrying; their queue fills; your support inbox fills faster. Meanwhile another customer rotated their signing secret in the dashboard but never updated their verifier—every delivery fails with 401, and they blame your platform for “missing events.” In consulting engagements, emitting webhooks is where teams discover that “fire HTTP POST and forget” is not a product feature—it is a small distributed system with its own SLOs, security model, and on-call surface.

This article covers the sender side: how to persist intent, sign payloads, schedule retries without amplifying outages, and instrument delivery so operators and customers can see what happened. It complements receiver-side verification and idempotency; together they form a contract both parties can implement against.

Why synchronous HTTP from the request path fails

The tempting design is straightforward: after a business transaction commits, fetch(customer.webhookUrl, { method: "POST", body }) from the same handler. That breaks under real conditions:

  • Partial failure. The HTTP call times out after the customer’s server already processed the event. Your code may retry from the request thread, duplicate from a crash recovery job, or never retry at all.
  • Customer downtime. A slow or failing endpoint blocks your workers or lengthens user-facing latency if delivery is inline.
  • No audit trail. Support cannot answer “did you send event evt_abc?” without log archaeology.
  • Thundering retries. Naive fixed-interval retries hammer a recovering endpoint and can trigger rate limits or circuit breakers on both sides.

Production emitters decouple business commits from HTTP delivery. The transactional outbox pattern (covered elsewhere on this blog) is the usual foundation: the same database transaction that updates business state inserts a row describing the outbound webhook. A dedicated dispatcher turns outbox rows into signed HTTP requests with controlled concurrency and backoff.
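As a rough illustration of that atomic write, the sketch below uses a hypothetical in-memory `FakeDb` in place of a real database. The names `FakeDb`, `createOrder`, and the `evt_` prefix are assumptions made for the example; the point is only that the business row and the outbox row commit together or not at all.

```typescript
// Hypothetical in-memory stand-in for a relational database with transactions.
type Order = { id: string; totalCents: number };
type OutboxEntry = { eventId: string; eventType: string; payload: string };
type Tx = { orders: Order[]; outbox: OutboxEntry[] };

class FakeDb {
  orders: Order[] = [];
  outbox: OutboxEntry[] = [];

  // Runs `work` against staged copies and commits both tables together,
  // or neither if `work` throws.
  transaction(work: (tx: Tx) => void): void {
    const tx: Tx = { orders: [...this.orders], outbox: [...this.outbox] };
    work(tx); // a throw here leaves this.orders and this.outbox untouched
    this.orders = tx.orders;
    this.outbox = tx.outbox;
  }
}

// The business write and the webhook intent share one transaction.
function createOrder(db: FakeDb, order: Order): void {
  db.transaction((tx) => {
    tx.orders.push(order);
    tx.outbox.push({
      eventId: `evt_${order.id}`,
      eventType: "order.created",
      payload: JSON.stringify({ orderId: order.id, totalCents: order.totalCents }),
    });
  });
}
```

With a real database the same shape applies: both inserts run inside one transaction, and the dispatcher only ever reads committed outbox rows.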

Core architecture: outbox, dispatcher, delivery ledger

Outbox row: what to store

Each pending delivery should capture everything the dispatcher needs without re-querying fragile application state:

| Field | Purpose |
| --- | --- |
| event_id | Globally unique, stable identifier returned to customers and used for deduplication on their side |
| event_type | String discriminator (invoice.paid, etc.) |
| payload | Serialized JSON body (versioned schema) |
| destination_url | Snapshot at enqueue time, or reference resolved at dispatch |
| tenant_id / subscription_id | Scoping for secrets, rate limits, and dashboards |
| created_at | Ordering and lag metrics |
| next_attempt_at | Scheduler input for backoff |
| attempt_count | Cap retries and classify poison deliveries |

Store the URL and secret reference as they were when the event was created if customers can change endpoints mid-flight; otherwise a late config change can sign with the wrong secret or POST to a stale URL.
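To show how next_attempt_at and created_at drive the scheduler, here is a minimal sketch of selecting due rows. The `PendingDelivery` type and `selectDue` function are hypothetical names, and the in-memory filter stands in for what would normally be a database query (in PostgreSQL, for example, often combined with `FOR UPDATE SKIP LOCKED` so that concurrent dispatchers do not claim the same row).

```typescript
// Hypothetical in-memory view of the outbox columns the scheduler cares about.
type PendingDelivery = {
  eventId: string;
  tenantId: string;
  createdAt: number; // epoch milliseconds
  nextAttemptAt: number; // epoch milliseconds
  attemptCount: number;
};

// Returns deliveries whose next attempt is due, oldest first, up to `limit`.
// filter() produces a new array, so the caller's array is not reordered.
function selectDue(rows: PendingDelivery[], nowMs: number, limit: number): PendingDelivery[] {
  return rows
    .filter((row) => row.nextAttemptAt <= nowMs)
    .sort((a, b) => a.createdAt - b.createdAt)
    .slice(0, limit);
}
```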

Delivery ledger: attempts are first-class

Beyond the outbox, maintain an attempt log (table or append-only store) per delivery:

  • HTTP status, response body snippet (truncated), duration, error class (timeout, DNS, TLS)
  • Whether the attempt will be retried or marked terminal

Customers expect a delivery history UI in mature webhook products. Building the ledger from day one avoids retrofitting observability when the first enterprise deal asks for exportable logs.
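One possible shape for a ledger entry is sketched below. The field names, the 512-character snippet limit, and the error-code mapping are assumptions for illustration; the codes checked are common Node.js ones, and the mapping would need adjusting for whichever HTTP client you use.

```typescript
type ErrorClass = "timeout" | "dns" | "tls" | "network" | null;

type AttemptRecord = {
  eventId: string;
  attempt: number;
  httpStatus: number | null;
  bodySnippet: string;
  durationMs: number;
  errorClass: ErrorClass;
  willRetry: boolean;
};

const MAX_SNIPPET_CHARS = 512; // assumed truncation limit

// Maps a thrown error to a coarse class by inspecting its name and code.
// fetch() in Node often wraps the underlying error in `cause`, so check both.
function classifyError(err: unknown): ErrorClass {
  if (err === null || err === undefined) return null;
  const e = err as { name?: unknown; code?: unknown; cause?: { code?: unknown } };
  const rawCode = e.code ?? e.cause?.code;
  const code = typeof rawCode === "string" ? rawCode : "";
  if (e.name === "AbortError" || e.name === "TimeoutError" || code === "ETIMEDOUT") return "timeout";
  if (code === "ENOTFOUND" || code === "EAI_AGAIN") return "dns";
  if (code.startsWith("CERT_") || code.startsWith("ERR_TLS_")) return "tls";
  return "network";
}

function buildAttemptRecord(input: {
  eventId: string;
  attempt: number;
  httpStatus: number | null;
  responseBody: string;
  durationMs: number;
  error: unknown;
  willRetry: boolean;
}): AttemptRecord {
  return {
    eventId: input.eventId,
    attempt: input.attempt,
    httpStatus: input.httpStatus,
    bodySnippet: input.responseBody.slice(0, MAX_SNIPPET_CHARS), // never store unbounded bodies
    durationMs: input.durationMs,
    errorClass: classifyError(input.error),
    willRetry: input.willRetry,
  };
}
```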

Signing: contract the receiver can implement

Sign the exact bytes the customer will receive. Common pattern:

  1. Serialize JSON payload to a UTF-8 buffer (stable key order if you document it; many providers sign raw body as sent).
  2. Include a Unix timestamp in a header (X-Webhook-Timestamp) and in the signed material.
  3. Compute HMAC-SHA256(secret, timestamp + "." + rawBody) and send as X-Webhook-Signature: v1=<hex>.

Document:

  • Which headers are included in the signature
  • Maximum clock skew you expect receivers to allow
  • How secret rotation works (support two active secrets during overlap)
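The signing steps and the documented rules above can be sketched from both sides of the contract. In this example the function names, the 300-second default tolerance, and the convention of passing the current secret first are assumptions; the signed material follows the timestamp + "." + rawBody layout described above.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sender side: HMAC-SHA256 over `${timestamp}.` followed by the raw body bytes.
function signPayload(secret: string, timestamp: number, rawBody: Buffer): string {
  const digest = createHmac("sha256", secret)
    .update(`${timestamp}.`)
    .update(rawBody)
    .digest("hex");
  return `v1=${digest}`;
}

// Receiver side: reject stale timestamps, then accept a match against any
// currently active secret so deliveries keep verifying during rotation.
function verifySignature(opts: {
  secrets: string[]; // assumed order: current secret first, previous one during overlap
  timestamp: number; // value of X-Webhook-Timestamp, in Unix seconds
  rawBody: Buffer;
  signatureHeader: string; // value of X-Webhook-Signature
  nowSeconds: number;
  toleranceSeconds?: number;
}): boolean {
  const tolerance = opts.toleranceSeconds ?? 300; // assumed default clock skew allowance
  if (Math.abs(opts.nowSeconds - opts.timestamp) > tolerance) return false;
  const received = Buffer.from(opts.signatureHeader);
  return opts.secrets.some((secret) => {
    const expected = Buffer.from(signPayload(secret, opts.timestamp, opts.rawBody));
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return expected.length === received.length && timingSafeEqual(expected, received);
  });
}
```

Accepting either secret during the overlap window lets a customer deploy the new secret to their verifier before or after the sender switches, without dropping deliveries in between.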

Never log the full secret or unsanitized payload containing PII in application logs.

Retry policy: classify outcomes before retrying

Not every non-2xx should retry the same way:

| Response | Typical action |
| --- | --- |
| 2xx | Success; mark delivered |
| 408, 429, 5xx, network timeout | Retry with backoff + jitter |
| 401, 403 | Likely misconfiguration; retry slowly, alert customer, disable after N failures |
| 404, 410 | Endpoint gone; pause or disable subscription after policy threshold |
| 400, 422 | Often permanent payload mismatch; do not infinite-retry; surface to your team |

Exponential backoff with full jitter reduces synchronized retry storms when many deliveries target the same failing host:

function nextAttemptDelayMs(attempt: number, baseMs = 1000, capMs = 3600_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

Cap the total number of attempts or the total retry window (for example, 24 hours of retries) and route exhausted deliveries to a dead-letter state that is visible in the dashboard rather than silently deleting them.
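A cap like this might be enforced at scheduling time, as in the following sketch. The constants (a 24-hour window and 20 attempts) and the names `decideRetry` and `RetryDecision` are illustrative assumptions; the delay is passed in so the decision stays deterministic regardless of jitter.

```typescript
const RETRY_BUDGET_MS = 24 * 60 * 60 * 1000; // assumed 24-hour retry window
const MAX_ATTEMPTS = 20; // assumed hard cap on attempts

type RetryDecision =
  | { action: "retry"; nextAttemptAt: number }
  | { action: "dead_letter"; reason: "attempts_exhausted" | "budget_exhausted" };

// Decides whether a failed delivery is rescheduled or moved to dead-letter.
// `attemptCount` is the number of attempts already made, including the one that just failed.
function decideRetry(
  createdAtMs: number,
  attemptCount: number,
  nowMs: number,
  delayMs: number,
): RetryDecision {
  if (attemptCount >= MAX_ATTEMPTS) {
    return { action: "dead_letter", reason: "attempts_exhausted" };
  }
  const nextAttemptAt = nowMs + delayMs;
  // Stop if the next attempt would land outside the window measured from enqueue time.
  if (nextAttemptAt - createdAtMs > RETRY_BUDGET_MS) {
    return { action: "dead_letter", reason: "budget_exhausted" };
  }
  return { action: "retry", nextAttemptAt };
}
```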

Concurrency and fairness

Per-destination concurrency limits (token bucket or fixed worker slots) protect customers from burst traffic after an outage and protect your egress IPs from looking like a denial-of-service attack. Global limits protect your dispatcher fleet.
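The fixed-worker-slots variant can be sketched as a small in-process counter per host. `DestinationSlots` is a hypothetical name, and a fleet of several dispatcher processes would need shared state (for instance in the database or a cache) rather than this single-process map.

```typescript
// Tracks in-flight deliveries per destination host within one dispatcher process.
class DestinationSlots {
  private inFlight = new Map<string, number>();
  private readonly maxPerHost: number;

  constructor(maxPerHost: number) {
    this.maxPerHost = maxPerHost;
  }

  // Returns true and takes a slot if the host is under its limit; otherwise false.
  tryAcquire(host: string): boolean {
    const current = this.inFlight.get(host) ?? 0;
    if (current >= this.maxPerHost) return false;
    this.inFlight.set(host, current + 1);
    return true;
  }

  // Frees a slot; call from a finally block once the HTTP attempt settles.
  release(host: string): void {
    const current = this.inFlight.get(host) ?? 0;
    if (current <= 1) this.inFlight.delete(host);
    else this.inFlight.set(host, current - 1);
  }
}
```

A delivery that fails to acquire a slot is simply left in the queue for a later pass, which keeps a recovering endpoint from receiving its whole backlog at once.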

Fair scheduling matters in multi-tenant systems: one customer’s broken endpoint must not starve others. Partition work queues by tenant_id or use weighted fair queuing when backlog grows.
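One simple form of fair scheduling is round-robin interleaving across tenants, sketched below. This is an unweighted simplification of weighted fair queuing, and `interleaveByTenant` and `QueuedDelivery` are hypothetical names.

```typescript
type QueuedDelivery = { eventId: string; tenantId: string };

// Interleaves deliveries round-robin across tenants while preserving each
// tenant's own ordering, so a large backlog for one tenant cannot push
// every other tenant's deliveries to the back of the batch.
function interleaveByTenant(rows: QueuedDelivery[]): QueuedDelivery[] {
  const queues = new Map<string, QueuedDelivery[]>();
  for (const row of rows) {
    const queue = queues.get(row.tenantId);
    if (queue) queue.push(row);
    else queues.set(row.tenantId, [row]);
  }

  const tenantQueues = Array.from(queues.values());
  const result: QueuedDelivery[] = [];
  let remaining = rows.length;
  while (remaining > 0) {
    // One pass takes at most one delivery from each tenant.
    for (const queue of tenantQueues) {
      const next = queue.shift();
      if (next) {
        result.push(next);
        remaining -= 1;
      }
    }
  }
  return result;
}
```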

Endpoint health and automatic disable

Track rolling failure rates per subscription. After sustained 4xx/5xx/timeout streaks:

  1. Pause new delivery attempts (outbox rows stay queued).
  2. Email or in-app notify the customer with the last error snippet and docs link.
  3. Provide a test delivery button that sends a synthetic ping event without waiting for real traffic.

Re-enable on successful test delivery or explicit customer action. Automatic disable is preferable to burning through retry budgets and eroding trust (“your webhooks stopped months ago and we never noticed”).
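The pause and re-enable flow might look like the following sketch. It tracks a consecutive-failure streak rather than a true rolling failure rate, which is a deliberate simplification, and the class name and threshold parameter are assumptions.

```typescript
type HealthState = "active" | "paused";

// Tracks a failure streak for one subscription and pauses it at a threshold.
class SubscriptionHealth {
  private consecutiveFailures = 0;
  private state: HealthState = "active";
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Returns true only when this failure tips the subscription into "paused",
  // which is the moment to notify the customer.
  recordFailure(): boolean {
    this.consecutiveFailures += 1;
    if (this.state === "active" && this.consecutiveFailures >= this.threshold) {
      this.state = "paused";
      return true;
    }
    return false;
  }

  // Any successful delivery resets the streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // Call after a successful test delivery or an explicit customer action.
  reenable(): void {
    this.consecutiveFailures = 0;
    this.state = "active";
  }

  // The dispatcher checks this before attempting; paused rows stay queued.
  canDispatch(): boolean {
    return this.state === "active";
  }
}
```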

Observability: metrics and support workflows

Emit metrics your on-call can page on and your customers can see in aggregate:

  • webhook_delivery_attempt_total{result,event_type}
  • webhook_delivery_lag_seconds (time from outbox insert to first successful 2xx)
  • webhook_dispatch_queue_depth
  • webhook_disabled_subscriptions_total

Structured logs should always include event_id, tenant_id, destination_host (not full URL with secrets in query strings), attempt, and http_status.
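A log entry with those fields could be built as in this sketch. The `toDeliveryLog` helper and its input shape are hypothetical; the one detail that matters is deriving destination_host with the standard `URL` parser so that paths, query strings, and embedded credentials never reach the logs.

```typescript
type DeliveryLog = {
  event_id: string;
  tenant_id: string;
  destination_host: string;
  attempt: number;
  http_status: number | null;
};

// Builds a structured log entry. Only the host (and port, if any) of the
// destination is kept; path, query string, and userinfo are dropped.
function toDeliveryLog(input: {
  eventId: string;
  tenantId: string;
  destinationUrl: string;
  attempt: number;
  httpStatus: number | null;
}): DeliveryLog {
  return {
    event_id: input.eventId,
    tenant_id: input.tenantId,
    destination_host: new URL(input.destinationUrl).host,
    attempt: input.attempt,
    http_status: input.httpStatus,
  };
}
```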

For support, lookup by event_id must return payload hash, signing version, attempt timeline, and whether the customer acknowledged receipt (if you implement optional callback headers).

Practical example: dispatcher worker sketch

The following TypeScript sketch ties outbox polling, signing, and classified retries into one worker loop. It is illustrative—not a drop-in library—but matches patterns used in production webhook products.

import crypto from "node:crypto";

type OutboxRow = {
  eventId: string;
  eventType: string;
  payload: Record<string, unknown>;
  destinationUrl: string;
  signingSecret: string;
  attemptCount: number;
};

type AttemptResult =
  | { kind: "success" }
  | { kind: "retry"; reason: string }
  | { kind: "terminal"; reason: string };

function signBody(secret: string, rawBody: Buffer, timestamp: number): string {
  // Feed the timestamp prefix and the raw bytes to the HMAC directly so the
  // signature covers exactly the bytes sent, with no string round-trip.
  const digest = crypto
    .createHmac("sha256", secret)
    .update(`${timestamp}.`)
    .update(rawBody)
    .digest("hex");
  return `v1=${digest}`;
}

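function classify(status: number | null, _error: unknown): AttemptResult {
  // 2xx: delivered.
  if (status !== null && status >= 200 && status < 300) return { kind: "success" };
  // Simplification: auth and "gone" responses are terminal on the first failure here.
  // A production policy may retry slowly and disable only after N failures.
  if (status === 401 || status === 403) return { kind: "terminal", reason: "auth" };
  if (status === 404 || status === 410) return { kind: "terminal", reason: "gone" };
  // Payload rejected by the receiver; resending the same bytes will not help.
  if (status === 400 || status === 422) return { kind: "terminal", reason: "client_reject" };
  // Everything else, including 408, 429, 5xx, and network errors, is treated as transient.
  return { kind: "retry", reason: status === null ? "network" : `http_${status}` };
}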

export async function dispatchOnce(row: OutboxRow): Promise<AttemptResult> {
  const rawBody = Buffer.from(JSON.stringify(row.payload));
  const timestamp = Math.floor(Date.now() / 1000);
  const signature = signBody(row.signingSecret, rawBody, timestamp);

  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 10_000);

  let status: number | null = null;
  try {
    const res = await fetch(row.destinationUrl, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Webhook-Id": row.eventId,
        "X-Webhook-Timestamp": String(timestamp),
        "X-Webhook-Signature": signature,
        "User-Agent": "MyProduct-Webhooks/1.0",
      },
      body: rawBody,
      signal: controller.signal,
    });
    status = res.status;
    const snippet = (await res.text()).slice(0, 512);
    await recordAttempt(row.eventId, { status, snippet, attempt: row.attemptCount + 1 });
  } catch (err) {
    await recordAttempt(row.eventId, {
      status: null,
      snippet: String(err),
      attempt: row.attemptCount + 1,
    });
    return classify(null, err);
  } finally {
    clearTimeout(timeout);
  }

  return classify(status, null);
}

async function recordAttempt(
  eventId: string,
  detail: { status: number | null; snippet: string; attempt: number },
): Promise<void> {
  // Persist to delivery_ledger; implementation-specific.
  void eventId;
  void detail;
}

export async function runDispatcher(fetchBatch: () => Promise<OutboxRow[]>): Promise<void> {
  for (const row of await fetchBatch()) {
    const result = await dispatchOnce(row);
    if (result.kind === "success") {
      await markDelivered(row.eventId);
      continue;
    }
    if (result.kind === "terminal") {
      await markFailed(row.eventId, result.reason);
      await maybeDisableSubscription(row.destinationUrl, result.reason);
      continue;
    }
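    // Cap retries so poison deliveries reach a visible dead-letter state
    // (12 is an illustrative limit, not a recommendation).
    if (row.attemptCount + 1 >= 12) {
      await markFailed(row.eventId, "retries_exhausted");
      continue;
    }
    const delay = nextAttemptDelayMs(row.attemptCount);
    await scheduleRetry(row.eventId, row.attemptCount + 1, delay);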
  }
}

async function markDelivered(_eventId: string): Promise<void> {}
async function markFailed(_eventId: string, _reason: string): Promise<void> {}
async function maybeDisableSubscription(_url: string, _reason: string): Promise<void> {}
async function scheduleRetry(_eventId: string, _attempt: number, _delayMs: number): Promise<void> {}

Pair this emitter with idempotent receivers (document that X-Webhook-Id is the deduplication key) and with versioned event schemas so additive JSON fields do not break strict parsers.

When helping teams ship webhook products, the gap is rarely HMAC itself—it is lifecycle: rotation, disable rules, replay tooling for customers, and honest documentation of at-least-once semantics.

Common mistakes and pitfalls

Inline delivery from the API request. Couples user latency to customer infrastructure and loses messages on process crash. Always persist first.

Retrying every 4xx identically. Wastes resources and hides misconfiguration; 401 will not heal with aggressive backoff.

No jitter. Fixed or pure exponential schedules synchronize across deliveries and prolong incidents.

Changing payload shape without event_type versioning. Customers parse strictly; ship invoice.paid.v2 or document additive-only JSON rules.

Omitting delivery IDs. Receivers cannot dedupe; you get duplicate side effects and blame.

Unbounded response body logging. Customer error pages can be huge or contain secrets; truncate and redact.

Global disable without customer signal. Silent pause erodes trust; always notify and offer test hooks.

Conclusion

Reliable webhook emitters treat outbound HTTP as durable, signed, observable work—not a side effect of a successful transaction. Persist events in an outbox, dispatch asynchronously with classified retries and jitter, maintain a delivery ledger, and automate endpoint health so failures become visible before customers open tickets.

Key takeaways:

  • Decouple commit from HTTP; the outbox gives you atomic intent with the business write
  • Sign raw bodies with documented headers and support secret rotation
  • Classify HTTP outcomes; retry transient failures, terminalize config errors, and cap total attempts
  • Instrument lag, failures, and per-subscription health; build support tools around event_id

Teams integrating partner systems or exposing platform webhooks benefit from designing both sides of the contract early. If you are building scalable, production-ready notification or integration layers and want a second pair of eyes on architecture, get in touch—the receiver and emitter patterns are most valuable when they match.
