Outbound webhooks for SaaS APIs: signing, retries, and subscriber lifecycle

Design customer-facing webhook delivery with durable attempts, HMAC signing, exponential backoff, endpoint verification, and clear disable policies when subscribers go unhealthy.

By Matheus Palma. About an 11-minute read.
Software engineering, Backend, API design, Webhooks, PostgreSQL, Reliability, SaaS

Your product ships order.completed events to customer endpoints. One subscriber’s staging server returns 502 for twelve hours while their on-call is on vacation. Without guardrails, your delivery workers hammer the URL, your queue depth grows, and unrelated tenants see delayed notifications because shared workers are stuck in retry loops. In consulting engagements, outbound webhooks are where “we published an event” diverges from “the customer actually received it”—and where naive retry logic turns a single bad URL into a platform incident.

This article covers designing webhook delivery as a product surface: durable attempt records, signed payloads, backoff that respects subscriber health, endpoint verification at registration time, and lifecycle rules when endpoints stay broken. The patterns pair naturally with a transactional outbox or an append-only event log; they are the HTTP layer your customers integrate against, symmetric to how you would harden inbound webhook receivers.

Why outbound delivery is harder than “POST and forget”

Inbound webhooks (Stripe, GitHub, billing providers) arrive at your controlled endpoint. You verify signatures, deduplicate, and respond with precise status codes. Outbound webhooks invert the trust boundary: your system initiates HTTP to URLs customers control. That introduces constraints you do not fully own:

  • Availability — Customer servers go down, certificates expire, WAFs block your IP ranges, and rate limits differ per tenant.
  • Latency — Slow endpoints block workers unless you isolate concurrency per subscriber or use bounded thread pools.
  • Semantics — Customers expect at-least-once delivery with stable event IDs, verifiable signatures, and documented retry behavior.
  • Abuse surface — Registration flows that accept arbitrary URLs can become SSRF vectors if you “verify” endpoints by fetching internal addresses.

Treating delivery as “fire HTTP from the request thread after commit” fails on the first crash between commit and POST. Production systems decouple event creation from HTTP delivery and make every attempt observable.

Core architecture: events, deliveries, and attempts

A durable model with three layers keeps operations legible:

Layer      Responsibility
Event      Immutable fact: type, data, occurred_at, tenant/subscriber scope
Delivery   One logical send of one event to one subscriber endpoint (may have many attempts)
Attempt    Single HTTP try: request metadata, response code, duration, next retry time

Events are append-only. A delivery is created for each subscriber endpoint whose subscription (or fan-out rule) matches the event type. Attempts record each HTTP round trip.

This separation matters for support: “Was invoice.paid emitted?” is an event question; “Did Acme Corp’s endpoint accept it?” is a delivery question; “What did their server return on the third try?” is an attempt question.

Fan-out and the outbox

When the business transaction commits, insert the event—and optionally pre-create delivery rows—in the same database transaction as the state change (outbox style). A relay or worker process picks up pending deliveries, performs HTTP, and schedules retries.

Fan-out strategies:

  • Eager fan-out — Create one delivery row per matching subscriber at write time. Simple to query (“pending deliveries for subscriber X”) but more rows when you have thousands of endpoints per event type.
  • Lazy fan-out — Store the event once; a worker resolves subscribers at delivery time. Fewer rows, but subscriber list changes between emit and deliver need explicit policy (new subscribers typically do not retroactively receive old events unless you offer replay).

For most B2B SaaS products, eager fan-out with delivery rows is easier to operate and audit.
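As a rough illustration of eager fan-out, the sketch below turns one event into pending delivery rows. The type names, field names, and the `fanOutEager` helper are assumptions made for this example, not part of any particular schema; in an outbox design the returned rows would be inserted in the same transaction as the event.

```typescript
// Hypothetical shapes for illustration; field names are assumptions.
type WebhookEvent = { id: string; type: string; tenant_id: string };

type WebhookSubscriber = {
  id: string;
  tenant_id: string;
  endpoint_status: "active" | "disabled";
  event_types: string[]; // e.g. ["order.completed"]
};

type NewDelivery = {
  event_id: string;
  subscriber_id: string;
  status: "pending";
  attempt_count: number;
};

// Eager fan-out: one pending delivery row per active subscriber of the same
// tenant that is subscribed to this event type.
function fanOutEager(
  event: WebhookEvent,
  subscribers: WebhookSubscriber[]
): NewDelivery[] {
  return subscribers
    .filter(
      (s) =>
        s.tenant_id === event.tenant_id &&
        s.endpoint_status === "active" &&
        s.event_types.includes(event.type)
    )
    .map((s) => ({
      event_id: event.id,
      subscriber_id: s.id,
      status: "pending" as const,
      attempt_count: 0,
    }));
}
```

Disabled endpoints are skipped at write time here; whether they should instead accumulate rows for later replay is a policy choice this sketch does not settle.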

Signing payloads customers can verify

Customers must authenticate that a POST originated from you. Mirror industry practice (Stripe, GitHub, Svix):

  1. Serialize the body (JSON) as UTF-8 bytes.
  2. Include a timestamp in a header to bound replay windows on the customer side.
  3. Compute HMAC-SHA256(secret, "${timestamp}.${rawBody}") and send the digest in a signature header.
  4. Rotate signing secrets per subscriber (or per workspace) with overlap periods.

Document the exact header names, whether the timestamp is seconds or milliseconds, and that verification must use the raw body before JSON parsing—same rule as inbound verification.
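For documentation purposes, it can help to show what verification might look like on the customer side. This is a minimal sketch assuming the `v1=<hex>` signature format, a Unix-seconds timestamp, and a five-minute tolerance; the function name and the tolerance value are illustrative assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed convention: signature header "v1=<hex>", timestamp in Unix seconds,
// signed message "<timestamp>.<rawBody>".
function verifyWebhookSignature(
  secret: string,
  timestampHeader: string,
  signatureHeader: string,
  rawBody: string, // the exact bytes received, before any JSON parsing
  nowSec: number = Math.floor(Date.now() / 1000),
  toleranceSec: number = 300 // assumed replay window
): boolean {
  const timestamp = Number(timestampHeader);
  if (!Number.isInteger(timestamp)) return false;
  // Reject stale or far-future timestamps to bound replay.
  if (Math.abs(nowSec - timestamp) > toleranceSec) return false;

  const expected =
    "v1=" +
    createHmac("sha256", secret)
      .update(`${timestampHeader}.${rawBody}`)
      .digest("hex");

  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(signatureHeader, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The constant-time comparison avoids leaking how many leading characters of a guessed signature were correct.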

Secret rotation without breaking integrations

Store current_secret and previous_secret on the subscriber record. Sign with current_secret; verification on the customer side should accept either digest during the overlap window. Provide an API or dashboard action to rotate secrets and surface the new value once.
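One way to model the overlap window is sketched below. The `previous_valid_until` field, the 24-hour overlap, and the `whsec_` prefix are assumptions added for illustration; the source only specifies `current_secret` and `previous_secret`.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical subscriber secret fields; previous_valid_until is an assumption.
type SubscriberSecrets = {
  current_secret: string;
  previous_secret: string | null;
  previous_valid_until: Date | null;
};

const OVERLAP_MS = 24 * 60 * 60 * 1000; // assumed 24-hour overlap window

// Rotation demotes the current secret and generates a fresh one.
function rotateSecret(
  s: SubscriberSecrets,
  now: Date = new Date()
): SubscriberSecrets {
  return {
    current_secret: "whsec_" + randomBytes(32).toString("hex"),
    previous_secret: s.current_secret,
    previous_valid_until: new Date(now.getTime() + OVERLAP_MS),
  };
}

// Secrets a verifier should still accept at time `now`.
function acceptedSecrets(
  s: SubscriberSecrets,
  now: Date = new Date()
): string[] {
  const accepted = [s.current_secret];
  if (
    s.previous_secret !== null &&
    s.previous_valid_until !== null &&
    now.getTime() < s.previous_valid_until.getTime()
  ) {
    accepted.push(s.previous_secret);
  }
  return accepted;
}
```

Returning a new record instead of mutating the old one makes it easier to persist the rotation in a single update and to show the new secret exactly once.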

HTTP contract: what you send and what you expect

A practical envelope:

{
  "id": "evt_01j2k3m4n5p6",
  "type": "order.completed",
  "created_at": "2026-06-08T14:22:11.123Z",
  "data": {
    "order_id": "ord_789",
    "amount_cents": 4200
  }
}

Headers (example convention):

  • X-Webhook-Id — Same as id for quick log correlation.
  • X-Webhook-Timestamp — Unix seconds when you generated the attempt.
  • X-Webhook-Signature: v1=<hex>, the HMAC described above.
  • User-Agent — Identifiable product string (helps customers filter logs).

Success is usually any 2xx within a timeout you document (commonly 5–15 seconds). Some teams accept 409 Conflict when the customer deduplicates by event ID—document that explicitly if you support it.

Retry on: connection errors, timeouts, 408, 429, and 5xx. Do not retry on most 4xx except 408/429—a 400 or 401 will not heal with backoff; surface the failure and pause or disable the endpoint.
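The success and retry rules above can be folded into a single three-way classification. This is a sketch: the `accept409` flag is a hypothetical per-subscriber opt-in, and treating every remaining status (including 3xx) as a non-retryable failure is an assumption.

```typescript
type AttemptOutcome = "delivered" | "retry" | "fail";

// status is null when no HTTP response arrived (connection error or timeout).
// accept409 is a hypothetical opt-in for subscribers that answer 409 to
// events they have already processed.
function classifyAttempt(
  status: number | null,
  accept409: boolean = false
): AttemptOutcome {
  if (status === null) return "retry";
  if (status >= 200 && status < 300) return "delivered";
  if (status === 409 && accept409) return "delivered";
  if (status === 408 || status === 429 || status >= 500) return "retry";
  // Everything else (400, 401, 403, 404, ...) will not heal with backoff.
  return "fail";
}
```

Keeping the decision in one pure function makes the documented contract easy to test and easy to quote in customer-facing docs.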

Retry schedule: exponential backoff with jitter and caps

Unbounded retries to a dead endpoint waste capacity and annoy customers who see log floods when they fix DNS. A schedule that works well in production:

  • Base delay: start around 30–60 seconds after first failure (not instant—gives brief blips time to recover).
  • Exponential multiplier: double the delay ceiling after each failed attempt, capped at 24 hours between attempts.
  • Full jitter on each delay: sleep = random(0, min(cap, base * 2^attempt)) to desynchronize workers.
  • Max attempts: commonly 10–15 over ~48–72 hours, then mark delivery failed and stop until manual replay or subscriber fix.

Persist next_attempt_at on the delivery row so workers can SELECT ... WHERE status = 'pending' AND next_attempt_at <= now() ORDER BY next_attempt_at LIMIT N FOR UPDATE SKIP LOCKED—the same job-queue pattern used in outbox relays.
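To make the schedule concrete, here is a small worked sketch assuming a 60-second base and a 24-hour cap. The function names are illustrative, and the random source is injectable only so the arithmetic can be checked deterministically.

```typescript
const BASE_SEC = 60; // assumed base delay
const CAP_SEC = 86_400; // 24-hour cap between attempts

// Upper bound of the delay scheduled after `attempt` failed attempts.
function backoffCeilingSec(attempt: number): number {
  return Math.min(CAP_SEC, BASE_SEC * 2 ** attempt);
}

// Full jitter: a uniform draw in [0, ceiling).
function fullJitterDelaySec(
  attempt: number,
  rng: () => number = Math.random
): number {
  return Math.floor(rng() * backoffCeilingSec(attempt));
}

// Ceilings after 1, 2, 3, 10, 11, and 12 failed attempts.
const ceilings = [1, 2, 3, 10, 11, 12].map(backoffCeilingSec);
// ceilings is [120, 240, 480, 61440, 86400, 86400]
```

With these constants the ceiling starts at two minutes and reaches the 24-hour cap after the 11th failed attempt. Because full jitter draws uniformly below the ceiling, the average delay is about half the ceiling, which is worth factoring in when you document the total retry window.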

Per-subscriber concurrency limits

A bulkhead per subscriber prevents one slow endpoint from occupying all workers. A simple pattern: at most K in-flight HTTP requests per subscriber_id (often K=1–3). Additional deliveries for that subscriber wait in pending without blocking other tenants.
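A per-subscriber bulkhead can be as simple as a counter keyed by subscriber_id. The class below is an in-memory sketch with a hypothetical name; it only covers a single worker process, and a multi-process deployment would need a shared counter (in the database or a cache), which is out of scope here.

```typescript
// In-memory limiter: at most maxPerSubscriber in-flight requests per subscriber.
class SubscriberBulkhead {
  private readonly inFlight = new Map<string, number>();
  private readonly maxPerSubscriber: number;

  constructor(maxPerSubscriber: number) {
    this.maxPerSubscriber = maxPerSubscriber;
  }

  // Returns false when the subscriber is already at its limit; the caller
  // should leave the delivery pending and move on to other subscribers.
  tryAcquire(subscriberId: string): boolean {
    const current = this.inFlight.get(subscriberId) ?? 0;
    if (current >= this.maxPerSubscriber) return false;
    this.inFlight.set(subscriberId, current + 1);
    return true;
  }

  // Must be called once per successful tryAcquire, ideally in a finally block.
  release(subscriberId: string): void {
    const current = this.inFlight.get(subscriberId) ?? 0;
    if (current <= 1) this.inFlight.delete(subscriberId);
    else this.inFlight.set(subscriberId, current - 1);
  }
}
```

A worker would call `tryAcquire` before the HTTP request and `release` after it completes, so a slow endpoint can hold at most K slots while other tenants keep flowing.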

Subscriber lifecycle: verification, health, and disable policies

Endpoint verification at registration

When a customer registers https://api.acme.com/hooks/orders, do not immediately trust it. Common verification flow:

  1. Generate a one-time challenge token.
  2. Ask the customer to echo it in a response to a GET or POST verification request, or require them to receive a signed endpoint.verification test event and return 200.
  3. Only then set endpoint_status = 'active'.

SSRF mitigation is non-negotiable: resolve the hostname, reject private/link-local IP ranges, block redirects to internal addresses, and consider disallowing raw IPs unless enterprise customers require them. Fetch through a dedicated verification worker with short timeouts—not from your admin API process.
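The address check is the part that is easiest to get subtly wrong, so here is a deliberately narrow sketch. It only classifies dotted-quad IPv4 literals; the function names are hypothetical, and a real verifier would also need to resolve DNS, check every resolved address (including IPv6), and repeat the check after each redirect.

```typescript
// Parses a dotted-quad IPv4 literal; returns null for anything else.
function parseIPv4(host: string): number[] | null {
  const parts = host.split(".");
  if (parts.length !== 4) return null;
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  return octets.every((o) => o >= 0 && o <= 255) ? octets : null;
}

// True when the literal falls in a range that must never be fetched.
// Returns false for non-IPv4 input: hostnames must be resolved first and
// each resulting address checked, which this sketch does not do.
function isBlockedIPv4(host: string): boolean {
  const ip = parseIPv4(host);
  if (ip === null) return false;
  const [a, b] = ip;
  return (
    a === 0 || // 0.0.0.0/8 "this network"
    a === 10 || // 10.0.0.0/8 private
    a === 127 || // 127.0.0.0/8 loopback
    (a === 100 && b >= 64 && b <= 127) || // 100.64.0.0/10 shared address space
    (a === 169 && b === 254) || // 169.254.0.0/16 link-local, cloud metadata
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12 private
    (a === 192 && b === 168) // 192.168.0.0/16 private
  );
}
```

Run the check against the address you actually connect to, not just the hostname the customer typed, so a DNS change between validation and fetch cannot bypass it.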

Automatic disable after sustained failure

After max attempts or a streak of 401/403/404, transition the subscriber to disabled and emit an internal alert plus a customer-visible notification (email or dashboard). Include:

  • Last error summary (status code, truncated body).
  • Link to replay documentation.
  • Reminder to rotate secrets if signatures failed verification on their side.

Re-enablement should require explicit customer action (fix URL, confirm challenge, or click “resume deliveries”) so you do not resume into a known-bad configuration.
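The disable rule can be expressed as a small state transition. In this sketch the `hard_failure_streak` field, the threshold of 5, and the function name are assumptions; the streak counts 401/403/404 responses since the last success.

```typescript
type EndpointHealth = {
  endpoint_status: "active" | "disabled";
  hard_failure_streak: number; // 401/403/404 responses since the last success
};

const HARD_FAILURE_LIMIT = 5; // assumed threshold

// Returns the endpoint's next health state after one attempt.
// deliveryExhausted is true when that delivery just used its last attempt.
function applyAttemptResult(
  health: EndpointHealth,
  status: number | null,
  deliveryExhausted: boolean
): EndpointHealth {
  // Disabled endpoints stay disabled until an explicit customer action.
  if (health.endpoint_status === "disabled") return health;

  if (status !== null && status >= 200 && status < 300) {
    return { endpoint_status: "active", hard_failure_streak: 0 };
  }

  const isHardFailure = status === 401 || status === 403 || status === 404;
  const streak = health.hard_failure_streak + (isHardFailure ? 1 : 0);
  const disable = deliveryExhausted || streak >= HARD_FAILURE_LIMIT;

  return {
    endpoint_status: disable ? "disabled" : "active",
    hard_failure_streak: streak,
  };
}
```

The caller would emit the internal alert and customer notification when the returned status changes from active to disabled.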

Manual replay and event browsing

Operators and customers will ask to replay evt_01j2k3m4n5p6. Replay creates a new delivery (or resets an existing failed delivery) with a fresh attempt chain; do not mutate historical attempt rows. Expose list/filter APIs by type, time range, and delivery status—this reduces support load more than any retry tweak.
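A replay that leaves history untouched might look like the following sketch, which takes the "new delivery" option. The `replay_of` and `triggered_by` audit fields, the type name, and the helper are assumptions for illustration.

```typescript
type ReplayableDelivery = {
  id: string;
  event_id: string;
  subscriber_id: string;
  status: "pending" | "delivered" | "failed";
  attempt_count: number;
  next_attempt_at: Date;
  replay_of: string | null; // hypothetical audit field: source delivery id
  triggered_by: string | null; // hypothetical audit field: who asked for it
};

// Builds a fresh delivery for the same event and subscriber. The original
// row and its attempts are left untouched so the audit trail stays intact.
function replayDelivery(
  original: ReplayableDelivery,
  newId: string,
  triggeredBy: string,
  now: Date = new Date()
): ReplayableDelivery {
  return {
    id: newId,
    event_id: original.event_id,
    subscriber_id: original.subscriber_id,
    status: "pending",
    attempt_count: 0,
    next_attempt_at: now,
    replay_of: original.id,
    triggered_by: triggeredBy,
  };
}
```

Because the replay carries the same event_id, customers who deduplicate by event ID will treat it as a duplicate if the original was in fact processed.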

Practical example: delivery worker with signing and bounded retries

The following TypeScript sketch shows the worker loop, HMAC signing, attempt persistence, and retry scheduling. It intentionally omits framework glue; the structure maps to a cron worker, a queue consumer, or a SKIP LOCKED poller.

import { createHmac } from "node:crypto";

type DeliveryRow = {
  id: string;
  event_id: string;
  subscriber_id: string;
  endpoint_url: string;
  signing_secret: string;
  attempt_count: number;
  max_attempts: number;
  status: "pending" | "delivered" | "failed" | "disabled";
  next_attempt_at: Date;
  payload: { id: string; type: string; created_at: string; data: unknown };
};

const ATTEMPT_TIMEOUT_MS = 10_000;
const BASE_DELAY_SEC = 60;
const MAX_DELAY_SEC = 86_400;

function signBody(secret: string, timestampSec: number, rawBody: string): string {
  const msg = `${timestampSec}.${rawBody}`;
  return "v1=" + createHmac("sha256", secret).update(msg).digest("hex");
}

function computeNextAttemptAt(attemptCount: number): Date {
  const exp = Math.min(MAX_DELAY_SEC, BASE_DELAY_SEC * 2 ** attemptCount);
  const jitterSec = Math.floor(Math.random() * exp);
  return new Date(Date.now() + jitterSec * 1000);
}

function shouldRetry(status: number | null, err: unknown): boolean {
  if (err) return true;
  if (status === null) return true;
  if (status === 408 || status === 429) return true;
  if (status >= 500) return true;
  return false;
}

export async function deliverOne(
  delivery: DeliveryRow,
  deps: {
    fetch: typeof fetch;
    recordAttempt: (row: {
      delivery_id: string;
      attempt_number: number;
      http_status: number | null;
      duration_ms: number;
      error?: string;
    }) => Promise<void>;
    updateDelivery: (id: string, patch: Partial<DeliveryRow>) => Promise<void>;
  }
): Promise<void> {
  const attemptNumber = delivery.attempt_count + 1;
  const rawBody = JSON.stringify(delivery.payload);
  const timestampSec = Math.floor(Date.now() / 1000);
  const signature = signBody(delivery.signing_secret, timestampSec, rawBody);

  const started = Date.now();
  let httpStatus: number | null = null;
  let error: string | undefined;

  try {
    const res = await deps.fetch(delivery.endpoint_url, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "User-Agent": "AcmeWebhooks/1.0",
        "X-Webhook-Id": delivery.payload.id,
        "X-Webhook-Timestamp": String(timestampSec),
        "X-Webhook-Signature": signature,
        "X-Delivery-Id": delivery.id,
      },
      body: rawBody,
      signal: AbortSignal.timeout(ATTEMPT_TIMEOUT_MS),
    });
    httpStatus = res.status;

    if (res.ok) {
      await deps.recordAttempt({
        delivery_id: delivery.id,
        attempt_number: attemptNumber,
        http_status: httpStatus,
        duration_ms: Date.now() - started,
      });
      await deps.updateDelivery(delivery.id, {
        status: "delivered",
        attempt_count: attemptNumber,
      });
      return;
    }
  } catch (e) {
    error = e instanceof Error ? e.message : "delivery_error";
  }

  await deps.recordAttempt({
    delivery_id: delivery.id,
    attempt_number: attemptNumber,
    http_status: httpStatus,
    duration_ms: Date.now() - started,
    error,
  });

  if (!shouldRetry(httpStatus, error)) {
    await deps.updateDelivery(delivery.id, {
      status: "failed",
      attempt_count: attemptNumber,
    });
    return;
  }

  if (attemptNumber >= delivery.max_attempts) {
    await deps.updateDelivery(delivery.id, {
      status: "failed",
      attempt_count: attemptNumber,
    });
    return;
  }

  await deps.updateDelivery(delivery.id, {
    status: "pending",
    attempt_count: attemptNumber,
    next_attempt_at: computeNextAttemptAt(attemptNumber),
  });
}

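/**
 * Worker: claim pending deliveries with SKIP LOCKED and attempt each once.
 * recordAttempt and updateDelivery are no-op placeholders here; wire them to
 * your attempts and deliveries tables.
 */
export async function runDeliveryWorker(claimBatch: () => Promise<DeliveryRow[]>) {
  const batch = await claimBatch();
  // allSettled waits for every delivery even if one rejects (for example on a
  // persistence error); inspect and log rejected results in real code.
  await Promise.allSettled(
    batch.map((d) =>
      deliverOne(d, {
        fetch,
        recordAttempt: async () => {},
        updateDelivery: async () => {},
      })
    )
  );
}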

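Wire claimBatch to your database. Row locks taken by FOR UPDATE SKIP LOCKED last only until the claiming transaction ends, so either keep that transaction open while delivering or mark the claimed rows (for example by pushing next_attempt_at forward) before committing: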

SELECT d.*
FROM webhook_deliveries d
WHERE d.status = 'pending'
  AND d.next_attempt_at <= now()
ORDER BY d.next_attempt_at
LIMIT 50
FOR UPDATE SKIP LOCKED;

Index (status, next_attempt_at) for the poller. Index (subscriber_id, status) for per-tenant dashboards.

Observability and support contracts

Metrics worth emitting from day one:

  • webhook_delivery_attempts_total — Labels: event_type, http_status_class, subscriber_tier.
  • webhook_delivery_latency_seconds — Histogram of end-to-end attempt duration.
  • webhook_deliveries_pending — Gauge by age bucket (<5m, <1h, >24h).
  • webhook_subscribers_disabled_total — Counter when auto-disable fires.

Logs should include delivery_id, event_id, subscriber_id, and never log signing secrets or full customer response bodies (truncate to a few hundred bytes).
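A small helper can enforce the truncation rule before anything reaches the logger. The 300-byte limit and the function name below are assumptions; pick a limit that suits your log pipeline.

```typescript
const MAX_LOGGED_BODY_BYTES = 300; // assumed limit

// Truncates a response body to at most maxBytes of UTF-8 before logging.
function truncateForLog(
  body: string,
  maxBytes: number = MAX_LOGGED_BODY_BYTES
): string {
  const bytes = new TextEncoder().encode(body);
  if (bytes.length <= maxBytes) return body;
  // A cut inside a multi-byte character decodes to U+FFFD, which is
  // acceptable for log output.
  const head = new TextDecoder().decode(bytes.subarray(0, maxBytes));
  return head + "...[truncated]";
}
```

Measuring in bytes instead of characters keeps the bound meaningful when customer responses contain multi-byte text.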

For enterprise customers, document SLAs honestly: “at-least-once within N minutes under normal conditions” beats vague “real-time” marketing. Offer a status page or health dashboard showing recent failure rates per endpoint.

Common mistakes and pitfalls

  1. Synchronous delivery in the request path — Commit succeeds, HTTP to customer fails, and you have no durable retry record. Always persist before POST.
  2. Retrying 401 and 404 forever — Wastes workers; disable quickly and notify the customer.
  3. No per-subscriber bulkheads — One broken endpoint stalls delivery for everyone sharing the worker pool.
  4. Weak endpoint verification — Accepting http://169.254.169.254/ or internal hostnames turns registration into SSRF.
  5. Re-serializing JSON for signing — Customers verify against raw bytes; pretty-print differences break HMAC checks.
  6. Mutable event payloads on retry — The same event_id must carry identical data on every attempt unless you version events explicitly.
  7. Missing idempotency guidance — Customers must dedupe by event_id; say so in docs and echo the id in headers.
  8. No replay API — Support teams manually re-insert rows in production databases. Build a first-class replay that audits who triggered it.

Conclusion

Outbound webhooks are a small distributed system you ship to every customer integration. Durable deliveries and attempts, HMAC signing with rotation, backoff with jitter and hard caps, per-subscriber concurrency limits, and explicit disable/replay lifecycle turn “we emitted an event” into a supportable product promise.

The teams that operate these systems calmly treat customer endpoints as unreliable dependencies: isolate failure, observe every attempt, and fail visibly when an integration is broken instead of hiding behind infinite retries. That discipline is exactly what production SaaS platforms need when event volume grows and a single bad URL can no longer be a manual ops ticket.

If you are designing webhook delivery, event fan-out, or subscriber management for a multi-tenant API, the patterns here are the baseline I apply before discussing SLAs or enterprise features. For architecture reviews or implementation help on scalable notification pipelines, see contact.
