Designing outbound HTTP clients for production: retries, Retry-After, and per-dependency budgets
How to classify failures, implement backoff with jitter, honor Retry-After, and cap retries per dependency so outbound integrations stay stable under load and partial outages.
Your service fans out to a payment processor, a CRM, and a notification channel. During a regional blip, every instance starts retrying aggressively. Within minutes you have amplified the outage: your aggregate outbound QPS exceeds the vendor’s steady-state limit, responses slow down, thread pools fill, and your own error rate climbs from a dependency you do not control. In consulting and product work, this pattern appears constantly—retries are necessary for resilience, but unbounded retries are a distributed denial-of-service aimed at yourself and your partners.
This article focuses on the client side of HTTP integrations: how to decide what to retry, how to space attempts, when to obey Retry-After, and how to cap work with per-dependency budgets so recovery helps instead of deepens incidents. It complements server-side patterns such as idempotency keys and circuit breakers; together they form a coherent story for production-grade integrations.
Why naive retries fail
A common first implementation wraps fetch in a loop: on any error, wait a fixed delay (say, one second) and try again, up to N times. That fails for several independent reasons:
- Retrying non-retryable semantics. 400 validation errors and 401 auth failures will not succeed on repeat without changing inputs or credentials. Retrying wastes capacity and can trigger abuse controls.
- Thundering herds. A fixed delay synchronizes many clients; they wake up together and spike traffic. Jitter breaks correlation across processes.
- Ignoring server intent. 429 Too Many Requests and 503 Service Unavailable often include Retry-After. Ignoring it signals that your client does not participate in the shared stability contract.
- No global cap. Each goroutine, worker, or request handler may apply its own retry loop. Ten thousand concurrent users can become ten thousand times your intended retry fan-out.
The goal is not “more retries,” but controlled, cooperative retries that respect deadlines, idempotency constraints, and vendor signals.
Classifying outcomes before you retry
Treat the decision to retry as a policy function of method, status, error type, and elapsed budget—not as a reflex.
Transport and client-side failures
DNS failures, connection resets, TLS handshake errors, and ETIMEDOUT usually indicate transient conditions or local network issues. These are typical candidates for retry if the operation is idempotent or paired with an idempotency key on the server.
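A minimal sketch of how transport failures might be classified. It assumes a Node.js runtime where fetch reports network failures as an error whose cause carries a system error code; verify that on your runtime. The helper names and the code list are hypothetical and deliberately short, and TLS handshake errors are left out because their codes vary by stack.

```typescript
// Assumption: these codes cover DNS failures, resets, and timeouts; not exhaustive.
const TRANSIENT_ERROR_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "ENOTFOUND", "EAI_AGAIN"]);

/** Read a string `code` property from an error-like value, if present. */
function errorCode(value: unknown): string | undefined {
  if (typeof value !== "object" || value === null) return undefined;
  const code = (value as { code?: unknown }).code;
  return typeof code === "string" ? code : undefined;
}

/** True when the error itself, or its `cause`, carries a transient code. */
function isTransientTransportError(err: unknown): boolean {
  const direct = errorCode(err);
  if (direct !== undefined) return TRANSIENT_ERROR_CODES.has(direct);
  const cause =
    typeof err === "object" && err !== null ? (err as { cause?: unknown }).cause : undefined;
  const nested = errorCode(cause);
  return nested !== undefined && TRANSIENT_ERROR_CODES.has(nested);
}
```

A transient classification only says the failure may clear on its own; whether the call is safe to repeat still depends on idempotency.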
HTTP status codes (rules of thumb)
| Category | Examples | Typical retry? |
|---|---|---|
| Success | 2xx | No |
| Client error (stable) | 400, 404, 409 (conflict), 422 | No—fix payload or workflow |
| Client error (auth) | 401, 403 | No—refresh credentials or stop |
| Rate limit | 429 | Yes, after honoring Retry-After and backoff |
| Server / overload | 500, 502, 503, 504 | Sometimes—prefer Retry-After when present |
| Early hints / redirect | 103, 3xx | Follow redirects; usually no extra “retry loop” |
POST without idempotency is the sharp edge: a timeout after the server accepted work is ambiguous. Prefer server support for idempotency keys or natural idempotency (e.g., upsert by stable business key) before retrying mutating calls.
Idempotency and HTTP methods
GET, HEAD, OPTIONS, and well-defined PUT/DELETE on stable resources are safer to retry than arbitrary POST. For PATCH, semantics vary; treat as POST unless your API contract guarantees idempotency.
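The rules of thumb above can be sketched as a small policy function. This is one possible reading, not a standard: the type and function names are hypothetical, 500 is left out as a conservative choice, and hasIdempotencyKey assumes the server actually deduplicates on the key you send.

```typescript
type RetryDecisionInput = {
  method: string;
  status: number;
  /** Assumption: true only when the server honors the idempotency key. */
  hasIdempotencyKey: boolean;
};

// PATCH and POST are absent on purpose: treat them as non-idempotent by default.
const IDEMPOTENT_METHODS = new Set(["GET", "HEAD", "OPTIONS", "PUT", "DELETE"]);
// Rate limiting and overload statuses from the table above.
const RETRYABLE_STATUSES = new Set([429, 502, 503, 504]);

function shouldRetryResponse(input: RetryDecisionInput): boolean {
  // Stable client errors (400, 401, 404, 422, ...) never qualify.
  if (!RETRYABLE_STATUSES.has(input.status)) return false;
  // Mutating calls qualify only with an idempotency key.
  return IDEMPOTENT_METHODS.has(input.method.toUpperCase()) || input.hasIdempotencyKey;
}
```

A fuller policy would also take the elapsed time and the remaining retry budget as inputs.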
Backoff with jitter: shapes and trade-offs
Exponential backoff increases delay between attempts: with base delay b, attempt index k, and cap C, a common schedule is roughly “double the delay each attempt until you hit the cap.” That reduces load on a recovering dependency compared to fixed intervals.
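As a worked example of that schedule, the sketch below computes min(C, b * 2^k) for the first few attempts. The function name and the values b = 200 ms and C = 8000 ms are illustrative assumptions.

```typescript
/** Delay before retry k (0-based), without jitter: min(capMs, baseMs * 2^k). */
function cappedExponentialDelayMs(baseMs: number, capMs: number, attemptIndex: number): number {
  return Math.min(capMs, baseMs * 2 ** attemptIndex);
}

const exampleSchedule = Array.from({ length: 7 }, (_, k) =>
  cappedExponentialDelayMs(200, 8_000, k)
);
// exampleSchedule is [200, 400, 800, 1600, 3200, 6400, 8000]
```

The seventh value would be 12800 ms uncapped; the cap holds it at 8000 ms.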
Jitter randomizes the wait so clients do not align. Two common approaches:
- Full jitter (Amazon’s classic pattern): sample the wait uniformly between zero and the capped exponential value. Lower expected wait, strong desynchronization; slightly more variance for individual callers.
- Equal jitter: wait for half of the capped exponential value plus a uniform random amount in the other half. Tighter lower bound than full jitter, still breaks herd behavior.
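The two shapes can be compared side by side in a short sketch. Both functions take the already capped exponential value as input; the names are hypothetical, and the random source is a parameter only so the behavior is easy to check.

```typescript
/** Returns a value in [0, 1), like Math.random. */
type Rng = () => number;

/** Full jitter: uniform in [0, cappedMs). */
function fullJitterMs(cappedMs: number, rng: Rng = Math.random): number {
  return rng() * cappedMs;
}

/** Equal jitter: fixed half plus a uniform amount in the other half. */
function equalJitterMs(cappedMs: number, rng: Rng = Math.random): number {
  const half = cappedMs / 2;
  return half + rng() * half;
}
```

For a capped value of 800 ms, full jitter can return anything from 0 up to 800 ms, while equal jitter never returns less than 400 ms.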
Trade-off: longer caps and more attempts improve success probability under long outages but extend tail latency for user-facing paths. User-facing calls often need short budgets and fast fail with a clear error; background jobs can afford longer caps.
Honoring Retry-After
Servers send Retry-After as either:
- HTTP-date — an absolute time to try again
- delay-seconds — a non-negative integer delay in seconds
When a 429 or 503 response carries Retry-After, a robust client should:
- Parse the header according to RFC 9110 semantics
- Compute waitMs = max(0, suggestedWait - alreadyElapsed) relative to your attempt clock
- Combine with your backoff policy using max(backoffJitter, retryAfterWait) so you never retry sooner than the server asked, but you may wait longer if your exponential schedule says so (some teams use max exclusively for throttling responses to avoid under-waiting)
A cache-related 503 with Retry-After: 0 is a special case that means “retry immediately” in some stacks. Even then, apply a small jittered delay to avoid tight loops.
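The combination rule can be sketched as a pure function. It assumes the Retry-After value has already been parsed to milliseconds, and the 50 ms floor is an arbitrary illustrative default; all names are hypothetical.

```typescript
type WaitInputs = {
  /** Jittered backoff for this attempt, in ms. */
  backoffMs: number;
  /** Parsed Retry-After in ms, or null when the header was absent. */
  retryAfterMs: number | null;
  /** Time already spent since the response arrived, in ms. */
  elapsedSinceResponseMs: number;
  /** Upper bound of the small jittered floor (assumed default: 50 ms). */
  floorMs?: number;
};

function resolveWaitMs(w: WaitInputs, rng: () => number = Math.random): number {
  // Subtract time already elapsed; clamp so the wait is never negative.
  const serverWaitMs =
    w.retryAfterMs === null ? 0 : Math.max(0, w.retryAfterMs - w.elapsedSinceResponseMs);
  // Never retry sooner than the server asked; wait longer if backoff says so.
  const chosenMs = Math.max(w.backoffMs, serverWaitMs);
  // A small random floor prevents tight loops when both values are zero.
  const jitteredFloorMs = rng() * (w.floorMs ?? 50);
  return Math.max(chosenMs, jitteredFloorMs);
}
```

In production you would likely also cap the result against the remaining deadline, so that a very large Retry-After fails fast instead of parking the caller.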
Retry budgets and per-dependency limits
A retry budget is a ceiling on retry attempts per logical operation or per time window for a dependency. Examples:
- At most 2 retries for a synchronous user-facing checkout step; surface failure and let support or async reconciliation handle edge cases
- At most 5 retries over 60 seconds for a background enrichment worker, with circuit breaker integration
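A time-window budget like the second example might look as follows. This is a minimal in-process sketch: the class and method names are hypothetical, and it ignores coordination across instances, which a real deployment would need to consider.

```typescript
/** Allows at most maxRetries retries within any sliding window of windowMs. */
class RetryBudget {
  private stamps: number[] = [];
  private readonly maxRetries: number;
  private readonly windowMs: number;

  constructor(maxRetries: number, windowMs: number) {
    this.maxRetries = maxRetries;
    this.windowMs = windowMs;
  }

  /** Records a retry and returns true if the window still has room. */
  tryAcquire(nowMs: number = Date.now()): boolean {
    // Drop retries that have aged out of the window.
    this.stamps = this.stamps.filter((t) => t > nowMs - this.windowMs);
    if (this.stamps.length >= this.maxRetries) return false;
    this.stamps.push(nowMs);
    return true;
  }
}

// One budget per dependency; the key is an illustrative name.
const retryBudgets = new Map<string, RetryBudget>([
  ["enrichment", new RetryBudget(5, 60_000)],
]);
```

A caller checks tryAcquire before each retry and treats a false result as a terminal failure for that operation.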
Per-dependency concurrency limits (bulkheads) ensure one flaky partner cannot exhaust all sockets or worker threads. Pair outbound pools with timeouts and cancellation propagated from the inbound request when applicable.
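A bulkhead can be approximated with a small counting semaphore around each dependency's calls. The sketch below is an assumption-laden illustration with hypothetical names: it queues excess callers without bound and does not apply timeouts or cancellation, both of which you would add in production.

```typescript
/** Limits how many tasks for one dependency run at the same time. */
class Bulkhead {
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) next(); // pass the slot on; active stays unchanged
      else this.active--; // nobody waiting: free the slot
    }
  }
}
```

Creating one Bulkhead per partner keeps a slow dependency from consuming the slots another dependency needs.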
Observability: what to measure
Log or emit metrics for, at minimum:
- Attempt number, outcome class (success, retryable failure, terminal failure), and latency per attempt
- Whether Retry-After was present and the chosen wait
- Correlation and trace ids propagated to the partner (when allowed) for support tickets
Sudden increases in retry rate often precede partner incidents; dashboards on retry ratio (retries ÷ total attempts) catch unhealthy integrations early.
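The retry ratio is cheap to track in process. Here is a minimal sketch with hypothetical names; in practice these counters would feed whatever metrics library you already use, tagged per dependency.

```typescript
type AttemptOutcome = "success" | "retryable_failure" | "terminal_failure";

class RetryStats {
  private totalAttempts = 0;
  private retryAttempts = 0;
  private readonly byOutcome: Record<AttemptOutcome, number> = {
    success: 0,
    retryable_failure: 0,
    terminal_failure: 0,
  };

  /** attemptNumber is 1-based; anything above 1 counts as a retry. */
  record(attemptNumber: number, outcome: AttemptOutcome): void {
    this.totalAttempts++;
    if (attemptNumber > 1) this.retryAttempts++;
    this.byOutcome[outcome]++;
  }

  /** Retries divided by total attempts; 0 when nothing was recorded. */
  retryRatio(): number {
    return this.totalAttempts === 0 ? 0 : this.retryAttempts / this.totalAttempts;
  }

  count(outcome: AttemptOutcome): number {
    return this.byOutcome[outcome];
  }
}
```

For example, one operation that succeeds on its third attempt plus one that succeeds immediately gives four attempts, two of them retries, so the ratio is 0.5.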
Practical example: a small TypeScript policy wrapper
The following is a self-contained sketch for Node.js-style fetch. It is not a drop-in library; it shows how policy pieces fit together. Replace sleep with abort-aware waiting in production if you tie retries to AbortSignal.
```typescript
type RetryPolicy = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  /** Respect Retry-After on 429 / 503 when present */
  respectRetryAfter: boolean;
};

const defaultPolicy: RetryPolicy = {
  maxAttempts: 4,
  baseDelayMs: 200,
  maxDelayMs: 8_000,
  respectRetryAfter: true,
};

/** Parse Retry-After (delay-seconds or HTTP-date) into ms; null if absent or invalid. */
function parseRetryAfter(header: string | null, now: Date): number | null {
  if (!header) return null;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) {
    return parseInt(trimmed, 10) * 1000;
  }
  const when = Date.parse(trimmed);
  if (!Number.isNaN(when)) {
    // Clamp to zero so clock skew never produces a negative wait.
    return Math.max(0, when - now.getTime());
  }
  return null;
}

/** Full jitter: uniform wait in [0, min(maxDelayMs, baseDelayMs * 2^attempt)]. */
function fullJitterBackoff(attempt: number, policy: RetryPolicy): number {
  const exp = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** attempt);
  return Math.floor(Math.random() * (exp + 1));
}

function isRetryableStatus(status: number): boolean {
  return status === 429 || status === 408 || status === 502 || status === 503 || status === 504;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function fetchWithRetries(
  input: RequestInfo | URL,
  init: RequestInit | undefined,
  policy: RetryPolicy = defaultPolicy
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < policy.maxAttempts; attempt++) {
    const isLastAttempt = attempt === policy.maxAttempts - 1;
    try {
      const res = await fetch(input, init);
      if (res.ok) return res;
      // isRetryableStatus is the single source of truth, so 408 is retried too.
      // Every other status (400, 401, 404, 500, ...) goes back to the caller.
      if (!isRetryableStatus(res.status) || isLastAttempt) return res;
      let waitMs = fullJitterBackoff(attempt, policy);
      if (policy.respectRetryAfter && (res.status === 429 || res.status === 503)) {
        const ra = parseRetryAfter(res.headers.get("retry-after"), new Date());
        // Never retry sooner than the server asked.
        if (ra !== null) waitMs = Math.max(waitMs, ra);
      }
      // Discard the unread body so the connection can be released or reused.
      await res.body?.cancel().catch(() => undefined);
      await sleep(waitMs);
    } catch (err) {
      lastError = err;
      // Stop once the caller's signal has aborted: further attempts would fail at once.
      if (isLastAttempt || init?.signal?.aborted) throw err;
      await sleep(fullJitterBackoff(attempt, policy));
    }
  }
  throw lastError instanceof Error ? lastError : new Error(String(lastError));
}
```
Usage for an idempotent GET:
```typescript
const res = await fetchWithRetries("https://api.partner.example/v1/ledger/abc", {
  headers: { Authorization: `Bearer ${token}` },
  // The same signal is passed to every attempt, so 5 s is an overall deadline.
  signal: AbortSignal.timeout(5_000),
});
```
For mutating POST calls, only use such a wrapper when the upstream documents idempotency or you have a deduplication story; otherwise fail closed after ambiguous timeouts.
Common mistakes and pitfalls
Retrying without reading the body on errors. Some servers gate keep-alive or connection reuse on body consumption; always drain or cancel the body when discarding a response you will retry, or use libraries that handle this consistently.
Infinite retry on 5xx without caps. Combine max attempts, max elapsed time, and circuit breaking so a total dependency outage does not pin your process.
Double-applying jitter and Retry-After. Pick a single rule for combining them; document it so on-call engineers know why waits are long.
Per-request retry inside high fan-out workers. If each of 50 parallel tasks retries five times, one batch can send up to 300 requests instead of 50, a 6× amplification. Move retries to a shared client with concurrency limits or centralize calls through a queue.
Ignoring clock skew for HTTP-date Retry-After. Large skew can produce negative waits; clamp to zero or a small floor with jitter.
Conclusion
Outbound HTTP is a stability contract between your service and its partners: classify errors, back off with jitter, obey Retry-After, and cap retries with explicit budgets and bulkheads. Those practices turn retries from a hazard into a controlled recovery mechanism and keep integrations maintainable as traffic grows.
Key takeaways:
- Retry only retryable conditions; treat 4xx (except throttling) as signals to stop or fix inputs
- Use exponential backoff with jitter to avoid synchronized spikes
- Honor Retry-After on throttling and overload responses
- Enforce per-dependency budgets and timeouts, and align mutating retries with idempotency on the server
Teams building scalable, production-ready systems routinely invest in this layer early—it is cheaper than emergency rate-limit negotiations after an incident. For architecture questions or collaboration, the contact page is the right place to reach out.