Circuit breakers, bulkheads, and timeouts: isolating failure in distributed systems
How to combine circuit breakers, resource isolation, and explicit timeouts so one slow dependency does not take down your API—patterns, trade-offs, and Node.js-oriented examples.
A checkout flow calls a fraud-scoring service. Under Black Friday load, that service starts responding in tens of seconds instead of hundreds of milliseconds. Without guardrails, your Node workers pile up waiting: thread pools (if any), connection pools, and the event loop itself become saturated with pending I/O. Healthy paths—reading from cache, serving static assets, health checks—start failing alongside the bad path. In consulting work, this pattern shows up whenever a synchronous-looking dependency (HTTP, gRPC, Redis) sits on the critical path without bounded wait times and failure containment.
This article explains three complementary mechanisms—timeouts, circuit breakers, and bulkheads—why each addresses a different failure mode, and how to combine them in production-oriented services.
Why “let it fail slowly” is worse than a hard error
Distributed systems fail in partial, messy ways. Dependencies do not always return clean 503 responses; they often degrade into long tails. A client that waits indefinitely amplifies the problem:
- Queueing delay grows superlinearly as upstream latency spikes.
- Retries (even with idempotency) multiply load on an already struggling peer.
- Cascading failure: your service becomes unavailable because it is waiting on someone else’s outage.
The goal is not to hide errors from callers forever—it is to fail fast, shed load when a peer is unhealthy, and isolate resources so one bad integration cannot exhaust shared capacity across the whole process.
Timeouts: bounding the wait
A timeout is the simplest contract: “I will not wait longer than T for this operation.” It applies to outbound HTTP calls, database queries, message broker publishes, and DNS lookups.
What to set the timeout to
There is no universal number. Derive it from:
- SLO for the caller: if the API promises p99 < 500 ms end-to-end, the dependency budget is whatever remains after your own work—often tens to low hundreds of milliseconds per hop.
- Observed latency: use percentiles from staging or production, not local dev.
- Retry policy: if you retry twice, the total worst-case wait is roughly `timeout × attempts`, unless you use jittered backoff and smaller per-attempt budgets.
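To make that budget arithmetic concrete, here is a small helper — the name and signature are invented for this example, not taken from any library — that computes the worst-case wall-clock wait for a retry policy with a fixed per-attempt timeout and exponential backoff between attempts:

```typescript
/**
 * Worst-case wall-clock wait for a retry policy:
 * every attempt times out, plus exponential backoff pauses between attempts.
 * Jitter would randomize the pauses; this returns the upper bound.
 */
export function worstCaseWaitMs(
  timeoutMs: number,
  attempts: number,
  baseBackoffMs: number,
): number {
  let total = attempts * timeoutMs;
  for (let i = 1; i < attempts; i++) {
    total += baseBackoffMs * 2 ** (i - 1);
  }
  return total;
}
```

With a 200 ms per-attempt timeout, 3 attempts, and a 50 ms base backoff, the caller can be stuck for 750 ms in the worst case — worth checking against the end-to-end SLO before choosing a retry count.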
Client vs server timeouts
Connect timeouts cover TCP/TLS establishment; read (idle) timeouts cover “no bytes received for X ms.” Both matter: a peer that accepts the connection but never sends a body still burns resources without a read timeout.
Trade-offs
Too short: false positives—sporadic GC pauses or cross-AZ jitter trigger failures. Too long: you absorb someone else’s outage. Tune with metrics: timeout rate, retry rate, and latency percentiles together.
Circuit breakers: stop hammering the sick dependency
A circuit breaker wraps calls to a dependency and tracks recent outcomes. After enough failures (or slow responses, if you use latency thresholds), the breaker opens: subsequent calls fail immediately without hitting the network. After a reset timeout, it enters a half-open state and allows a probe request; success closes the circuit, failure reopens it.
Why it helps
- Protects the unhealthy peer from a retry storm.
- Frees your workers to return errors or fallback behavior quickly.
- Gives operators a clear signal: elevated “circuit open” metrics correlate with dependency incidents.
Implementation notes
Libraries (e.g. Opossum in Node, resilience4j on the JVM) differ, but the core ideas are the same:
- Failure criteria: count HTTP 5xx, thrown errors, or timeouts as failures; decide whether 429/403 should trip the breaker (often no, unless you treat them as overload signals).
- Volume: require a minimum request volume before opening on error rate alone, so cold starts do not flip the breaker on the first error.
- Half-open concurrency: allow only a small number of trial requests when recovering.
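These knobs can be made concrete with a minimal state machine. The class below is an illustrative sketch, not a replacement for Opossum or resilience4j; the names (`CircuitBreaker`, `allowRequest`, `record`) and default thresholds are mine, and the clock is injectable only to keep the example testable:

```typescript
type State = "closed" | "open" | "half-open";

export class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private total = 0;
  private openedAt = 0;

  constructor(
    private readonly opts = {
      failureRateThreshold: 0.5, // open at >= 50% failures...
      minimumVolume: 5,          // ...but only after enough traffic
      resetTimeoutMs: 30_000,    // how long to stay open before probing
    },
    private readonly now: () => number = Date.now, // injectable clock
  ) {}

  /** Call before each request; false means fail fast without touching the network. */
  allowRequest(): boolean {
    if (this.state === "open") {
      if (this.now() - this.openedAt >= this.opts.resetTimeoutMs) {
        this.state = "half-open"; // allow a single probe request
        return true;
      }
      return false;
    }
    return true;
  }

  /** Call with the outcome of each allowed request. */
  record(success: boolean): void {
    if (this.state === "half-open") {
      // the probe's outcome decides recovery
      if (success) this.close();
      else this.open();
      return;
    }
    this.total++;
    if (!success) this.failures++;
    if (
      this.total >= this.opts.minimumVolume &&
      this.failures / this.total >= this.opts.failureRateThreshold
    ) {
      this.open();
    }
  }

  currentState(): State {
    return this.state;
  }

  private open(): void {
    this.state = "open";
    this.openedAt = this.now();
    this.failures = 0;
    this.total = 0;
  }

  private close(): void {
    this.state = "closed";
    this.failures = 0;
    this.total = 0;
  }
}
```

A production breaker would also use a sliding window instead of cumulative counters and bound half-open concurrency explicitly, but the state transitions are the same.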
Limitations
A circuit breaker does not create capacity; it redirects failure. You still need timeouts—otherwise a “closed” circuit can still block on a hung socket. Breakers also do not fix logical bugs; they only react to observed failure patterns.
Bulkheads: isolating resource pools
A bulkhead partitions shared resources so that overload in one area does not drain another. Named after ship compartments: one flooded section does not sink the vessel.
Common applications
- Separate connection pools per downstream service (or per criticality tier), so a runaway fan-out to Service A cannot exhaust the pool used for Service B.
- Dedicated worker pools or separate processes for admin or batch traffic vs interactive API traffic.
- Queue depth limits with shedding or 503 when full, instead of unbounded in-memory queues.
In Node.js specifically
Node’s event loop is single-threaded, but connection pools, open sockets, and memory for buffered responses are still finite. A bulkhead strategy might mean separate HTTP agents with distinct maxSockets, separate Redis clients, or even splitting high-risk paths into a separate service with its own scaling and failure domain.
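As a sketch of the per-dependency pool idea (the pool names and sizes are invented for illustration), two `node:http` Agents with distinct `maxSockets` keep a saturated fraud-scoring path from consuming sockets reserved for other calls:

```typescript
import { Agent } from "node:http";

// One agent per downstream: exhausting service A's sockets cannot
// touch the pool used for service B.
export const pools = {
  fraudScoring: new Agent({ keepAlive: true, maxSockets: 20 }),
  // Optional enrichment gets a deliberately smaller pool.
  recommendations: new Agent({ keepAlive: true, maxSockets: 5 }),
};
```

Each agent is then passed to the HTTP client used for that downstream, so pool sizing becomes an explicit, per-dependency decision rather than a process-wide default.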
Trade-offs
More bulkheads mean higher baseline resource use (idle pools) and operational complexity. Start with clear separation between user-facing and background workloads; split further when metrics show cross-contamination.
How the three fit together
| Mechanism | Primary problem addressed |
|---|---|
| Timeout | Unbounded waits, hung connections |
| Circuit | Retry storms, sustained dependency failure |
| Bulkhead | Shared pool exhaustion, cross-tenant bleed |
In practice: wrap each outbound call with a timeout, group calls to the same dependency behind a circuit breaker, and assign those clients to dedicated pools (bulkheads) sized for expected concurrency.
Practical example: typed wrapper in TypeScript
The following sketch shows a small dependency client with an explicit timeout and placeholders for circuit and pool configuration. It is not a drop-in replacement for a full library, but it illustrates how layers compose.
```typescript
import { fetch, Agent } from "undici";

type FetchJsonOptions = {
  /** Abort after this many ms (read + connect budget simplified as one) */
  timeoutMs: number;
  /** Optional: pass a dedicated Agent for bulkheading */
  agent?: Agent;
};

export async function fetchJsonWithBounds<T>(
  url: string,
  opts: FetchJsonOptions,
): Promise<T> {
  const controller = new AbortController();
  const t = setTimeout(() => controller.abort(), opts.timeoutMs);
  try {
    const res = await fetch(url, {
      signal: controller.signal,
      // undici's fetch takes a Dispatcher here; a node:http Agent would not work
      dispatcher: opts.agent,
    });
    if (!res.ok) {
      throw new Error(`HTTP ${res.status}`);
    }
    return (await res.json()) as T;
  } finally {
    clearTimeout(t);
  }
}
```
In production you would:
- Replace bare `fetch` with a client instrumented for metrics (latency histogram, outcome labels).
- Wrap `fetchJsonWithBounds` with a circuit breaker from your stack (or a small state machine if you have only one such dependency).
- Instantiate one Agent per downstream, with its socket limit (`maxSockets` on `node:http`, `connections` on undici) aligned to bulkhead sizing and upstream rate limits.
For Express or Fastify handlers, enforce a request deadline: if the overall budget is exceeded, abort downstream work and return 503 or a degraded response—this is the server-side analogue of a client timeout.
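One way to sketch that server-side deadline — `withDeadline` and `DeadlineError` are assumptions of this example, not a framework API — is a race between the handler's work and an abort timer:

```typescript
export class DeadlineError extends Error {
  constructor(budgetMs: number) {
    super(`deadline of ${budgetMs} ms exceeded`);
    this.name = "DeadlineError";
  }
}

/**
 * Run `work` under an overall budget. The AbortSignal is passed through
 * so the handler can cancel downstream fetches when the deadline fires.
 */
export async function withDeadline<T>(
  work: (signal: AbortSignal) => Promise<T>,
  budgetMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    return await Promise.race([
      work(controller.signal),
      new Promise<never>((_, reject) => {
        controller.signal.addEventListener("abort", () =>
          reject(new DeadlineError(budgetMs)),
        );
      }),
    ]);
  } finally {
    clearTimeout(timer);
  }
}
```

A handler would then call `withDeadline((signal) => doWork(signal), budget)` — where `doWork` stands in for the real route logic — and map `DeadlineError` to a 503 or degraded response.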
Common mistakes and pitfalls
- Timeouts only on the client: the server must also enforce limits (body size, handler duration) or attackers and buggy clients can hold connections open.
- Breaker on everything: low-volume endpoints may never reach minimum volume; combine with hedging or static health checks for rare calls.
- Same pool for critical and optional paths: optional enrichment (recommendations, analytics) should use a separate pool and stricter timeouts so it cannot starve core reads.
- Opening the circuit on business validation errors: tripping breakers on `400` responses hides client bugs and masks real availability metrics.
- Ignoring half-open storms: after recovery, too many instances probing at once can reload the dependency; use jitter and coordinated half-open limits where possible.
Conclusion
Timeouts bound how long you wait; circuit breakers bound how often you hammer a failing peer; bulkheads bound how much shared capacity a single dependency class can consume. Together, they turn “everything is slow” into localized, observable failure—which is a prerequisite for resilient APIs and for teams that need to ship predictable, production-ready systems under load.
Used consistently, these patterns make incidents smaller: dashboards show which breaker opened and which pool saturated, and fixes can target the right integration instead of scaling the entire fleet to absorb one bad hop.