HTTP API admission control: concurrency limits, queues, and load shedding

Cap concurrent HTTP handlers per instance, add bounded queue waits or fail fast with 503s, and align limits with database pools and Kubernetes readiness. Focused on Node.js production APIs.

Author: Matheus Palma · 7 min read
Software engineering · Backend · API design · Node.js · TypeScript · Reliability · Kubernetes

You scale out horizontally, autoscaling reacts to CPU, and yet during traffic spikes the fleet still enters a death spiral: latency climbs, clients retry, open connections multiply, and each instance accepts more work than it can finish before deadlines fire upstream. The failure mode is not always “the database is down.” Often it is unbounded admission: every accepted TCP connection becomes an active handler, and the runtime keeps scheduling microtasks until event-loop lag makes every response late.

This article is about admission control at the HTTP edge of your own service: how many requests may execute concurrently, whether you queue a small backlog or reject immediately, and how that interacts with platform signals (readiness, shutdown) and client behavior (retries, backoff). The patterns apply broadly; examples use Node.js-style concurrency because that is where unbounded async handlers bite hardest in production APIs I have helped teams harden in freelance and consulting work.

Admission control as a product of capacity planning

Admission control answers: “Given this process and this machine, may we start (or continue) this unit of work right now?” It is distinct from:

  • Rate limiting (often per identity or IP), which protects you from abuse and fairness violations but does not cap simultaneous expensive operations from legitimate users.
  • Circuit breaking on outbound calls, which stops you from hammering a sick dependency but does not stop your server from accepting inbound work that will immediately block on that dependency.
  • Autoscaling, which reacts with a lag and may add capacity only after overload has already degraded the fleet.

You still want those tools. Admission control is the missing piece that ties inbound concurrency to local resources (CPU, memory, file descriptors, thread pools inside native addons) so that overload becomes bounded queueing or fast, honest rejection instead of unbounded latency.

Three levers: concurrency, queue depth, and shed policy

Concurrency caps

A hard cap on in-flight requests (per process or per worker) is the simplest effective guardrail. Semantics:

  • When active < max, accept and run.
  • When active == max, either wait in a queue (up to a limit) or reject with 503 Service Unavailable (optionally with Retry-After).

Why a cap works: the process now has a predictable upper bound on concurrent expensive operations—database pool usage, parallel fetch fan-out, CPU-bound JSON transforms—so tail latency stops growing with offered load beyond the cap.
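To pick the cap, Little's law gives a first pass: steady-state concurrency ≈ arrival rate × service time. With illustrative numbers, 400 requests/s at a 60 ms mean service time occupy about 24 slots, so a cap of 32 bounds the worst case while leaving headroom for variance.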

Queue depth and wait budgets

A short FIFO queue smooths bursts without dropping work: when arrivals briefly exceed max, the excess waits a few milliseconds for a slot. Two risks:

  1. Queue time eats the SLA. If your p99 latency budget is 300 ms and clients wait 250 ms in your queue before execution starts, you have almost no room left for real work. Queue depth must be paired with a maximum wait; exceed it and treat as overload (reject or shed).
  2. FIFO + retries amplifies storms. Retry storms are not theoretical: if you answer 503 and clients retry without jittered backoff, they can align and hit you in phase. A queue helps only when combined with Retry-After discipline and idempotent, jittered retries on the client side (a client-side sketch follows this list).
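A minimal sketch of that client-side discipline: jittered exponential backoff that honors Retry-After. The function name, attempt count, and base delay are illustrative, not a prescribed client.

async function fetchWithBackoff(url: string, maxAttempts = 4, baseDelayMs = 100): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status !== 503 || attempt === maxAttempts - 1) return res;

    // Honor the server's Retry-After hint when present, else back off exponentially.
    const retryAfterSec = Number(res.headers.get("retry-after"));
    const capMs = retryAfterSec > 0 ? retryAfterSec * 1000 : baseDelayMs * 2 ** attempt;

    // Full jitter: a uniform delay in [0, capMs) keeps clients from retrying in phase.
    await new Promise((r) => setTimeout(r, Math.random() * capMs));
  }
}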

Shed policy: who gets rejected?

When you must reject:

  • Prefer 503 with a clear body (problem details if you use RFC 9457) and Retry-After when you can estimate recovery (even roughly).
  • Avoid 502 from your app for pure overload; intermediaries and operators read that as “bad gateway,” which confuses triage.
  • Consider degrading non-critical routes first (feature-flagged “heavy” reports) while keeping health and OAuth metadata paths admitted—a form of priority admission that requires explicit routing tiers, not a single global mutex (a sketch follows this list).
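A sketch of that routing-tier idea, reusing the createAdmissionMiddleware limiter from the practical example later in this article; the tier names, paths, and numbers are illustrative:

const tiers = {
  // Keep health and OAuth metadata admitted even when heavy work is shed.
  critical: createAdmissionMiddleware({ maxConcurrent: 16, maxQueue: 32, maxWaitMs: 200 }),
  standard: createAdmissionMiddleware({ maxConcurrent: 24, maxQueue: 32, maxWaitMs: 150 }),
  // Heavy reports fail fast: no queue, immediate 503 when saturated.
  heavy: createAdmissionMiddleware({ maxConcurrent: 4, maxQueue: 0, maxWaitMs: 0 }),
};

function limiterFor(path: string) {
  if (path.startsWith("/reports")) return tiers.heavy;
  if (path.startsWith("/healthz") || path.startsWith("/.well-known")) return tiers.critical;
  return tiers.standard;
}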

In-process vs distributed admission

In-process limiter

Pros: simple, no extra infrastructure, low latency. Cons: each instance has its own view; under Kubernetes, N pods each admit M concurrent requests, so aggregate concurrency is N × M. That is fine if downstream (connection pool, DB) is sized for that aggregate. It is wrong if you assumed M globally.
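Concretely, with illustrative numbers: 12 pods each capped at 32 in-flight requests allow up to 384 concurrent executions. If each request can hold one database connection and the database accepts 240, a spike can exhaust the server even though every pod is within its local limit. Size per-pod caps from the aggregate budget, not the other way around.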

Distributed limiter (Redis, etcd, sidecar)

Pros: global view, useful for strict shared budgets (e.g., third-party API with a hard global QPS). Cons: added latency, extra failure modes, and correctness caveats. A lease-based token bucket is eventually consistent, and you still need local caps so one slow Redis does not block the event loop forever.

Practical split I use on production systems: local semaphore for “do not melt this pod” plus optional global budget for a scarce external dependency. The global piece can be coarse (token every X ms) while the local piece enforces fine-grained safety.
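A minimal sketch of that coarse global piece, assuming ioredis and a fixed-window counter; the key scheme, the 25 ms deadline, and failing open are illustrative choices, not the only correct ones:

import Redis from "ioredis";

const redis = new Redis(); // assumes a reachable Redis; configure for your deployment

// Returns true when the shared budget has room. Fails OPEN on error or
// slowness, so a sick Redis degrades to local-caps-only rather than a stall.
async function tryGlobalBudget(key: string, limit: number, windowMs: number): Promise<boolean> {
  const failOpen = new Promise<boolean>((r) => setTimeout(r, 25, true));
  const attempt = (async () => {
    const bucket = `${key}:${Math.floor(Date.now() / windowMs)}`;
    const count = await redis.incr(bucket);
    if (count === 1) await redis.pexpire(bucket, windowMs * 2); // first writer sets expiry
    return count <= limit;
  })().catch(() => true); // Redis error: fall back to local caps
  return Promise.race([attempt, failOpen]);
}

Call it before dispatching to the scarce dependency (for example, tryGlobalBudget("vendor-api", 100, 1000)); a false means shed or queue locally.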

Coordinating with Kubernetes readiness

Admission control should connect to readiness:

  • When local queue depth or event-loop lag crosses a threshold, fail readiness so the Service stops sending new connections to that instance while existing ones drain (within your grace period).
  • That is not a substitute for rejecting excess work at the socket boundary; it reduces new load during partial failure. Combine with preStop hooks and graceful shutdown patterns you already use.
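A sketch of the lag half of that readiness signal, using Node's built-in monitorEventLoopDelay; the separate port, the p99 statistic, and the 100 ms threshold are assumptions to tune (queue depth would come from your limiter):

import { createServer } from "node:http";
import { monitorEventLoopDelay } from "node:perf_hooks";

const lag = monitorEventLoopDelay({ resolution: 20 });
lag.enable();

// Readiness endpoint: when it answers 503, Kubernetes stops routing new
// connections to this pod while in-flight requests drain.
createServer((req, res) => {
  if (req.url === "/readyz") {
    const p99Ms = lag.percentile(99) / 1e6; // histogram reports nanoseconds
    const ready = p99Ms < 100;
    res.statusCode = ready ? 200 : 503;
    res.end(ready ? "ready" : "saturated");
    if (!ready) lag.reset(); // start a fresh window after reporting saturation
    return;
  }
  res.statusCode = 404;
  res.end();
}).listen(9090);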

Practical example: bounded concurrency with optional queue

Below is a minimal per-process admission controller for a Node.js HTTP server. It tracks active handlers, allows a small wait queue with a wait budget, and returns 503 with Retry-After: 1 when overloaded. It is illustrative—not a drop-in for every framework, but the structure ports to Fastify hooks, Express middleware, or http2 servers.

import http from "node:http";

type AdmissionOptions = {
  maxConcurrent: number;
  maxQueue: number;
  maxWaitMs: number;
};

export function createAdmissionMiddleware(opts: AdmissionOptions) {
  let active = 0;
  // Waiters are parked as thunks; calling one starts its queued request.
  const queue: Array<() => void> = [];

  const tryDrainQueue = () => {
    while (active < opts.maxConcurrent && queue.length > 0) {
      const next = queue.shift();
      next?.();
    }
  };

  return function admit<T>(run: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      const start = () => {
        active += 1;
        // Promise.resolve().then(run) guards against run() throwing
        // synchronously, which would otherwise leak the slot.
        Promise.resolve()
          .then(run)
          .then(resolve, reject)
          .finally(() => {
            active -= 1;
            tryDrainQueue();
          });
      };

      // Fast path: a slot is free, start immediately.
      if (active < opts.maxConcurrent) {
        start();
        return;
      }

      // Queue full: shed instead of buffering unbounded work.
      if (queue.length >= opts.maxQueue) {
        reject(new Error("OVERLOADED"));
        return;
      }

      // Wait budget: a queued request that cannot start within maxWaitMs is
      // removed and rejected, so it never runs after its deadline has passed.
      // (wrappedStart is declared below; the timer only fires after this
      // synchronous block completes, so the reference is safe.)
      const timer = setTimeout(() => {
        const i = queue.indexOf(wrappedStart);
        if (i !== -1) queue.splice(i, 1);
        reject(new Error("QUEUE_TIMEOUT"));
      }, opts.maxWaitMs);

      const wrappedStart = () => {
        clearTimeout(timer);
        start();
      };

      queue.push(wrappedStart);
    });
  };
}

const admit = createAdmissionMiddleware({
  maxConcurrent: 32,
  maxQueue: 64,
  maxWaitMs: 200,
});

const server = http.createServer((req, res) => {
  void admit(async () => {
    // ... real handler: parse, auth, DB, upstream fetch with deadlines ...
    res.statusCode = 200;
    res.end("ok");
  }).catch((err) => {
    if (err instanceof Error && (err.message === "OVERLOADED" || err.message === "QUEUE_TIMEOUT")) {
      res.statusCode = 503;
      res.setHeader("Retry-After", "1");
      res.setHeader("Content-Type", "application/json");
      res.end(JSON.stringify({ title: "Overloaded", status: 503 }));
      return;
    }
    res.statusCode = 500;
    res.end();
  });
});

server.listen(8080);

Why this shape: active bounds parallel execution; the queue absorbs short bursts; maxWaitMs prevents requests from sitting until they are guaranteed to miss upstream deadlines. In a real service you would attach per-route limits (cheap GET vs expensive POST), propagate cancellation from req.close / req.signal, and emit metrics (admission_rejected_total, admission_queue_wait_seconds).
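The cancellation piece can be sketched like this, reusing admit and http from above and tying an AbortController to the connection. The upstream URL is a placeholder; note that on recent Node versions "close" also fires after normal completion, where abort() is a harmless no-op.

const cancellingServer = http.createServer((req, res) => {
  // One controller per request: abort when the client goes away so in-flight
  // work stops burning the slot's resources on an answer nobody will read.
  const controller = new AbortController();
  req.on("close", () => controller.abort());

  void admit(async () => {
    // Pass the signal to every async boundary that supports it.
    const upstream = await fetch("https://upstream.example/data", {
      signal: controller.signal,
    });
    res.statusCode = upstream.status;
    res.end(await upstream.text());
  }).catch(() => {
    if (!res.headersSent) {
      res.statusCode = 503;
      res.end();
    }
  });
});

cancellingServer.listen(8081);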

Common mistakes and pitfalls

  1. Only autoscaling, no admission. Scaling adds pods after lag spikes; without per-pod caps, each new pod immediately accepts unbounded work and the DB sees a larger aggregate fan-out.

  2. Queue without wait budget. Unbounded or long queues turn overload into timeout cascades: your server finally answers after the client or gateway already closed the connection, wasting CPU for a result nobody reads.

  3. Global Redis semaphore blocking the event loop. If every request awaits a distributed lock acquisition, Redis latency becomes p999 of every route. Prefer short non-blocking tries, local fallback, or offload to a worker pool.

  4. 503 without Retry-After. Clients default to aggressive retry; a one-second hint materially reduces synchronized retry peaks.

  5. Identical caps for all routes. A health check and a CSV export should not share one flat semaphore if export starves interactive traffic—use weighted or separate pools.

  6. Ignoring cooperative cancellation. Admission frees a slot when the handler finishes; if half your “finished” handlers leaked background work, you still melt. Pair admission with deadlines and cancellation on all async boundaries.

Conclusion

Admission control makes overload predictable: you choose between a small amount of queueing and fast rejection, instead of letting the runtime accept every connection until tail latency destroys the product. The implementation can stay in-process for most Node.js APIs; the hard part is choosing numbers that match pool sizes and downstream SLAs, and wiring readiness so the platform stops routing new traffic to instances that are already saturated.

The payoff is a service that fails loudly and cheaply under spike—something you can alert on and tune—rather than one that limps with multi-second p99 until something opaque falls over. If you are designing these boundaries for a new platform or untangling overload behavior in an existing API surface, getting admission, timeouts, and backpressure aligned is a core part of building scalable, production-ready backends.
