Load shedding for HTTP APIs: prioritization, degradation tiers, and backpressure
Autoscaling lags behind spikes. Load shedding rejects low-priority work so critical APIs stay fast. This article covers the signals to trust, priority buckets, a Node.js admission-control sketch, HTTP semantics, and common pitfalls.
Black Friday lands, or a partner misconfigures a batch job, or a cache layer evaporates—and your API’s p99 latency climbs from tens of milliseconds to multiple seconds. Autoscaling adds instances minutes later; connection pools exhaust first; thread pools or event-loop lag make every request pay for the overload. Load shedding is the deliberate decision to stop accepting some work so the system can still complete what matters most. It is not pessimism; it is how large-scale services stay predictable when demand exceeds provisioned capacity.
This article walks through why shedding belongs in the application layer, how it differs from rate limiting, which signals to trust, and how to express priority without turning your HTTP stack into an unmaintainable special case. The patterns come up repeatedly in freelance and consulting engagements where teams already have circuit breakers and retries but still see retry storms and cascading saturation under spikes.
Why “just add replicas” is incomplete
Horizontal scaling helps average load. It does not instantly fix:
- Cold starts and placement lag on container platforms.
- Downstream bottlenecks (a single legacy database, a partner API with a hard quota).
- Thundering herds after deploys or cache failures—every client retries at once.
- Head-of-line blocking when one slow dependency ties up workers that could serve fast reads.
Shedding answers a different question than autoscaling: given finite capacity right now, which requests should we refuse or simplify? That is a product and reliability decision, not only an infrastructure knob.
Vocabulary: shedding, limiting, and throttling
These terms blur in conversation; separating them keeps designs reviewable.
Rate limiting
Rate limiting caps how often a caller may invoke an endpoint, usually per identity or IP. It is fair-use enforcement and abuse control. A well-behaved client under its quota can still spike internally (many users behind one API key) or trigger expensive queries.
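For contrast with shedding, the usual rate-limiting primitive is a per-caller token bucket. A minimal in-process sketch, where capacity and refill rate are illustrative:

```typescript
/** Token bucket per caller identity: refills continuously, caps burst size. */
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number,
  ) {
    this.tokens = capacity;
  }

  tryRemove(): boolean {
    const now = Date.now();
    // Refill proportionally to elapsed time, never above capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSecond,
    );
    this.lastRefill = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```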
Throttling
Throttling slows work—smaller concurrency, longer delays between steps—so average throughput drops without hard failure. It preserves all work eventually if the client waits. It increases queueing, which is dangerous when queues are unbounded.
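A throttle often reduces to a concurrency semaphore around the work. A minimal sketch; note that waiters queue without bound here, which is exactly the hazard just mentioned:

```typescript
/** Simple async semaphore: limits concurrency, queues the rest. */
class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    // Waiters queue here; without a cap on queue length this hides overload.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the permit directly to a waiter
    else this.available += 1;
  }
}
```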
Load shedding
Load shedding drops or short-circuits work when the system is under stress. Clients receive 503 Service Unavailable, 429 Too Many Requests with Retry-After, or a degraded response (cached snapshot, partial payload). The goal is to fail fast for low-priority traffic so high-priority traffic retains CPU, sockets, and pool slots.
You typically need all three: limits for fairness, throttles for smooth background pipelines, and shedding for survival during incidents.
Signals: what to measure before you shed
Shedding on the wrong signal creates flapping (accept, reject, accept) or silent starvation. Good controllers combine local and dependency signals.
Local signals
- Event-loop lag (Node.js): sustained lag often predicts imminent timeouts better than CPU alone on I/O-heavy services (a measurement sketch follows this list).
- Process memory and GC pressure: when approaching limits, shed before the OOM killer does it at random.
- In-flight request count per instance: a simple proxy for “how full am I.”
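For the event-loop signal, Node's built-in perf_hooks histogram is sufficient. A minimal sketch; the 50 ms threshold and per-decision reset are illustrative assumptions:

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Samples event-loop delay every `resolution` ms into a histogram.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

/** Overload signal: p99 event-loop delay above ~50 ms (histogram values are nanoseconds). */
function eventLoopOverloaded(thresholdMs = 50): boolean {
  const p99Ms = histogram.percentile(99) / 1e6;
  histogram.reset(); // look at fresh data each decision window
  return p99Ms > thresholdMs;
}
```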
Dependency signals
- Pool utilization (HTTP agents, database pools, Redis): when wait time for a connection grows, new requests will only queue behind doomed work.
- Upstream health flags from circuit breakers: if payments are already failing open, do not spend slots on new checkout mutations.
Golden rule
Prefer signals that reflect time to complete useful work, not raw QPS. Two hundred cheap cache hits are not equivalent to twenty heavy report exports.
Priority classes and degradation tiers
Not all HTTP traffic is equal. A practical approach:
- Classify requests at the edge (middleware, API gateway, or service mesh) using route patterns, headers (`X-Request-Priority`, `Accept` profile), or authenticated tenant tier (free vs paid vs internal).
- Reserve capacity for higher classes—either explicit concurrency caps per class or a weighted random early detection (WRED) style probability of admission under stress.
- Degrade instead of hard-fail when possible: serve stale cache, skip optional enrichments, or return 202 Accepted with a job id for heavy exports.
Example classification table
| Class | Typical routes | Under stress |
|---|---|---|
| Critical | Auth, payment capture, health checks used by orchestrators | Never shed; may still time out if dependencies fail |
| Interactive | Product pages, search with freshness SLAs | Shed lowest tiers first; serve stale where policy allows |
| Background | Analytics ingestion, recommendation recomputation | Shed aggressively; clients should back off |
The exact rows depend on the product. What matters is that engineering and product agree in writing before an incident.
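A classification hook matching such a table can stay small. A sketch with illustrative routes; the `X-Request-Priority` hint should only be honored for authenticated internal callers:

```typescript
type Priority = "critical" | "interactive" | "background";

// Hypothetical route-and-header classifier matching the table above.
function classify(req: {
  path: string;
  header: (name: string) => string | undefined;
}): Priority {
  if (req.path.startsWith("/auth") || req.path.startsWith("/payments")) {
    return "critical";
  }
  if (req.path === "/healthz") return "critical"; // orchestrator probes
  // Trust the downgrade hint; external callers could lie about upgrades.
  if (req.header("X-Request-Priority") === "background") return "background";
  if (req.path.startsWith("/analytics")) return "background";
  return "interactive";
}
```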
Admission control: where to shed in the stack
- Reverse proxy / gateway: coarse, fast, limited context—good for global IP floods, geographic rules.
- Application middleware: full auth context, route cost hints, custom headers—best for business-aware shedding.
- Worker pools inside handlers: fine-grained shedding for CPU-bound stages (image transcoding) after cheap validation.
Putting shedding only at the edge misses authenticated per-user fairness; putting it only deep in handlers burns sockets and TLS handshakes first. A layered model usually wins.
Local versus coordinated admission
Per-instance counters (like the example below) are simple and fast. They fail when traffic is unevenly routed: one hot shard sheds while others idle. Mitigations:
- Sticky routing with health-aware balancers reduces variance but does not erase it.
- Gossip or central token buckets (Redis, etcd) add latency and a new failure mode—use them when global fairness matters more than microsecond admission cost.
- Kubernetes readiness: temporarily remove overloaded pods from Service endpoints so new connections steer away; combine with in-process shedding because existing connections may remain pinned.
Treat coordination as a capacity allocation problem: approximate global limits with periodic sync if strict accuracy is unnecessary.
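A sketch of that periodic-sync idea, assuming ioredis; the key scheme and batch size are illustrative:

```typescript
import Redis from "ioredis";

const redis = new Redis(); // assumes a reachable Redis instance

/**
 * Approximate global limiter: reserve `batch` admission tokens at a time
 * from a shared per-window counter, then spend them locally without I/O.
 */
class ApproximateGlobalLimiter {
  private localTokens = 0;

  constructor(
    private readonly globalLimit: number,
    private readonly batch: number,
    private readonly windowSeconds: number,
  ) {}

  async tryAcquire(): Promise<boolean> {
    if (this.localTokens > 0) {
      this.localTokens -= 1;
      return true;
    }
    const window = Math.floor(Date.now() / (this.windowSeconds * 1000));
    const key = `admissions:${window}`; // hypothetical key scheme
    const total = await redis.incrby(key, this.batch);
    await redis.expire(key, this.windowSeconds * 2);
    if (total > this.globalLimit) return false; // window exhausted globally
    this.localTokens = this.batch - 1; // spend the rest without Redis calls
    return true;
  }
}
```

Reserved-but-unused tokens at the end of a window are wasted, so this over-rejects slightly near the limit; that is the accuracy traded for one Redis round-trip per batch instead of per request.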
Retry behavior and the feedback loop
Shedding shifts failure to the client. If clients interpret 503 as “retry immediately,” load increases. Production systems should document:
- Minimum backoff and full jitter between retries on 503/429.
- Max retry budget per user action so a UI does not hammer the API indefinitely.
- An `Idempotency-Key` on POST so the server can deduplicate safely when retries eventually succeed.
From the server side, correlate shed events with client User-Agent or SDK version spikes; a misbehaving SDK can look like a DDoS.
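On the client side, a minimal fetch wrapper with full jitter and a retry budget might look like this; the budget and base delay are illustrative:

```typescript
/** Full-jitter backoff: a random delay in [0, min(cap, base * 2^attempt)). */
function fullJitterDelayMs(attempt: number, baseMs = 100, capMs = 5_000): number {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
}

async function fetchWithRetry(
  url: string,
  init: RequestInit,
  maxAttempts = 4, // retry budget per user action
): Promise<Response> {
  for (let attempt = 0; ; attempt += 1) {
    const res = await fetch(url, init);
    // Only retry on explicit shed/limit signals, and only within budget.
    if ((res.status !== 503 && res.status !== 429) || attempt + 1 >= maxAttempts) {
      return res;
    }
    // Honor Retry-After when present; otherwise fall back to jittered backoff.
    const retryAfterMs = Number(res.headers.get("Retry-After")) * 1000;
    const delay = Math.max(retryAfterMs || 0, fullJitterDelayMs(attempt));
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

Pair this with an `Idempotency-Key` header on mutating requests so a retry that eventually succeeds cannot double-apply.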
Practical example: Node.js admission gate with priority buckets
The following TypeScript sketch shows a single-process admission controller. In production you would export metrics, tie isOverloaded to real signals, and coordinate optional global state via Redis—this version focuses on structure.
type Priority = "critical" | "interactive" | "background";
type SheddingConfig = {
maxInFlight: number;
/** Max concurrent per priority when healthy */
softLimits: Record<Priority, number>;
/** When overloaded, hard caps (critical usually unchanged) */
overloadCaps: Record<Priority, number>;
};
const defaultConfig: SheddingConfig = {
maxInFlight: 200,
softLimits: { critical: 80, interactive: 100, background: 120 },
overloadCaps: { critical: 80, interactive: 40, background: 10 },
};
class AdmissionController {
private inFlight = 0;
private byPriority: Record<Priority, number> = {
critical: 0,
interactive: 0,
background: 0,
};
constructor(
private readonly config: SheddingConfig,
private readonly isOverloaded: () => boolean,
) {}
tryAcquire(priority: Priority): boolean {
if (this.inFlight >= this.config.maxInFlight) return false;
const caps = this.isOverloaded()
? this.config.overloadCaps
: this.config.softLimits;
if (this.byPriority[priority] >= caps[priority]) return false;
this.inFlight += 1;
this.byPriority[priority] += 1;
return true;
}
release(priority: Priority): void {
this.inFlight = Math.max(0, this.inFlight - 1);
this.byPriority[priority] = Math.max(0, this.byPriority[priority] - 1);
}
}
/** Express-style middleware sketch */
function createSheddingMiddleware(
ctrl: AdmissionController,
classify: (req: { path: string; header: (n: string) => string | undefined }) => Priority,
) {
return (req: { path: string; header: (n: string) => string | undefined }, res: { status: (n: number) => typeof res; setHeader: (k: string, v: string) => void; end: () => void }, next: () => void) => {
const p = classify(req);
if (!ctrl.tryAcquire(p)) {
res.setHeader("Retry-After", "2");
return res.status(503).end();
}
res.on("finish", () => ctrl.release(p));
next();
};
}
Hook isOverloaded to event-loop delay, DB pool wait histograms, or a boolean from a circuit breaker. Keep the critical class narrow—teams that mark everything critical shed nothing useful.
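A usage sketch wiring the pieces together, reusing the event-loop histogram idea from earlier; the 50 ms threshold and route mapping are illustrative:

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

const controller = new AdmissionController(
  defaultConfig,
  () => loopDelay.percentile(99) / 1e6 > 50, // histogram values are nanoseconds
);

const shed = createSheddingMiddleware(controller, (req) =>
  req.path.startsWith("/payments") ? "critical" : "interactive",
);
// app.use(shed); // e.g., early in an Express middleware chain
```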
HTTP semantics for clients
When you shed:
- Prefer 503 with `Retry-After` for transient overload; well-behaved clients back off exponentially.
- Use 429 when shedding doubles as a fairness policy; document whether it counts against API quotas.
- Include a stable `X-Error-Code` or a problem-details body (RFC 9457) so SDKs can branch without parsing English text.
If you return degraded JSON, version the shape so clients can detect partial success explicitly.
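For reference, a shed response carrying a problem-details body could be assembled like this; the type URI and error text are illustrative:

```typescript
type ProblemDetails = {
  type: string;
  title: string;
  status: number;
  detail?: string;
};

/** Illustrative RFC 9457 problem-details payload for a shed request. */
function shedResponse(): { headers: Record<string, string>; body: string } {
  const problem: ProblemDetails = {
    type: "https://example.com/problems/overloaded", // hypothetical URI
    title: "Service temporarily overloaded",
    status: 503,
    detail: "Background-tier requests are being shed; retry with backoff.",
  };
  return {
    headers: {
      "Content-Type": "application/problem+json",
      "Retry-After": "2",
    },
    body: JSON.stringify(problem),
  };
}
```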
Common mistakes and pitfalls
- Shedding after expensive work already started—authentication, body parsing, file upload buffering. Move cheap classification and admission as early as the protocol allows.
- Unbounded queues “to avoid dropping requests.” Queues hide overload until latency explodes; cap queue depth and shed when full (a sketch follows this list).
- Uniform random shedding without priority—high-value tenants and health probes fail alongside scrapers. Always combine with identity or route awareness.
- Ignoring retry amplification: if every 503 causes an immediate blind retry, you amplify load. Pair shedding with jittered backoff documentation and idempotency on mutating routes.
- Missing observability: without counters for shed requests per route and priority, incident reviews devolve into guesswork. Emit explicit metrics, not only logs.
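As a sketch of the queue-capping fix from the list above, with an illustrative depth:

```typescript
/** Bounded FIFO: refuses work instead of growing without bound. */
class BoundedQueue<T> {
  private readonly items: T[] = [];

  constructor(private readonly maxDepth: number) {}

  /** Returns false when full; the caller sheds with 503/429 instead of queueing. */
  offer(item: T): boolean {
    if (this.items.length >= this.maxDepth) return false;
    this.items.push(item);
    return true;
  }

  poll(): T | undefined {
    return this.items.shift();
  }
}
```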
Conclusion
Load shedding is how you choose failure modes instead of inheriting them from the kernel or cloud provider. Combine clear priority classes, early admission control, and honest HTTP signaling so clients and operators know what happened. Autoscaling still matters—but under stress, what you refuse to do defines whether the core product stays usable.
If you are designing APIs or platform primitives meant to stay calm under spikes, it pays to model shedding alongside timeouts, circuit breakers, and capacity tests—not as a last-minute incident knob. For more on how this site frames engineering background, see About; for collaboration or inquiries, Contact.