API rate limiting: token buckets, sliding windows, and distributed fairness
How to choose algorithms for HTTP APIs, implement limits that survive restarts and multiple nodes, and communicate limits clearly to clients—without turning throttling into guesswork.
A public API starts taking traffic from a new partner integration. Within hours, a misconfigured client opens thousands of parallel connections and sends bursty traffic every few seconds. Your database connection pool saturates, legitimate users see timeouts, and on-call is paging you while you try to distinguish abuse from a bug. Rate limiting is how you protect capacity, cost, and fairness—but “add Redis and cap requests” undersells the design space. In consulting and product work, the useful question is not only how many requests per second, but which algorithm, where in the stack it runs, and what clients should do when they hit the wall.
This article walks through common algorithms, their trade-offs, a concrete implementation sketch, and mistakes that show up when limits move from a single process to a fleet of API nodes.
Why rate limits are a product and architecture concern
Rate limits serve several goals at once:
- Protect upstream resources—databases, third-party APIs, CPU-bound work—so one tenant or client cannot exhaust shared pools.
- Bound cost—especially for metered LLM or payment APIs where every excess call has a dollar sign.
- Enable fair sharing—so noisy neighbors do not starve others on multi-tenant infrastructure.
They are not a substitute for authentication or input validation, but they complement timeouts, circuit breakers, and bulkheads (see the About page for how these themes fit together in production-minded engineering). Limits fail badly when they are opaque: clients retry blindly, amplify load, and make incidents worse. Good limits are predictable, documented, and paired with clear HTTP signals.
Core algorithms and when to use them
Fixed window
Partition time into buckets (for example one minute). Count requests per client per bucket; reset the count when the window rolls.
Pros: Simple to implement in memory or with a single counter per key in a store. Easy to explain in docs (“1000 requests per minute”).
Cons: Burst at boundaries. A client can send 1000 requests at 00:59 and another 1000 at 01:00—2000 in two seconds while staying “within” 1000/minute. For strict enforcement, fixed windows alone are often insufficient.
Sliding window log
Store timestamps of each accepted request (or a sample) for a client and drop requests when the log for the rolling interval exceeds the cap.
Pros: Smooth enforcement without boundary spikes.
Cons: Memory and write amplification—high-traffic keys need many timestamp entries or clever compaction. Often implemented approximately for scale.
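A single-process sketch makes the log mechanics concrete. The class and parameter names here are illustrative, and the clock is passed in explicitly (milliseconds) so behavior is deterministic and testable:

```typescript
// Sliding-window-log limiter: one timestamp per accepted request, pruned
// as entries age out of the rolling interval. Exact, but memory grows
// with the per-key request rate.
class SlidingWindowLog {
  private log: number[] = [];

  constructor(
    private readonly maxRequests: number,
    private readonly windowMs: number,
  ) {}

  allow(nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that fell out of the rolling window.
    while (this.log.length > 0 && this.log[0] <= cutoff) this.log.shift();
    if (this.log.length >= this.maxRequests) return false;
    this.log.push(nowMs);
    return true;
  }
}
```

Because the window slides continuously, a burst at a boundary cannot double the effective rate the way it can with a fixed window.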
Sliding window counter
Approximate a sliding window by combining the current fixed-window count with a weighted fraction of the previous window based on how far you are into the current window. Reduces storage versus a full log while smoothing bursts better than a naive fixed window.
Pros: Good balance for many HTTP APIs at scale.
Cons: Still an approximation; edge cases need testing against your exact formula.
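One common version of the formula can be sketched in-process as follows; the weighting shown here is an assumption about your exact implementation, which is why the edge cases deserve tests:

```typescript
// Sliding-window-counter approximation: weight the previous fixed window's
// count by how much of it still overlaps the rolling interval, then add the
// current window's count.
class SlidingWindowCounter {
  private currentWindow = -1;
  private currentCount = 0;
  private previousCount = 0;

  constructor(
    private readonly maxRequests: number,
    private readonly windowMs: number,
  ) {}

  allow(nowMs: number): boolean {
    const window = Math.floor(nowMs / this.windowMs);
    if (window !== this.currentWindow) {
      // Carry the count forward only if it was the immediately previous window.
      this.previousCount = window === this.currentWindow + 1 ? this.currentCount : 0;
      this.currentWindow = window;
      this.currentCount = 0;
    }
    const elapsedFraction = (nowMs % this.windowMs) / this.windowMs;
    const weighted = this.previousCount * (1 - elapsedFraction) + this.currentCount;
    if (weighted >= this.maxRequests) return false;
    this.currentCount++;
    return true;
  }
}
```

Storage per key drops from one entry per request to two counters, which is why this variant scales well for high-traffic HTTP APIs.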
Token bucket
A bucket holds tokens refilled at a steady rate (for example 10 tokens per second, cap 100). Each request consumes one token; if the bucket is empty, reject or delay.
Pros: Allows bursts up to bucket capacity while enforcing a sustainable average—matches many product expectations (“burst then steady”).
Cons: Burst allowance must be tuned so a single client cannot still overwhelm a fragile dependency; document refill rate and burst size separately.
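A minimal in-process sketch, using the parameters from the example above (refill 10 tokens per second, capacity 100). The class name is illustrative and the clock is injected so refill math is deterministic:

```typescript
// Token bucket: capacity bounds the burst, refill rate bounds the
// sustainable average. Tokens refill continuously based on elapsed time.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private readonly ratePerSec: number,
    private readonly capacity: number,
    nowMs: number,
  ) {
    this.tokens = capacity; // start full, so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if a token was available and consumed.
  allow(nowMs: number): boolean {
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Note that capacity and refill rate are independent knobs: document them separately, because "100 burst, 10/s sustained" is a very different promise from "100/s".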
Leaky bucket (smoothing)
Requests enter a queue and leak out at a fixed rate; excess is dropped or delayed.
Pros: Very smooth output rate to downstream systems.
Cons: Queuing introduces latency and memory use; less common for edge HTTP APIs than token bucket for typical REST use cases.
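The queueing variant described above needs a worker draining the queue; the simpler "leaky bucket as a meter" variant sketched here rejects instead of delaying, which makes the drain-rate arithmetic visible without the queue machinery (names illustrative, clock injected):

```typescript
// Leaky bucket as a meter: track a fill level that drains at a fixed
// rate; reject when adding a request would overflow the bucket.
// A delaying variant would enqueue the request instead of rejecting.
class LeakyBucket {
  private level = 0;
  private lastDrainMs: number;

  constructor(
    private readonly leakPerSec: number,
    private readonly capacity: number,
    nowMs: number,
  ) {
    this.lastDrainMs = nowMs;
  }

  allow(nowMs: number): boolean {
    const elapsedSec = (nowMs - this.lastDrainMs) / 1000;
    this.level = Math.max(0, this.level - elapsedSec * this.leakPerSec);
    this.lastDrainMs = nowMs;
    if (this.level + 1 > this.capacity) return false;
    this.level += 1;
    return true;
  }
}
```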
Practical takeaway: Public REST APIs often combine token bucket or sliding window semantics at the edge with per-route stricter limits on expensive endpoints. Internal service meshes may use simpler fixed windows per route if traffic is trusted.
Distributed enforcement: one box versus many
In-process counters are fast but wrong as soon as you run multiple instances: each node has its own count, so global limits are multiplied by replica count unless you centralize state.
Common approaches:
- **Central store (Redis, DynamoDB, etc.).** Atomic increments (INCR), sorted sets for sliding windows, or Lua scripts for token bucket math. Latency adds to every request; use local soft limits plus shared hard limits if you need both speed and accuracy.
- **Sticky sessions.** Route a client always to the same node so local counters approximate global behavior. Fragile when nodes restart or autoscale; still not exact.
- **Edge / API gateway.** Enforce at CDN or gateway layer for coarse global limits; finer per-tenant limits still often live in the app or a dedicated policy service.
- **Approximate coordination.** For extremely high scale, probabilistic structures or gossip-style sync trade perfect accuracy for overhead; only worth it when exact counts are too expensive.
Trade-off: Strong consistency on every request costs round-trips. Many teams accept small overshoot (especially with TTL-based keys) in exchange for predictable latency, and rely on downstream bulkheads for hard safety.
HTTP semantics: status codes and headers
Clients and SDKs integrate better when behavior matches common conventions:
- `429 Too Many Requests` when rejecting due to a rate limit (some APIs use `503` with `Retry-After` for overload—be consistent and document it).
- `Retry-After`—seconds until the client should retry, when you can compute it (token refill time, window boundary).
- Rate limit headers (de facto standard inspired by GitHub and others): `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset` (or `RateLimit-*` from newer drafts).
Include policy identity if you have tiers (free vs paid).
Returning consistent JSON error bodies with a machine-readable code helps automated clients back off without parsing prose.
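Putting the conventions together, a rejection response can be sketched like this. The header names follow the de facto `X-RateLimit-*` convention above; the error-code string and field names are illustrative, not a standard:

```typescript
// Shape of a 429 rejection: conventional headers plus a machine-readable
// JSON body so automated clients can back off without parsing prose.
interface LimitState {
  limit: number;         // requests allowed per window
  remaining: number;     // requests left in the current window
  resetEpochSec: number; // when the window resets (Unix seconds)
}

function rateLimitResponse(state: LimitState, nowEpochSec: number) {
  const retryAfterSec = Math.max(0, state.resetEpochSec - nowEpochSec);
  return {
    status: 429,
    headers: {
      "Retry-After": String(retryAfterSec),
      "X-RateLimit-Limit": String(state.limit),
      "X-RateLimit-Remaining": String(state.remaining),
      "X-RateLimit-Reset": String(state.resetEpochSec),
    },
    body: {
      error: {
        code: "rate_limited",
        message: `Rate limit exceeded. Retry after ${retryAfterSec} seconds.`,
      },
    },
  };
}
```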
Keys: who and what you throttle
A limit is always per key. Typical keys:
- Authenticated user or API key id
- IP address (for unauthenticated endpoints—mind NAT and mobile carriers)
- Tenant or organization id for B2B APIs
Combine dimensions carefully: tenant + route avoids one tenant exhausting global budget on a single cheap endpoint while starving others on expensive ones.
Practical example: Redis fixed-window counter with INCR and EXPIRE
Below is a simplified fixed-window pattern using Redis—easy to reason about and often good enough when combined with per-route limits. For production, you would add metrics, key naming conventions, and possibly Lua for atomic multi-step logic.
```typescript
// Fixed-window counter per client key, using ioredis.
import Redis from "ioredis";

const WINDOW_SECONDS = 60;
const MAX_REQUESTS = 1000;

async function allowRequest(redis: Redis, clientId: string): Promise<boolean> {
  const bucket = Math.floor(Date.now() / (WINDOW_SECONDS * 1000));
  const key = `ratelimit:${clientId}:${bucket}`;
  const count = await redis.incr(key);
  if (count === 1) {
    // TTL covers current + next bucket edge. Note that INCR and EXPIRE are
    // two calls: a crash between them leaves a key without a TTL. A Lua
    // script makes the pair atomic.
    await redis.expire(key, WINDOW_SECONDS * 2);
  }
  return count <= MAX_REQUESTS;
}
```
A token bucket in Redis often uses a Lua script to read current tokens, compute refill since last visit, decrement atomically, and set TTL on the key—avoiding race conditions between separate GET and SET calls.
For nested limits (for example 100/minute and 10,000/day), evaluate both in the same request path and fail closed on whichever trips first.
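An in-memory sketch of that nesting, reusing the fixed-window pattern from above with the example limits (100/minute, 10,000/day). Names are illustrative; note the deliberate short-circuit, so a request rejected by the minute limit never touches the day counter (though a request the minute limit allows is still charged there even if the day limit then rejects it):

```typescript
// Simple fixed-window counter keyed by window bucket.
class FixedWindow {
  private bucket = -1;
  private count = 0;

  constructor(private readonly max: number, private readonly windowMs: number) {}

  allow(nowMs: number): boolean {
    const bucket = Math.floor(nowMs / this.windowMs);
    if (bucket !== this.bucket) {
      this.bucket = bucket;
      this.count = 0;
    }
    if (this.count >= this.max) return false;
    this.count++;
    return true;
  }
}

// Nested limits: both windows must allow the request; whichever trips
// first causes the rejection.
class NestedLimiter {
  private readonly perMinute = new FixedWindow(100, 60_000);
  private readonly perDay = new FixedWindow(10_000, 86_400_000);

  allow(nowMs: number): boolean {
    return this.perMinute.allow(nowMs) && this.perDay.allow(nowMs);
  }
}
```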
Common mistakes and pitfalls
Undocumented limits. Clients implement aggressive retries; sudden 429 responses without Retry-After or reset times increase load during incidents.
Global limits only. A single cap for the whole API lets one heavy endpoint consume the entire budget. Prefer per-route weights or separate buckets for expensive operations.
Ignoring cost asymmetry. GET /health and POST /reports/export should not share one naive counter if the latter ties up workers for seconds.
Cold start and cache stampede. After deploys or Redis failover, missing keys can cause thundering herds. Jittered backoff on the client side and synchronization in the limiter implementation matter.
Testing only on one instance. Limits that pass locally fail in staging with multiple replicas unless integration tests cover the shared store path.
Confusing rate limiting with authentication. Throttling anonymous traffic by IP is useful but not a security boundary; combine with authz and abuse detection as needed.
Conclusion
Effective API rate limiting combines an algorithm suited to your traffic shape—often token bucket or sliding window—with distributed state when you run horizontally scaled services, and clear HTTP contracts so clients back off instead of amplifying load. The goal is not minimal code; it is predictable behavior under stress and fairness across tenants. Investing in explicit policies, observability on rejections, and documentation pays off the first time a partner ships a buggy client and your platform stays upright for everyone else.
If you are designing APIs or hardening an existing surface and want a second pair of eyes on limits, tiers, and failure modes, get in touch—especially when bridging product requirements with production-ready backend behavior.