Batch LLM inference APIs: queues, fairness, and multi-tenant guardrails

Design durable batch inference endpoints with admission control, per-tenant quotas, weighted scheduling, and observability so GPU-backed clusters stay stable when traffic spikes.

Author: Matheus Palma · May 21, 2026 · 8 min read

Software engineering, Artificial intelligence, Backend, API design, Node.js, Distributed systems

Your synchronous chat endpoint is healthy at 50 QPS, then product ships a bulk document summarization feature. Marketing runs a campaign; enterprise customers enqueue tens of thousands of PDFs overnight. Within an hour, GPUs are pegged, synchronous latency for paying interactive users collapses, and the finance team asks why inference spend tripled with no corresponding revenue lift. The failure is rarely “the model is slow.” It is almost always missing batch semantics: no durable queue, no fairness between tenants, and no admission control at the edge.

This article walks through how to treat batch LLM inference as a capacity-managed subsystem: job lifecycle, scheduling, quotas, and the operational signals you need before—not after—a traffic spike. The patterns reflect what shows up repeatedly when helping teams move from demo-grade wrappers to production-ready inference services.

Why batch is a different product than synchronous chat

Synchronous routes optimize for low tail latency on single requests. Batch workloads optimize for throughput, cost per token, and predictable completion windows (SLAs expressed as “95% of jobs finish within N hours,” not milliseconds).

That difference drives architecture:

Work is deferred: clients receive a job id immediately; results arrive via polling, webhooks, or object storage.
Batches amortize fixed costs: larger micro-batches improve GPU utilization but increase time-to-first-token for any one job inside the batch.
Failures are partial: one bad document must not poison an entire shard; you need per-item status, retries, and dead-letter handling.

If you pretend batch traffic is “just more HTTP POSTs” to the same path as chat, you inherit chat’s timeouts, connection pools, and lack of backpressure—and you lose control of the cluster.

Job model: what every enqueue record should carry

Treat each unit of work as an immutable job specification plus mutable execution state.

Immutable specification

Stable job id (UUID v7 or ULID works well): returned to the client; used in idempotent retries.
Tenant id and optional sub-tenant (workspace, project): drives quotas and isolation.
Model id and pinned revision (or a compat version you own): batch results must be reproducible for audits and support.
Input references, not inline megabyte payloads: object storage keys, signed upload ids, or content hashes. Keeps your queue rows small and lets you virus-scan or transcribe asynchronously.
Output contract: JSON schema id, max output tokens, temperature bucket—anything that changes the completion distribution belongs in the spec.

Mutable execution state

Status: queued, running, succeeded, failed, cancelled, dead_letter.
Attempt count, next_run_at for backoff, worker lease (who is running it, until when).
Progress for long jobs (pages processed, chunks completed)—cheap for UX, invaluable for ops.

The separation matters: you can hash the immutable spec for deduplication (“same summarization request submitted twice”) without conflating it with volatile worker metadata.
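A minimal sketch of that split, assuming Node.js and its built-in `node:crypto` module. The type and field names (`JobSpecV1`, `JobExecutionState`, `temperatureBucket`, and so on) are hypothetical illustrations of the lists above, not a fixed schema. Only the immutable spec feeds the deduplication hash, and keys are sorted first so that property order does not change the result.

```typescript
import { createHash } from "node:crypto";

// Immutable part: anything that changes the completion belongs here.
type JobSpecV1 = {
  tenantId: string;
  subTenantId?: string;
  modelId: string;
  modelRevision: string;
  inputObjectKey: string;
  outputSchemaId: string;
  maxOutputTokens: number;
  temperatureBucket: "low" | "medium" | "high";
};

type JobStatus =
  | "queued" | "running" | "succeeded" | "failed" | "cancelled" | "dead_letter";

// Mutable part: volatile worker metadata, never part of the hash.
type JobExecutionState = {
  status: JobStatus;
  attempt: number;
  nextRunAt: number;
  lease?: { workerId: string; expiresAt: number };
  progress?: { done: number; total: number };
};

function initialState(now: number): JobExecutionState {
  return { status: "queued", attempt: 0, nextRunAt: now };
}

// Serialize as sorted [key, value] pairs so property order is irrelevant.
function canonicalize(spec: JobSpecV1): string {
  const entries = Object.entries(spec)
    .filter(([, value]) => value !== undefined)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return JSON.stringify(entries);
}

// Deduplication key: identical specs hash identically, whoever runs them.
function specHash(spec: JobSpecV1): string {
  return createHash("sha256").update(canonicalize(spec)).digest("hex");
}
```

A unique index on `(tenantId, specHash)` is one way to turn a duplicate submission into a lookup of the existing job rather than a second GPU run.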

Queues and workers: durability before clever scheduling

At minimum, enqueue writes to a durable store (SQS, RabbitMQ, Postgres with SELECT ... FOR UPDATE SKIP LOCKED, Redis Streams with AOF persistence; pick what your team operates well). Workers lease jobs with a timeout, extend the lease on heartbeat, and ack only after outputs are persisted.

Why leasing matters: workers crash. Without a lease and visibility timeout, stuck jobs disappear; with naive requeueing, you double-run expensive inference. The usual compromise is: at-least-once delivery plus idempotent writes of outputs (object keys keyed by job_id, or conditional writes).

For GPU nodes, a common pattern is a thin HTTP control plane (enqueue, status) and fat consumers on GPU VMs that pull from the queue. Keep the API tier stateless; let the queue be the system of record for backlog depth.
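The lease, heartbeat, and ack cycle can be sketched with an in-memory stand-in for the durable store. `LeaseQueue` and its method names are hypothetical, and the caller passes in the clock so the behavior is easy to follow. A real deployment would rely on the equivalent primitives of whichever queue it operates.

```typescript
type Lease = { workerId: string; expiresAt: number };
type LeasedJob = { jobId: string; attempt: number };
type QueueEntry = { jobId: string; attempt: number; done: boolean; lease?: Lease };

class LeaseQueue {
  private readonly jobs = new Map<string, QueueEntry>();
  private readonly leaseMs: number;

  constructor(leaseMs: number) {
    this.leaseMs = leaseMs;
  }

  enqueue(jobId: string): void {
    if (!this.jobs.has(jobId)) {
      this.jobs.set(jobId, { jobId, attempt: 0, done: false });
    }
  }

  // Claim the first job that is unleased or whose lease has expired.
  lease(workerId: string, now: number): LeasedJob | undefined {
    for (const job of this.jobs.values()) {
      if (job.done) continue;
      if (job.lease && job.lease.expiresAt > now) continue;
      job.lease = { workerId, expiresAt: now + this.leaseMs };
      job.attempt += 1;
      return { jobId: job.jobId, attempt: job.attempt };
    }
    return undefined;
  }

  // True only if this worker still holds a live lease on the job.
  private owns(job: QueueEntry | undefined, workerId: string, now: number): job is QueueEntry {
    return (
      job !== undefined &&
      !job.done &&
      job.lease !== undefined &&
      job.lease.workerId === workerId &&
      job.lease.expiresAt > now
    );
  }

  // Extend the lease while the worker is still making progress.
  heartbeat(jobId: string, workerId: string, now: number): boolean {
    const job = this.jobs.get(jobId);
    if (!this.owns(job, workerId, now)) return false;
    job.lease = { workerId, expiresAt: now + this.leaseMs };
    return true;
  }

  // Call only after outputs are persisted; a stale owner is rejected.
  ack(jobId: string, workerId: string, now: number): boolean {
    const job = this.jobs.get(jobId);
    if (!this.owns(job, workerId, now)) return false;
    job.done = true;
    job.lease = undefined;
    return true;
  }
}
```

A worker that loses its lease may still finish and write outputs, which is why the output write itself has to be idempotent: the rejected `ack` only stops it from marking the job done twice.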

Fairness: weighted scheduling across tenants

Naive FIFO queues starve small tenants behind one large customer’s campaign. In consulting engagements, the first production incident after a batch launch is almost always “tenant A flooded the queue.”

Practical approaches, in increasing complexity:

Per-tenant concurrency caps

Limit how many jobs from tenant T may be running simultaneously. Simple to explain, easy to implement, but head-of-line blocking still hurts: one tenant can fill the queue with waiting jobs and inflate time-in-queue for others even if running slots are balanced.

Token- or cost-weighted fair queueing

Maintain virtual finish times per tenant based on estimated GPU-seconds or billable tokens. Dequeue the runnable job with the smallest virtual finish time. This approximates weighted fair queueing and aligns incentives: heavy tenants still progress, but not at everyone else’s expense.

You do not need a textbook-perfect WFQ implementation on day one. A workable stepping stone is round-robin across tenant buckets inside each priority tier, with a global cap on dequeue rate per tenant per minute.
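As a rough illustration of virtual finish times, here is a small self-clocked approximation rather than a full WFQ implementation. `FairScheduler`, `estCost`, and the weight table are hypothetical names; `estCost` stands for whatever estimate of GPU-seconds or billable tokens you trust. The linear scan in `dequeue` keeps the sketch short, and a heap would replace it at scale.

```typescript
type FairJob = { jobId: string; tenantId: string; estCost: number };
type FairEntry = { job: FairJob; finish: number; seq: number };

class FairScheduler {
  private virtualTime = 0;
  private seq = 0;
  private readonly lastFinish = new Map<string, number>();
  private readonly pending: FairEntry[] = [];
  private readonly weights: Map<string, number>;

  // weights: tenantId -> share; tenants not listed default to 1.
  constructor(weights: Record<string, number> = {}) {
    this.weights = new Map(Object.entries(weights));
  }

  enqueue(job: FairJob): void {
    const weight = this.weights.get(job.tenantId) ?? 1;
    // A tenant's next job starts where its previous one finished, or at the
    // current virtual time if the tenant has been idle.
    const start = Math.max(this.virtualTime, this.lastFinish.get(job.tenantId) ?? 0);
    const finish = start + job.estCost / weight;
    this.lastFinish.set(job.tenantId, finish);
    this.pending.push({ job, finish, seq: this.seq++ });
  }

  // Pick the job with the smallest virtual finish time; ties go to the
  // earliest arrival.
  dequeue(): FairJob | undefined {
    let best: FairEntry | undefined;
    for (const entry of this.pending) {
      if (
        best === undefined ||
        entry.finish < best.finish ||
        (entry.finish === best.finish && entry.seq < best.seq)
      ) {
        best = entry;
      }
    }
    if (best === undefined) return undefined;
    this.pending.splice(this.pending.indexOf(best), 1);
    this.virtualTime = Math.max(this.virtualTime, best.finish);
    return best.job;
  }
}
```

With equal weights, a tenant that enqueues three jobs ahead of another tenant's single job no longer runs all three first: the second tenant's job is served right after the first of them.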

Priority lanes (use sparingly)

Interactive or “premium SLA” traffic on a separate queue with reserved worker share avoids mixing batch best-effort work with revenue-critical paths. The pitfall is priority inversion if premium jobs depend on the same GPU pool without reserved capacity—document explicit capacity floors or accept that priorities are soft.

Admission control at the API edge

Queues solve durability; admission control solves overload. Before you accept a job:

Authenticate and resolve tenant quotas (remaining daily tokens, max concurrent jobs, max queued depth).
Validate payload size and format at the edge; reject early with 413 or 422 rather than after dequeue.
Return 429 or 503 with a clear Retry-After when the tenant or global queue depth exceeds policy. Clients should back off; your API should not absorb unbounded promises.

This is where batch diverges sharply from chat: a well-behaved batch client expects backpressure and schedules work across a time window. A chat client interprets 503 as outage.
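The checks above can be expressed as one pure decision function that the HTTP handler maps onto a status code and a `Retry-After` header. This is a sketch under assumptions: the policy fields, the error codes, the fixed 60 and 120 second hints, and the UTC-midnight reset of the daily budget are all hypothetical choices, not requirements.

```typescript
type AdmissionRequest = { inputBytes: number; estimatedTokens: number };
type TenantSnapshot = { queuedJobs: number; tokensUsedToday: number };
type AdmissionPolicy = {
  maxInputBytes: number;
  maxQueuedPerTenant: number;
  dailyTokenBudget: number;
  maxGlobalQueueDepth: number;
};
type AdmissionDecision =
  | { admit: true }
  | { admit: false; status: 413 | 429 | 503; code: string; retryAfterSec?: number };

// Assumes the daily token budget resets at 00:00 UTC.
function secondsUntilUtcMidnight(now: Date): number {
  const next = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return Math.ceil((next - now.getTime()) / 1000);
}

function decideAdmission(
  req: AdmissionRequest,
  tenant: TenantSnapshot,
  globalQueueDepth: number,
  policy: AdmissionPolicy,
  now: Date,
): AdmissionDecision {
  // Cheap validation first: an oversized payload is not worth a retry.
  if (req.inputBytes > policy.maxInputBytes) {
    return { admit: false, status: 413, code: "PAYLOAD_TOO_LARGE" };
  }
  // Global overload affects every tenant, so it surfaces as 503.
  if (globalQueueDepth >= policy.maxGlobalQueueDepth) {
    return { admit: false, status: 503, code: "GLOBAL_BACKLOG", retryAfterSec: 120 };
  }
  // Tenant-scoped limits surface as 429 so other tenants are unaffected.
  if (tenant.queuedJobs >= policy.maxQueuedPerTenant) {
    return { admit: false, status: 429, code: "TENANT_QUEUE_DEPTH", retryAfterSec: 60 };
  }
  if (tenant.tokensUsedToday + req.estimatedTokens > policy.dailyTokenBudget) {
    return {
      admit: false,
      status: 429,
      code: "DAILY_TOKEN_BUDGET",
      retryAfterSec: secondsUntilUtcMidnight(now),
    };
  }
  return { admit: true };
}
```

Keeping the decision pure makes the policy easy to unit-test and lets the same function run in a dry-run mode before limits are enforced.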

Batching inside the worker: micro-batches without silent coupling

Workers often pack multiple compatible jobs into one forward pass (same model revision, similar sequence lengths) to raise MFU (model FLOPs utilization). Trade-offs:

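Higher throughput, but higher per-job latency inside the batch: each job waits for the batch to fill and, with static batching, for its longest sequence to finish.
Coupled failure domains: if the kernel throws, every job in the batch retries. Mitigate by shrinking micro-batches when error rates rise and by isolating per-job output tensors and post-processing.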
Fairness distortion: always-packing the largest jobs first can starve short jobs; mix batch sizes or enforce maximum wait time in queue before partial send.
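One way to combine compatibility packing with a maximum wait is to let the oldest pending job anchor each batch. The sketch below is a simplified illustration: `nextMicroBatch`, `maxLengthRatio`, and the other option names are hypothetical, and prompt token count stands in for sequence length.

```typescript
type PendingJob = {
  jobId: string;
  modelRevision: string;
  promptTokens: number;
  enqueuedAt: number;
};

type PackOptions = {
  maxBatchSize: number;
  maxWaitMs: number;      // send a partial batch once the oldest job waited this long
  maxLengthRatio: number; // longest / shortest prompt allowed inside one batch
};

function lengthsCompatible(a: number, b: number, maxRatio: number): boolean {
  const longer = Math.max(a, b);
  const shorter = Math.max(1, Math.min(a, b));
  return longer / shorter <= maxRatio;
}

// Returns the jobs to run now, or an empty array if it is better to keep waiting.
function nextMicroBatch(pending: PendingJob[], now: number, opts: PackOptions): PendingJob[] {
  const sorted = [...pending].sort((a, b) => a.enqueuedAt - b.enqueuedAt);
  // The oldest job anchors the batch, so short or odd-sized jobs cannot be
  // skipped forever in favour of better-packing neighbours.
  const anchor = sorted[0];
  if (anchor === undefined) return [];

  const batch = sorted
    .filter(
      (job) =>
        job.modelRevision === anchor.modelRevision &&
        lengthsCompatible(job.promptTokens, anchor.promptTokens, opts.maxLengthRatio),
    )
    .slice(0, opts.maxBatchSize);

  const full = batch.length >= opts.maxBatchSize;
  const waitedTooLong = now - anchor.enqueuedAt >= opts.maxWaitMs;
  return full || waitedTooLong ? batch : [];
}
```

Lowering `maxBatchSize` at runtime when retries climb is a simple way to shrink the coupled failure domain without touching the packing logic.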

Results delivery: least privilege and least surprise

Avoid returning large completions inline in status polling responses. Prefer:

Write outputs to object storage; status returns a short ETag and signed download URL with tight expiry.
Webhooks with signed payloads and retries; treat webhook delivery as its own small queue with DLQ.

For regulated workloads, log who fetched which artifact, not just that generation succeeded.
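For the signed-webhook option, a common approach is an HMAC over a timestamp plus the raw body, verified in constant time by the receiver. The sketch below uses Node's built-in `createHmac` and `timingSafeEqual`. The `timestamp.body` signing format, the function names, and the 300 second tolerance are assumptions for illustration, not a standard.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sender side: sign "<timestamp>.<raw body>" with a per-tenant shared secret.
function signWebhook(secret: string, timestampSec: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestampSec}.${body}`).digest("hex");
}

// Receiver side: reject stale timestamps (replay protection), then compare
// signatures in constant time.
function verifyWebhook(
  secret: string,
  timestampSec: number,
  body: string,
  signatureHex: string,
  nowSec: number,
  toleranceSec = 300,
): boolean {
  if (Math.abs(nowSec - timestampSec) > toleranceSec) return false;
  const expected = Buffer.from(signWebhook(secret, timestampSec, body), "hex");
  const given = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

The receiver must verify against the raw request bytes, before any JSON parsing or re-serialization, or the signature will not match.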

Observability: the metrics that actually steer capacity

Minimum useful set:

Queue depth and age of oldest runnable job (SLO burn for “time-to-start”).
Time-in-queue vs GPU-active time: if queue waits are high while GPUs sit idle, your dequeue or packing logic is wrong; if queues keep growing while GPUs are saturated, you are under-provisioned or over-admitted.
Per-tenant tokens generated, retry rate, dead-letter rate.
Saturation: GPU utilization, power limit throttling, PCIe/NVLink stalls—correlate with batch shapes.

Dashboards should answer: “If we 2x enterprise signups tomorrow, which knob—concurrency, fair weights, or hardware—do we turn first?”
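The queue-side signals can be derived from job rows alone. Below is a minimal sketch, assuming each row carries `enqueuedAt`, `startedAt`, and `finishedAt` timestamps in milliseconds. The row shape and the `queueSnapshot` name are hypothetical, and in production these values would usually be emitted as gauges and histograms rather than computed by scanning rows.

```typescript
type MetricsRow = {
  tenantId: string;
  status: "queued" | "running" | "succeeded" | "failed";
  enqueuedAt: number;
  startedAt?: number;
  finishedAt?: number;
};

type QueueSnapshot = {
  queueDepth: number;
  oldestQueuedAgeMs: number;
  depthByTenant: Record<string, number>;
  avgQueueWaitMs: number; // time-in-queue for jobs that have started
  avgRunMs: number;       // worker-active time for jobs that have finished
};

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
}

function queueSnapshot(rows: MetricsRow[], now: number): QueueSnapshot {
  const queued = rows.filter((row) => row.status === "queued");

  const depthByTenant: Record<string, number> = {};
  for (const row of queued) {
    depthByTenant[row.tenantId] = (depthByTenant[row.tenantId] ?? 0) + 1;
  }

  const waits: number[] = [];
  const runs: number[] = [];
  for (const row of rows) {
    if (row.startedAt === undefined) continue;
    waits.push(row.startedAt - row.enqueuedAt);
    if (row.finishedAt !== undefined) runs.push(row.finishedAt - row.startedAt);
  }

  return {
    queueDepth: queued.length,
    oldestQueuedAgeMs: queued.reduce((max, row) => Math.max(max, now - row.enqueuedAt), 0),
    depthByTenant,
    avgQueueWaitMs: mean(waits),
    avgRunMs: mean(runs),
  };
}
```

Alerting on `oldestQueuedAgeMs` rather than raw depth tends to track the time-to-start SLO more directly, because a deep queue of tiny jobs can still drain on schedule.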

Practical example: enqueue with tenant depth guard

The following TypeScript sketch shows an API handler that checks per-tenant queued depth before writing a job row and enqueuing work. It is illustrative: swap the in-memory counter for Redis INCR with TTL, or for a SQL COUNT(*) with indexes, depending on your store.

import { randomUUID } from "crypto";

type JobSpec = {
  tenantId: string;
  model: "acme-embed-v3";
  inputObjectKey: string;
  maxOutputTokens: number;
};

type EnqueueResult =
  | { ok: true; jobId: string }
  | { ok: false; reason: "QUEUE_DEPTH"; retryAfterSec: number };

const MAX_QUEUED_PER_TENANT = 5000;

/** Example counters; production would use Redis/DB. */
const queuedDepth = new Map<string, number>();

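export async function enqueueJob(spec: JobSpec): Promise<EnqueueResult> {
  const depth = queuedDepth.get(spec.tenantId) ?? 0;
  if (depth >= MAX_QUEUED_PER_TENANT) {
    return { ok: false, reason: "QUEUE_DEPTH", retryAfterSec: 60 };
  }

  // Reserve the slot before the first await. Updating the counter only after
  // the awaits would let concurrent requests all pass the check above and
  // then overwrite each other's increments.
  queuedDepth.set(spec.tenantId, depth + 1);

  // randomUUID() yields a v4 id; substitute a UUID v7 or ULID generator if
  // you want time-ordered ids.
  const jobId = randomUUID();

  try {
    // 1) Persist immutable spec + status=queued (transaction).
    await persistJobRow({ jobId, spec, status: "queued", enqueuedAt: Date.now() });

    // 2) Publish to durable queue (partition key = tenantId for ordering if needed).
    await publishToQueue({ jobId, tenantId: spec.tenantId });
  } catch (err) {
    // Release the reservation so a failed enqueue does not consume quota.
    const current = queuedDepth.get(spec.tenantId) ?? 1;
    queuedDepth.set(spec.tenantId, Math.max(0, current - 1));
    throw err;
  }

  return { ok: true, jobId };
}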

declare function persistJobRow(
  row: {
    jobId: string;
    spec: JobSpec;
    status: "queued";
    enqueuedAt: number;
  },
): Promise<void>;

declare function publishToQueue(msg: {
  jobId: string;
  tenantId: string;
}): Promise<void>;

The important idea is not the map—it is the contract: overload surfaces as a typed error with retry guidance, not as a wedged database or an hours-late silent failure.

Common mistakes and pitfalls

Unbounded POST bodies for document text. Stream to object storage first; queue references only.
No lease / visibility timeout on workers, leading to duplicate expensive inference or lost jobs.
Global FIFO without tenant fairness, which guarantees an incident the first time two large customers share a cluster.
Mixing batch and synchronous traffic on the same GPU pool without separate quotas—interactive latency becomes the canary for batch overload.
Missing pinned model revisions in job specs, making it impossible to explain why two “identical” jobs differ after a rollout.
Polling clients without exponential backoff, turning status checks into a self-inflicted DDoS on your API tier.
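The polling pitfall has a cheap client-side remedy. The helper below is a sketch of exponential backoff with "equal jitter": half of the delay is fixed and half is random, so clients never hammer the API with zero delay and do not synchronize with each other. `pollDelayMs` and its default values are hypothetical.

```typescript
// Delay before the next status poll. attempt starts at 0 and grows by one
// after each poll that finds the job still queued or running.
function pollDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 60_000,
  random: () => number = Math.random,
): number {
  // Exponential growth, capped so long-running jobs are still polled.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  const half = ceiling / 2;
  // Fixed half plus a random half spreads clients out over time.
  return Math.floor(half + random() * half);
}
```

If the status response carries a `Retry-After` hint, a well-behaved client should wait at least that long and use the computed delay only as a floor.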

Conclusion

Batch LLM inference is capacity management dressed up as an API. Durable queues, explicit job models, tenant-scoped fairness, and edge admission control are what keep GPU spend and tail latency aligned with business expectations—not a larger autoscaling group alone. Invest early in queue-age metrics and per-tenant accounting; they pay off the first time marketing runs a campaign the same night a Fortune 500 customer loads a year of archives.

If you are designing inference platforms or hardening an existing deployment, it helps to treat scheduling, storage, and observability as one system.
