LLM cost governance in production APIs: token budgets, model routing, and spend guardrails
Per-tenant token budgets, model routing tiers, pre-flight estimation, and circuit breakers that stop LLM spend from becoming an incident—patterns for multi-tenant backends.
Your assistant feature ships on Friday. By Monday, a single integration test left running in staging has burned through a month’s inference budget. By Wednesday, a power user pastes a 40-page PDF into every thread and your P99 latency doubles while finance asks why the OpenAI invoice tripled. The model did nothing wrong—you never defined who is allowed to spend how much, on which model, under what failure mode when limits are hit.
LLM APIs bill in tokens, not requests. That pricing model turns familiar backend problems—rate limiting, quotas, degradation—into cost incidents that show up on a vendor invoice instead of a CPU graph. This article walks through cost governance for production LLM backends: per-tenant budgets, routing between model tiers, pre-flight estimation, and guardrails that fail closed without silently breaking the product.
Why request-based rate limits are not enough
Traditional API rate limiting counts requests per minute. That works when each request has roughly similar cost. LLM calls do not:
| Input pattern | Relative cost |
|---|---|
| Short FAQ answer | Low |
| Long document + tool loop (3–5 hops) | High |
| Batch summarization of attachments | Very high |
| “Retry on any JSON parse error” loop | Unbounded |
A client that stays under 100 req/min can still trigger thousands of dollars in spend if each request carries a 100k-token context and the model retries internally. Cost governance starts by metering tokens (prompt + completion, and tool payloads where applicable), not HTTP calls alone.
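To make the gap concrete, here is a minimal sketch that prices two clients sending the same number of requests. The per-million-token prices are made-up placeholders rather than any vendor's real rates, and `costUsd` is a hypothetical helper introduced only for this illustration.

```typescript
// Hypothetical prices in USD per one million tokens; substitute your provider's rate card.
type Price = { promptPerMTok: number; completionPerMTok: number };
type Usage = { promptTokens: number; completionTokens: number };

const EXAMPLE_PRICE: Price = { promptPerMTok: 2, completionPerMTok: 8 };

/** Cost of a single call, derived from token counts rather than request counts. */
export function costUsd(usage: Usage, price: Price): number {
  return (
    (usage.promptTokens / 1_000_000) * price.promptPerMTok +
    (usage.completionTokens / 1_000_000) * price.completionPerMTok
  );
}

// Two clients, both sending 100 requests: identical from a request-based limiter's view.
const faqCall: Usage = { promptTokens: 300, completionTokens: 150 };
const documentCall: Usage = { promptTokens: 100_000, completionTokens: 1_000 };

export const faqSpend = 100 * costUsd(faqCall, EXAMPLE_PRICE);
export const documentSpend = 100 * costUsd(documentCall, EXAMPLE_PRICE);
```

Under these placeholder prices the FAQ client spends about $0.18 and the document-heavy client about $20.80, more than a hundred times as much for the same request count.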
In consulting engagements on assistant backends, the first production incident is rarely “the model hallucinated”—it is runaway spend from missing budgets, a misconfigured cron job, or a client that treats the API like a free batch processor.
The governance stack: four layers
Think of cost control as layered defenses, similar to security:
- Admission — Reject or queue work before it reaches the model (budget check, max input size, concurrency cap).
- Routing — Send cheap work to cheaper models; escalate only when signals justify it.
- Execution — Bound tool loops, streaming timeouts, and max output tokens per call.
- Reconciliation — Persist actual usage, reconcile estimates, alert on drift, and block tenants that exceed policy.
Skipping any layer leaves a hole. Admission without reconciliation means you discover overages on the invoice. Routing without execution bounds means a “small” model can still loop on tools until spend explodes.
Per-tenant token budgets
A token budget is a rolling or calendar window cap: e.g. 500k prompt+completion tokens per organization per day. Store counters in a low-latency store (Redis, PostgreSQL with UPSERT, or your billing vendor’s meter).
Hard vs soft limits
- Soft limit — Warn, throttle, or downgrade model tier; user experience degrades but core flows continue.
- Hard limit — Return 429 or 402 Payment Required with a Problem Details body; no model call is made.
Production systems usually combine both: soft at 80%, hard at 100%. The soft tier is where model routing (below) does most of its work.
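The soft and hard tiers can be sketched as a small classifier. This assumes a simple ratio-based policy; `budgetState` and `actionFor` are hypothetical names, and the 0.8 default only mirrors the 80% example above.

```typescript
type BudgetState = "ok" | "soft" | "hard";

/**
 * Classify a tenant's position in the budget window.
 * softRatio = 0.8 mirrors the 80% example; both thresholds are policy choices.
 */
export function budgetState(used: number, cap: number, softRatio = 0.8): BudgetState {
  if (cap <= 0 || used >= cap) return "hard";
  if (used >= cap * softRatio) return "soft";
  return "ok";
}

/** Map each state to an action the request pipeline can take. */
export function actionFor(state: BudgetState): "serve" | "downgrade" | "reject" {
  switch (state) {
    case "ok":
      return "serve";
    case "soft":
      return "downgrade"; // cheaper tier; core flow continues
    case "hard":
      return "reject"; // no model call is made
  }
}
```

Keeping the classification separate from the action makes it easier to vary the soft-limit behavior (warn, throttle, or downgrade) per plan without touching the threshold math.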
What to count
At minimum, increment counters with vendor-reported usage after each completion. For pre-flight admission, maintain a conservative estimate:
type UsageRecord = {
  tenantId: string;
  promptTokens: number;
  completionTokens: number;
  model: string;
  requestId: string;
  recordedAt: string; // ISO 8601 timestamp
};

/** Rough pre-flight bound before calling the provider */
export function estimatePromptTokens(messages: { content: string }[]): number {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  // English-ish text averages roughly 4 chars/token; dividing by 3.5 instead
  // pushes the estimate up so it stays conservative. Tune per tokenizer.
  return Math.ceil(chars / 3.5);
}
Underestimating prompt size causes budget overruns; overestimating frustrates legitimate users. Reconcile estimates against actual usage fields from the provider response and tune the heuristic per locale and content type (code and CJK text differ).
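One way to tune the heuristic is to derive the divisor from observed traffic. The sketch below is an assumption-laden illustration: `calibrateCharsPerToken`, `estimateDrift`, and the choice of a low percentile are hypothetical, and in practice you would likely keep one sample set per locale or content type.

```typescript
type Sample = { chars: number; actualPromptTokens: number };

/**
 * Derive a chars-per-token ratio from observed requests.
 * A low percentile of per-request ratios is used so the estimate leans high
 * (fewer chars per token means more estimated tokens).
 */
export function calibrateCharsPerToken(
  samples: Sample[],
  percentile = 0.1,
  fallback = 3.5,
): number {
  const ratios = samples
    .filter((s) => s.chars > 0 && s.actualPromptTokens > 0)
    .map((s) => s.chars / s.actualPromptTokens)
    .sort((a, b) => a - b);
  if (ratios.length === 0) return fallback;
  const index = Math.min(ratios.length - 1, Math.floor(percentile * ratios.length));
  return ratios[index] ?? fallback;
}

/** Relative drift between an estimate and the provider-reported value. */
export function estimateDrift(estimated: number, actual: number): number {
  return actual === 0 ? 0 : (estimated - actual) / actual;
}
```

A positive drift means the estimate ran high (users may be rejected too early); a negative drift means it ran low (budgets can be overrun).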
Idempotency and double-charging
Retries must not double-count. Tie usage increments to an idempotency key per completion (the same pattern as payment APIs). If the client retries POST /chat with the same key, return the cached response without incrementing usage again.
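A minimal in-memory sketch of the metering side of that rule follows. `IdempotentUsageLedger` is a hypothetical stand-in for a real store; in production the "seen" check and the increment would need to happen atomically (for example in a single transaction or server-side script).

```typescript
/** In-memory stand-in for a usage store keyed by idempotency key (hypothetical). */
export class IdempotentUsageLedger {
  private readonly seen = new Set<string>();
  private readonly totals = new Map<string, number>();

  /** Returns true if usage was recorded, false if this key was already counted. */
  record(tenantId: string, idempotencyKey: string, tokens: number): boolean {
    const key = `${tenantId}:${idempotencyKey}`;
    if (this.seen.has(key)) return false; // retry: do not double-count
    this.seen.add(key);
    this.totals.set(tenantId, (this.totals.get(tenantId) ?? 0) + tokens);
    return true;
  }

  /** Total tokens recorded for a tenant so far. */
  total(tenantId: string): number {
    return this.totals.get(tenantId) ?? 0;
  }
}
```

Scoping the key by tenant prevents one tenant's idempotency key from masking another tenant's usage.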
Model routing: spend-aware tier selection
Model routing chooses which model handles a request. Common tiers:
| Tier | Typical use | Cost profile |
|---|---|---|
| Fast / small | Classification, routing, short replies | Lowest |
| Standard | General assistant turns | Medium |
| Large / reasoning | Complex analysis, long context | Highest |
Routing signals include:
- Remaining budget — Below soft threshold → downgrade one tier.
- Task type — Explicit intent from client or classifier (support vs analysis).
- Context size — Above N estimated tokens → require large-context model or reject with “document too large.”
- Latency SLO — Under load, prefer faster/cheaper models for non-critical paths.
Avoid “always use the best model.” That is the LLM equivalent of running every query on the largest database instance.
Escalation, not default-up
A robust pattern is to start cheap and escalate on failure:
- Attempt with small model and structured output schema.
- If validation fails or confidence is low (logprobs, self-reported uncertainty, or a cheap judge), escalate once.
- Cap escalations per request (e.g. max one step-up) to prevent loops.
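type ModelTier = "fast" | "standard" | "large";

// Model names are placeholders. "standard" and "large" point at the same model
// here; map "large" to your provider's long-context or reasoning model.
const TIER_MODEL: Record<ModelTier, string> = {
  fast: "gpt-4.1-mini",
  standard: "gpt-4.1",
  large: "gpt-4.1",
};

export function selectTier(input: {
  estimatedPromptTokens: number;
  budgetRemainingRatio: number; // 0..1
  taskPriority: "low" | "normal" | "high";
}): ModelTier {
  // Budget pressure wins first: under 10% remaining, only high-priority work
  // keeps its normal tier.
  if (input.budgetRemainingRatio < 0.1 && input.taskPriority !== "high") {
    return "fast";
  }
  // Long prompts need the large-context tier.
  if (input.estimatedPromptTokens > 32_000) {
    return "large";
  }
  return input.taskPriority === "low" ? "fast" : "standard";
}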
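The escalation steps listed above can be sketched as a bounded loop. This is one possible shape, not a definitive implementation: `withEscalation`, the `Attempt` type, and the `run` callback (which would call the model and validate its output) are all hypothetical.

```typescript
type Tier = "fast" | "standard" | "large";
type Attempt<T> = { ok: true; value: T } | { ok: false; reason: string };

const NEXT_TIER: Record<Tier, Tier | null> = {
  fast: "standard",
  standard: "large",
  large: null,
};

/**
 * Try the starting tier; on validation failure, step up at most `maxEscalations` times.
 * `run` is a hypothetical callback that calls the model and validates the output.
 */
export async function withEscalation<T>(
  start: Tier,
  run: (tier: Tier) => Promise<Attempt<T>>,
  maxEscalations = 1,
): Promise<{ result: Attempt<T>; tiersTried: Tier[] }> {
  const tiersTried: Tier[] = [];
  let tier: Tier | null = start;
  let result: Attempt<T> = { ok: false, reason: "not attempted" };
  for (let step = 0; tier !== null && step <= maxEscalations; step++) {
    tiersTried.push(tier);
    result = await run(tier);
    if (result.ok) break; // validated output: stop spending
    tier = NEXT_TIER[tier]; // null once the top tier has failed
  }
  return { result, tiersTried };
}
```

Returning `tiersTried` alongside the result lets the caller meter every attempt and log why a request ended up on a more expensive model.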
Document routing decisions in logs (tenant, tier, reason code) so finance and product can audit why spend spiked.
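As a sketch of what such a log entry might look like, the variant below mirrors the branches of selectTier but also names the branch that fired. The reason codes, the `llm.routing` event name, and the field names are illustrative assumptions.

```typescript
type RoutingReason = "budget_low" | "large_context" | "low_priority" | "default";
type RoutingDecision = { tier: "fast" | "standard" | "large"; reason: RoutingReason };

/** Same branches as selectTier, but each branch also reports why it fired. */
export function selectTierWithReason(input: {
  estimatedPromptTokens: number;
  budgetRemainingRatio: number; // 0..1
  taskPriority: "low" | "normal" | "high";
}): RoutingDecision {
  if (input.budgetRemainingRatio < 0.1 && input.taskPriority !== "high") {
    return { tier: "fast", reason: "budget_low" };
  }
  if (input.estimatedPromptTokens > 32_000) {
    return { tier: "large", reason: "large_context" };
  }
  if (input.taskPriority === "low") {
    return { tier: "fast", reason: "low_priority" };
  }
  return { tier: "standard", reason: "default" };
}

/** One JSON line per decision, suitable for a structured log pipeline. */
export function routingLogLine(tenantId: string, requestId: string, d: RoutingDecision): string {
  return JSON.stringify({ event: "llm.routing", tenantId, requestId, tier: d.tier, reason: d.reason });
}
```

A closed set of reason codes is easier to aggregate than free-text messages when someone asks why a tenant's spend moved to the large tier.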
Pre-flight admission: stop expensive work early
Before the handler calls fetch() against the provider, apply these checks:
- Max input bytes / tokens — Reject oversized attachments at upload time, not in the chat handler.
- Concurrency per tenant — Limit in-flight LLM calls; queue or reject when saturated (see admission control patterns for HTTP APIs).
- Budget check — estimatedPrompt + maxCompletion <= remaining; if not, fail without calling the model.
- Global spend circuit breaker — If platform-wide spend in the last hour exceeds a threshold, shed non-critical traffic (similar to load shedding tiers).
The circuit breaker protects your margin when a provider price change, bug, or attack affects all tenants. Pair it with paging—not silent failure.
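/** Raised when the platform-wide spend circuit is open. */
export class SpendCircuitOpenError extends Error {
  readonly code = "SPEND_CIRCUIT_OPEN";
}

/** Raised when a single tenant would exceed its own budget. */
export class TokenBudgetExceededError extends Error {
  readonly code = "TOKEN_BUDGET_EXCEEDED";
}

export async function assertTenantBudget(
  redis: { get: (k: string) => Promise<string | null> },
  tenantId: string,
  estimatedTokens: number,
  dailyCap: number,
): Promise<void> {
  const key = `llm:usage:daily:${tenantId}`;
  // Read-then-check is not atomic: concurrent requests can each pass before
  // either records usage, so reconcile against actual usage afterwards.
  const used = Number((await redis.get(key)) ?? 0);
  if (used + estimatedTokens > dailyCap) {
    throw new TokenBudgetExceededError(`Tenant ${tenantId} daily token budget exceeded`);
  }
}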
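For the global spend circuit breaker described above, a minimal in-process sketch could look like the following. `SpendCircuitBreaker` is a hypothetical class; a real deployment would keep the window in a shared store so every API instance sees the same total, and would page an operator when the breaker opens.

```typescript
/** Sliding-window spend tracker; "open" means spend in the window exceeds the threshold. */
export class SpendCircuitBreaker {
  private events: { atMs: number; usd: number }[] = [];
  private readonly thresholdUsd: number;
  private readonly windowMs: number;

  constructor(thresholdUsd: number, windowMs = 60 * 60 * 1000) {
    this.thresholdUsd = thresholdUsd;
    this.windowMs = windowMs; // defaults to one hour, as in the text
  }

  /** Record the cost of a completed call. */
  record(usd: number, nowMs: number = Date.now()): void {
    this.events.push({ atMs: nowMs, usd });
  }

  /** Drop events older than the window, then sum what remains. */
  spendInWindow(nowMs: number = Date.now()): number {
    this.events = this.events.filter((e) => nowMs - e.atMs < this.windowMs);
    return this.events.reduce((sum, e) => sum + e.usd, 0);
  }

  /** Critical traffic is always admitted; non-critical traffic is shed while open. */
  shouldAdmit(critical: boolean, nowMs: number = Date.now()): boolean {
    return critical || this.spendInWindow(nowMs) <= this.thresholdUsd;
  }
}
```

Passing `nowMs` explicitly keeps the breaker testable without mocking the clock.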
Return RFC 9457 Problem Details with type URIs your clients can branch on (/problems/token-budget-exceeded, /problems/payload-too-large).
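A sketch of such a response body follows. The core members (type, title, status, detail, instance) and the application/problem+json media type come from RFC 9457; the extension members `tokensUsed`, `tokenCap`, and `resetsAt`, and the helper name itself, are assumptions made for illustration.

```typescript
/** RFC 9457 core members plus extension members (the RFC permits extensions). */
type ProblemDetails = {
  type: string;
  title: string;
  status: number;
  detail?: string;
  instance?: string;
  [extension: string]: unknown;
};

export function tokenBudgetExceededProblem(input: {
  tenantId: string;
  used: number;
  cap: number;
  resetsAt: string; // ISO 8601 timestamp of the next window start
}): { status: number; headers: Record<string, string>; body: ProblemDetails } {
  return {
    status: 429,
    headers: { "Content-Type": "application/problem+json" },
    body: {
      type: "/problems/token-budget-exceeded",
      title: "Token budget exceeded",
      status: 429,
      detail: `Tenant ${input.tenantId} has used ${input.used} of ${input.cap} tokens in the current window.`,
      // Extension members: illustrative names, not defined by the RFC.
      tokensUsed: input.used,
      tokenCap: input.cap,
      resetsAt: input.resetsAt,
    },
  };
}
```

Clients can branch on `body.type` and use the extension members to render a specific message instead of parsing `detail`.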
Execution bounds: where unbounded spend hides
Even with admission and routing, tool-call loops and streaming without caps cause surprises:
- Max tool rounds per user turn — e.g. 5; return a clear error when exceeded.
- Max completion tokens — Set per tier; do not rely on the model to stop politely.
- Wall-clock deadline — Cancel the upstream request when exceeded (request deadlines and cancellation apply here).
- Duplicate tool calls — Detect identical tool invocations in a loop and break (common failure mode with weaker models).
These bounds belong in your orchestration layer, not scattered in prompts. Prompts are hints; code is enforcement.
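Two of those bounds, the round cap and duplicate-call detection, can be sketched as a small guard that the orchestration loop consults on every model round. `ToolLoopGuard` and its verdict shape are hypothetical names for this illustration.

```typescript
type ToolCall = { name: string; args: unknown };
type LoopVerdict = { ok: true } | { ok: false; reason: "max_rounds" | "duplicate_call" };

/** Tracks tool rounds for one user turn and rejects runaway or repeated calls. */
export class ToolLoopGuard {
  private rounds = 0;
  private readonly seen = new Set<string>();
  private readonly maxRounds: number;

  constructor(maxRounds = 5) {
    this.maxRounds = maxRounds;
  }

  /** Call once per model round, before executing the requested tools. */
  check(calls: ToolCall[]): LoopVerdict {
    this.rounds += 1;
    if (this.rounds > this.maxRounds) return { ok: false, reason: "max_rounds" };
    for (const call of calls) {
      // JSON.stringify is key-order sensitive; enough to catch verbatim repeats.
      const fingerprint = `${call.name}:${JSON.stringify(call.args)}`;
      if (this.seen.has(fingerprint)) return { ok: false, reason: "duplicate_call" };
      this.seen.add(fingerprint);
    }
    return { ok: true };
  }
}
```

Create one guard per user turn so that counts and fingerprints do not leak across turns or tenants.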
Reconciliation, alerting, and product UX
Governance fails if only engineers see the numbers.
- Persist every completion’s usage to an analytics table or warehouse; aggregate by tenant, model, feature flag, and route.
- Alert on anomaly: 3× daily baseline, single tenant >X% of platform spend, or estimate vs actual drift >20%.
- Surface usage in-product for B2B: “72% of monthly AI quota used” reduces surprise upgrades and support tickets.
When a hard limit triggers, the UX should explain what happened and what to do (“Upgrade plan”, “Remove attachment”, “Try again tomorrow”)—not a generic 500.
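The alert conditions listed above can be expressed as a pure function over a daily rollup. This is a sketch under assumptions: the row shape and `detectAnomalies` are hypothetical, and the 25% platform-share default is only a placeholder for the unspecified X%.

```typescript
type DailySpend = {
  tenantId: string;
  todayUsd: number;
  baselineUsd: number; // e.g. trailing daily average
  estimatedTokens: number;
  actualTokens: number;
};
type Alert = { tenantId: string; kind: "baseline_spike" | "platform_share" | "estimate_drift" };

export function detectAnomalies(rows: DailySpend[], maxShare = 0.25, maxDrift = 0.2): Alert[] {
  const platformTotal = rows.reduce((sum, r) => sum + r.todayUsd, 0);
  const alerts: Alert[] = [];
  for (const r of rows) {
    // More than 3x the tenant's own daily baseline.
    if (r.baselineUsd > 0 && r.todayUsd > 3 * r.baselineUsd) {
      alerts.push({ tenantId: r.tenantId, kind: "baseline_spike" });
    }
    // A single tenant taking an outsized share of platform spend.
    if (platformTotal > 0 && r.todayUsd / platformTotal > maxShare) {
      alerts.push({ tenantId: r.tenantId, kind: "platform_share" });
    }
    // Pre-flight estimates drifting from provider-reported usage.
    if (r.actualTokens > 0 && Math.abs(r.estimatedTokens - r.actualTokens) / r.actualTokens > maxDrift) {
      alerts.push({ tenantId: r.tenantId, kind: "estimate_drift" });
    }
  }
  return alerts;
}
```

Because the function is pure, the same checks can run in a scheduled job against the warehouse and in unit tests with synthetic rows.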
Practical example: guarded chat handler
The handler below ties together budget check, tier selection, a single model call with output cap, and post-hoc usage recording. It is illustrative—not a full SDK wrapper—but matches what you would deploy behind POST /v1/chat.
import { createHash } from "node:crypto";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

type ChatRequest = {
  tenantId: string;
  idempotencyKey: string;
  messages: ChatMessage[];
  maxCompletionTokens?: number;
  priority?: "low" | "normal" | "high";
};

type ChatResponse = {
  message: ChatMessage;
  model: string;
  usage: { promptTokens: number; completionTokens: number };
};

type LlmClient = {
  complete(input: {
    model: string;
    messages: ChatMessage[];
    maxCompletionTokens: number;
    signal: AbortSignal;
  }): Promise<ChatResponse>;
};

type UsageStore = {
  getDailyUsage(tenantId: string): Promise<number>;
  incrementDailyUsage(tenantId: string, tokens: number): Promise<void>;
  getCachedResponse(key: string): Promise<ChatResponse | null>;
  setCachedResponse(key: string, res: ChatResponse, ttlSec: number): Promise<void>;
};

const DAILY_CAP = 500_000;
const MAX_COMPLETION_BY_TIER = { fast: 512, standard: 2_048, large: 4_096 } as const;

export async function handleChat(
  req: ChatRequest,
  deps: { llm: LlmClient; usage: UsageStore; dailyCap?: number },
): Promise<ChatResponse> {
  const cap = deps.dailyCap ?? DAILY_CAP;

  // Idempotent replay: a retry with the same key gets the cached response
  // and is not metered again.
  const cacheKey = `idem:${req.tenantId}:${req.idempotencyKey}`;
  const cached = await deps.usage.getCachedResponse(cacheKey);
  if (cached) return cached;

  const estimatedPrompt = estimatePromptTokens(req.messages);
  const used = await deps.usage.getDailyUsage(req.tenantId);
  const remainingRatio = Math.max(0, (cap - used) / cap);

  const tier = selectTier({
    estimatedPromptTokens: estimatedPrompt,
    budgetRemainingRatio: remainingRatio,
    taskPriority: req.priority ?? "normal",
  });
  const model = TIER_MODEL[tier];
  // Callers may lower the completion cap but never raise it above the tier limit.
  const maxCompletion = Math.min(
    req.maxCompletionTokens ?? MAX_COMPLETION_BY_TIER[tier],
    MAX_COMPLETION_BY_TIER[tier],
  );

  // Admission: reserve the worst case (estimated prompt + completion cap)
  // and fail before any model call if it does not fit.
  if (used + estimatedPrompt + maxCompletion > cap) {
    const err = new Error("Token budget exceeded");
    (err as Error & { status: number }).status = 429;
    throw err;
  }

  // Wall-clock deadline: abort the upstream call after 60 seconds.
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 60_000);
  let result: ChatResponse;
  try {
    result = await deps.llm.complete({
      model,
      messages: req.messages,
      maxCompletionTokens: maxCompletion,
      signal: controller.signal,
    });
  } finally {
    clearTimeout(timeout);
  }

  // Reconciliation: meter provider-reported usage, not the estimate.
  const totalTokens = result.usage.promptTokens + result.usage.completionTokens;
  await deps.usage.incrementDailyUsage(req.tenantId, totalTokens);
  await deps.usage.setCachedResponse(cacheKey, result, 86_400);
  return result;
}

function estimatePromptTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 3.5);
}

function selectTier(input: {
  estimatedPromptTokens: number;
  budgetRemainingRatio: number;
  taskPriority: "low" | "normal" | "high";
}): keyof typeof MAX_COMPLETION_BY_TIER {
  if (input.budgetRemainingRatio < 0.1 && input.taskPriority !== "high") return "fast";
  if (input.estimatedPromptTokens > 32_000) return "large";
  return input.taskPriority === "low" ? "fast" : "standard";
}

const TIER_MODEL = {
  fast: "gpt-4.1-mini",
  standard: "gpt-4.1",
  large: "gpt-4.1",
} as const;

/** Optional: deterministic idempotency key from client payload */
export function hashIdempotencyPayload(tenantId: string, body: unknown): string {
  // JSON.stringify depends on key order, so equivalent payloads with reordered
  // keys hash differently; canonicalize first if clients may reorder fields.
  return createHash("sha256").update(tenantId).update(JSON.stringify(body)).digest("hex");
}
Wire incrementDailyUsage with TTL aligned to your billing window (UTC midnight or rolling 24h—pick one and document it). Expose tier and usage in response headers (X-LLM-Model, X-Token-Usage) for client telemetry.
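As a sketch of that wiring for the UTC-midnight option, the helpers below compute the counter TTL, a date-scoped counter key, and the response headers. The date-suffixed key is an alternative to a single per-tenant key, and the `X-Token-Usage` value format is an assumption, since only the header name is given above.

```typescript
/** Seconds from `now` until the next UTC midnight (calendar-day window). */
export function secondsUntilUtcMidnight(now: Date = new Date()): number {
  // Date.UTC normalizes day overflow, so "date + 1" rolls into the next month or year.
  const next = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return Math.max(1, Math.ceil((next - now.getTime()) / 1000));
}

/** Daily counter key that rolls over with the UTC date. */
export function dailyUsageKey(tenantId: string, now: Date = new Date()): string {
  return `llm:usage:daily:${tenantId}:${now.toISOString().slice(0, 10)}`;
}

/** Response headers for client telemetry; the value format is illustrative. */
export function usageHeaders(res: {
  model: string;
  usage: { promptTokens: number; completionTokens: number };
}): Record<string, string> {
  return {
    "X-LLM-Model": res.model,
    "X-Token-Usage": `prompt=${res.usage.promptTokens}; completion=${res.usage.completionTokens}`,
  };
}
```

With a date-scoped key the TTL is only a cleanup mechanism; correctness of the window comes from the key itself, which avoids a counter outliving its day if the expiry is ever set late.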
Common mistakes and pitfalls
- Metering only completions — Prompt tokens often dominate; ignoring them breaks budget math.
- Retry without idempotency — Transient 502s double usage and teach clients to hammer retries.
- Soft limits with no enforcement path — Warning emails do not stop a runaway script; hard caps or circuit breakers do.
- Routing in prompts only — “Use the cheap model when possible” is not auditable; encode routing in code with reason codes.
- Unbounded tool loops — The cost incident shows up as “one user message” with fifteen hidden model calls.
- Staging shares production keys without caps — Separate projects, keys, and budgets per environment.
- Hiding spend from customers — B2B buyers expect quota visibility; surprises erode trust faster than model quality issues.
Conclusion
LLM cost governance is part of API product design, not a finance afterthought. Meter tokens, enforce per-tenant budgets with hard and soft tiers, route to appropriate models, bound execution (tools, completions, deadlines), and reconcile actual usage against estimates. The teams that treat spend like capacity—measurable, limited, and observable—ship assistants that survive real traffic without turning inference into an unbounded operational risk.
If you are hardening a multi-tenant AI backend or untangling runaway inference bills, the highest-leverage work is usually admission and metering at the boundary—before the model sees the request. That is the same discipline you would apply to any production API that bills by consumption: define the contract, enforce it in code, and make limits visible to operators and users alike.