LLM inference gateways: model routing, fallback chains, and cost controls

Route LLM traffic across models and providers with explicit policies: tiered routing, fallback chains, budget caps, and observability—without scattering provider logic through your app.

By Matheus Palma · June 16, 2026 · about 11 min read

Software engineering · Artificial intelligence · Backend · API design · TypeScript · Observability

You start with one hosted model and a thin fetch wrapper. Six months later you run three providers, a self-hosted endpoint for PII-sensitive tenants, and a smaller model for classification. Product wants "premium" answers on paid plans; finance wants per-tenant spend caps; on-call gets paged when OpenAI returns 503s and your app does not fail over. Without a deliberate layer in front of inference, every route handler grows its own if (plan === 'pro') branch, retries are inconsistent, and nobody can answer how much a feature costs until the invoice arrives.

An LLM inference gateway is that layer: a single place where requests are classified, routed, retried, metered, and observed before they reach a provider. This article explains how to design one for production backends—the same patterns that show up when hardening multi-tenant assistants and API products where model spend is a first-class operational concern.

What an inference gateway is (and what it is not)

A gateway sits between your application logic and model providers. Your domain code calls something like gateway.complete(request); the gateway decides which model, which provider, and which retry policy apply.

It is not:

A vector database or RAG retriever (those feed context into the gateway).
A prompt management UI alone (versioning prompts is related but separate; the gateway consumes prompt templates, it does not replace them).
A semantic cache (caching is an optional optimization that can sit in front of the gateway; see the dedicated article on semantic caching for that design space).
A mandatory third-party product—you can implement a gateway as a module inside your API service or as a sidecar; the architecture matters more than the packaging.

The gateway's job is to make provider heterogeneity and policy explicit so product routes stay dumb and testable.

Routing dimensions: what drives model selection?

Routing is not "pick the cheapest model." Production policies usually combine several inputs:

Task type and quality tier

Different endpoints have different SLOs. A ticket classifier needs low latency and deterministic JSON; a long-form draft tolerates seconds of latency for better prose. Encode task type in the request (intent: 'classify' | 'draft' | 'summarize') or derive it from the route (POST /api/tickets/classify always maps to a routing profile).

Quality tiers (free / standard / premium) map to model families: e.g. a small fast model for free users, a frontier model for premium. The mapping should live in configuration, not scattered switch statements.
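A minimal sketch of keeping that mapping in one table. The tier and intent names follow the examples above; the provider names and model ids (`vendor-a`, `small-fast`, and so on) are placeholders, not real products.

```typescript
// Hypothetical tiers and task types; provider names and model ids are placeholders.
type Tier = "free" | "standard" | "premium";
type Intent = "classify" | "draft" | "summarize";

type ModelChoice = { provider: string; model: string };

// One table owns the mapping, so handlers never branch on plan or task.
const TIER_ROUTES: Record<Tier, Record<Intent, ModelChoice>> = {
  free: {
    classify: { provider: "vendor-a", model: "small-fast" },
    draft: { provider: "vendor-a", model: "small-fast" },
    summarize: { provider: "vendor-a", model: "small-fast" },
  },
  standard: {
    classify: { provider: "vendor-a", model: "small-fast" },
    draft: { provider: "vendor-a", model: "mid-size" },
    summarize: { provider: "vendor-a", model: "mid-size" },
  },
  premium: {
    classify: { provider: "vendor-a", model: "small-fast" },
    draft: { provider: "vendor-b", model: "frontier" },
    summarize: { provider: "vendor-b", model: "frontier" },
  },
};

function selectModel(tier: Tier, intent: Intent): ModelChoice {
  return TIER_ROUTES[tier][intent];
}
```

In practice the table would be loaded from configuration rather than compiled in, so a tier change does not require a deploy.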

Context size and modality

Models differ in context windows and supported modalities (vision, audio, tool formats). The gateway should estimate or measure input token count early and route oversized requests to a model that fits—or reject with a clear error before burning a provider call.
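One way to express the "fit or reject" rule, as a sketch. `ModelSpec`, `relativeCost`, and the result shape are assumptions for illustration; the token count is expected to come from whatever estimator the gateway already uses.

```typescript
type ModelSpec = { model: string; contextWindow: number; relativeCost: number };

type FitResult =
  | { ok: true; model: ModelSpec }
  | { ok: false; reason: "context_too_large"; neededTokens: number; largestWindow: number };

// Choose the cheapest model whose window holds the input plus the reserved output.
function pickModelForContext(
  models: ModelSpec[],
  inputTokens: number,
  maxOutputTokens: number,
): FitResult {
  const neededTokens = inputTokens + maxOutputTokens;
  const fitting = models
    .filter((m) => m.contextWindow >= neededTokens)
    .sort((a, b) => a.relativeCost - b.relativeCost);
  const best = fitting[0];
  if (best === undefined) {
    // Nothing fits: report enough detail for a clear client-facing error.
    const largestWindow = models.reduce((max, m) => Math.max(max, m.contextWindow), 0);
    return { ok: false, reason: "context_too_large", neededTokens, largestWindow };
  }
  return { ok: true, model: best };
}
```

Reserving `maxOutputTokens` in the check matters because the window covers both prompt and completion.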

Tenant, plan, and data residency

Multi-tenant SaaS often needs:

Per-tenant allowlists (certain customers contractually require EU-only inference).
Per-plan caps (max tokens per day, max concurrent streams).
PII routing (sensitive workloads to self-hosted or VPC-deployed models).

These constraints are hard filters applied before cost-based routing. A cheaper US-hosted model is not an option if the tenant's DPA forbids it.
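The ordering (filter first, rank by cost second) can be sketched as follows. The `Candidate` and `TenantConstraints` shapes are hypothetical.

```typescript
type Region = "eu" | "us";

type Candidate = {
  provider: string;
  model: string;
  region: Region;
  selfHosted: boolean;
  relativeCost: number;
};

type TenantConstraints = { allowedRegions: Region[]; requireSelfHosted: boolean };

// Hard constraints remove candidates; cost only orders what is left.
function eligibleByCost(candidates: Candidate[], tenant: TenantConstraints): Candidate[] {
  return candidates
    .filter((c) => tenant.allowedRegions.indexOf(c.region) !== -1)
    .filter((c) => !tenant.requireSelfHosted || c.selfHosted)
    .sort((a, b) => a.relativeCost - b.relativeCost);
}
```

An empty result means the request should fail closed rather than quietly widen the constraints.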

Latency and availability signals

Advanced gateways incorporate recent provider health: elevated 429/503 rates, rising p99 latency, or circuit-breaker state from the resilience patterns you would apply to any external dependency. Route away from unhealthy endpoints; do not wait for every user request to discover an outage.
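A rough sketch of one such signal: an error-rate check over a sliding time window. It is simpler than a full circuit breaker (there is no half-open probing), and the class name, thresholds, and the injected clock are all assumptions.

```typescript
type Outcome = { atMs: number; ok: boolean };

class ProviderHealth {
  private outcomes: Outcome[] = [];
  private readonly windowMs: number;
  private readonly minSamples: number;
  private readonly maxErrorRate: number;

  constructor(windowMs = 60_000, minSamples = 5, maxErrorRate = 0.5) {
    this.windowMs = windowMs;
    this.minSamples = minSamples;
    this.maxErrorRate = maxErrorRate;
  }

  // Call after every provider attempt; count 429, 503, and timeouts as failures.
  record(ok: boolean, nowMs: number): void {
    this.outcomes.push({ atMs: nowMs, ok });
    this.prune(nowMs);
  }

  // With too few recent samples the provider is treated as healthy.
  isHealthy(nowMs: number): boolean {
    this.prune(nowMs);
    if (this.outcomes.length < this.minSamples) return true;
    const failures = this.outcomes.filter((o) => !o.ok).length;
    return failures / this.outcomes.length <= this.maxErrorRate;
  }

  private prune(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    this.outcomes = this.outcomes.filter((o) => o.atMs >= cutoff);
  }
}
```

The router would consult `isHealthy` before choosing the primary and skip straight to the next entry in the chain when it returns false.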

Fallback chains: graceful degradation without silent quality collapse

A fallback chain is an ordered list of (provider, model, policy) tuples tried until one succeeds or all are exhausted. Design chains with product intent, not only engineering convenience.

Typical chain shapes

Cost-optimized chain (internal tools, batch jobs):

Primary: fast/cheap model
Fallback: slightly larger model if JSON validation fails or confidence is low
Last resort: frontier model for the fraction of requests that still fail schema checks

Availability chain (customer-facing chat):

Primary: preferred vendor model
Fallback: same-capability model on alternate vendor (watch for subtle format differences in tool calls)
Degraded mode: smaller model with a user-visible disclaimer, or cached FAQ response for known intents

Compliance chain (regulated tenants):

Self-hosted or VPC model only
No fallback to public SaaS—fail closed with a support-oriented error

What to retry vs what to fall back

Not every failure should trigger the next model in the chain:

Signal	Retry same model?	Fall back?
HTTP 429 (rate limit)	Yes, with backoff + jitter	After bounded retries
HTTP 503 / timeout	Yes, limited	Yes
HTTP 400 (bad request)	No	No—fix the payload
Schema / validation failure on output	Maybe once with repair prompt	Yes, to a more capable model
Content policy filter	No	Maybe different model; often fail closed
Context length exceeded	No	Route to larger-context model if configured
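The table can be encoded as data so that the retry loop stays generic. This is a sketch: the signal names and retry counts are assumptions, and the "maybe" cells are resolved conservatively (one repair retry for validation failures, fail closed on content filters).

```typescript
type FailureSignal =
  | "rate_limit" // HTTP 429
  | "unavailable" // HTTP 503
  | "timeout"
  | "bad_request" // HTTP 400
  | "validation_fail"
  | "content_filter"
  | "context_length";

type FailurePolicy = { maxRetries: number; fallBack: boolean };

const FAILURE_POLICY: Record<FailureSignal, FailurePolicy> = {
  rate_limit: { maxRetries: 2, fallBack: true },
  unavailable: { maxRetries: 1, fallBack: true },
  timeout: { maxRetries: 1, fallBack: true },
  bad_request: { maxRetries: 0, fallBack: false },
  validation_fail: { maxRetries: 1, fallBack: true },
  content_filter: { maxRetries: 0, fallBack: false },
  context_length: { maxRetries: 0, fallBack: true },
};

// Decide what the gateway does after a failed attempt on the current model.
function nextStep(signal: FailureSignal, retriesSoFar: number): "retry" | "fallback" | "fail" {
  const policy = FAILURE_POLICY[signal];
  if (retriesSoFar < policy.maxRetries) return "retry";
  return policy.fallBack ? "fallback" : "fail";
}
```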

Retries on idempotent read-like completions are safer than on requests that triggered side effects via tools. If the first attempt partially streamed to the client, falling back mid-response is a UX problem—gateways that support streaming need per-attempt isolation (buffer internally or abort cleanly).

Avoiding duplicate spend

A failed primary attempt may still bill tokens (especially if the model produced output before a validation failure). Log attempt index, provider request id, and token usage per attempt so finance and debugging can reconcile "why did this one user message cost 3×."
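A sketch of what a per-attempt record and a reconciliation helper might look like. The field names are assumptions; the point is that failed attempts are stored with their token usage, not only the winner.

```typescript
type AttemptRecord = {
  requestId: string; // the gateway's own id for the user-level request
  attemptIndex: number;
  provider: string;
  model: string;
  providerRequestId?: string;
  promptTokens: number;
  completionTokens: number;
  outcome: "success" | "validation_fail" | "rate_limit" | "timeout" | "error";
};

type RequestCostSummary = {
  requestId: string;
  attempts: number;
  billedTokens: number;
  winningAttempt: number | null;
};

// Sum tokens over every attempt, including the ones that failed.
function summarizeAttempts(requestId: string, records: AttemptRecord[]): RequestCostSummary {
  const mine = records.filter((r) => r.requestId === requestId);
  const billedTokens = mine.reduce((sum, r) => sum + r.promptTokens + r.completionTokens, 0);
  const winner = mine.filter((r) => r.outcome === "success")[0];
  return {
    requestId,
    attempts: mine.length,
    billedTokens,
    winningAttempt: winner ? winner.attemptIndex : null,
  };
}
```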

Cost controls: budgets, metering, and admission

Model spend is variable and bursty. Treat it like any other metered resource.

Per-tenant and per-feature budgets

Maintain rolling counters:

Daily / monthly token budgets per tenant or API key
Per-request max output tokens enforced at the gateway (do not rely on the client)
Concurrency limits on streaming connections (each stream ties up provider quota and your server's memory)
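A minimal in-memory sketch of the first two items. It uses a fixed UTC calendar day rather than a true rolling window, never evicts old keys, and keeps state in process memory; a real deployment would presumably back it with a shared store. All names are hypothetical.

```typescript
class DailyTokenBudget {
  private readonly used: Record<string, number> = {};
  private readonly limitPerDay: number;

  constructor(limitPerDay: number) {
    this.limitPerDay = limitPerDay;
  }

  // Counters are keyed by tenant and UTC calendar day, so they reset at 00:00 UTC.
  private key(tenantId: string, now: Date): string {
    return `${tenantId}:${now.toISOString().slice(0, 10)}`;
  }

  remaining(tenantId: string, now: Date): number {
    return Math.max(0, this.limitPerDay - (this.used[this.key(tenantId, now)] ?? 0));
  }

  // Returns false and records nothing when the request would exceed the budget.
  tryConsume(tenantId: string, tokens: number, now: Date): boolean {
    if (tokens > this.remaining(tenantId, now)) return false;
    const k = this.key(tenantId, now);
    this.used[k] = (this.used[k] ?? 0) + tokens;
    return true;
  }
}

// The gateway, not the client, has the final say on output length.
function clampMaxOutputTokens(requested: number | undefined, profileMax: number): number {
  return Math.min(requested ?? profileMax, profileMax);
}
```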

When a budget is exhausted, return 429 or 402 with a structured problem detail—not a generic 500. Product teams can surface upgrade paths; ops can alert before hard cutoffs.
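A sketch of such a response body, following the `type` / `title` / `status` / `detail` members of the Problem Details format (`application/problem+json`). The `type` URI, the extension members, and the rule for choosing 402 over 429 are assumptions for illustration.

```typescript
type ProblemDetail = {
  type: string;
  title: string;
  status: number;
  detail: string;
  [extension: string]: unknown; // extension members are allowed by the format
};

function budgetExhaustedProblem(
  tenantId: string,
  resetsAt: Date,
  upgradeAvailable: boolean,
): ProblemDetail {
  return {
    type: "https://example.com/problems/llm-budget-exhausted", // placeholder URI
    title: "LLM token budget exhausted",
    // Assumed convention: 402 when a paid plan would lift the cap, 429 otherwise.
    status: upgradeAvailable ? 402 : 429,
    detail: `Token budget for this period is used up; it resets at ${resetsAt.toISOString()}.`,
    tenant_id: tenantId,
    resets_at: resetsAt.toISOString(),
  };
}
```

The handler would serialize this with the `application/problem+json` content type so clients can branch on `type` instead of parsing messages.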

Admission control before inference

Cheap checks belong before the provider call:

Token estimate over budget → reject immediately
Tenant suspended → reject
Feature flag off → reject or return static fallback

This pairs naturally with HTTP admission control patterns (concurrency queues, load shedding) when traffic spikes.
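The checks above can be sketched as one pure function that runs before any adapter is touched. The input shape and reason codes are hypothetical.

```typescript
type AdmissionInput = {
  tenantSuspended: boolean;
  featureEnabled: boolean;
  estimatedInputTokens: number;
  remainingTokens: number;
};

type AdmissionDecision =
  | { admit: true }
  | { admit: false; reason: "tenant_suspended" | "feature_disabled" | "over_budget" };

// Cheapest and most definitive checks first; no provider call happens here.
function admit(input: AdmissionInput): AdmissionDecision {
  if (input.tenantSuspended) return { admit: false, reason: "tenant_suspended" };
  if (!input.featureEnabled) return { admit: false, reason: "feature_disabled" };
  if (input.estimatedInputTokens > input.remainingTokens) {
    return { admit: false, reason: "over_budget" };
  }
  return { admit: true };
}
```

Because the function is pure, each rejection path is trivial to unit-test without a provider mock.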

Cost attribution tags

Attach dimensions to every gateway call for later aggregation:

tenant_id, feature (e.g. support_copilot, invoice_ocr)
routing_profile, model, provider
cache_hit if a semantic cache sits in front

Without tags, the monthly invoice is a single line item and engineering cannot optimize what it cannot see.
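As an illustration of what the tags buy you, here is a sketch that rolls usage events up by any one dimension. The event shape mirrors the tags listed above; `totalTokens` and the rule that cache hits consume no provider tokens are assumptions.

```typescript
type UsageEvent = {
  tenant_id: string;
  feature: string;
  routing_profile: string;
  model: string;
  provider: string;
  cache_hit: boolean;
  totalTokens: number;
};

type Dimension = "tenant_id" | "feature" | "routing_profile" | "model" | "provider";

// Total provider tokens grouped by one tag.
function tokensBy(events: UsageEvent[], dimension: Dimension): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const e of events) {
    if (e.cache_hit) continue; // assumption: a cache hit never reached a provider
    const key = e[dimension];
    totals[key] = (totals[key] ?? 0) + e.totalTokens;
  }
  return totals;
}
```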

Observability: traces, metrics, and quality signals

An inference gateway is the ideal instrumentation choke point.

Metrics worth exporting

Request volume by route, model, provider, outcome (success, validation_fail, budget_denied)
Latency per attempt and end-to-end (including fallback overhead)
Token usage (prompt, completion, total) per dimension
Fallback rate and retry count distributions
Estimated cost per request if you maintain a price table per model
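The last item can be as small as a lookup plus arithmetic. In this sketch the model names and prices are invented; real numbers come from each provider's published pricing and need to be kept current.

```typescript
// Prices in USD per million tokens. These values are made up for the example.
type Price = { inputPerMTok: number; outputPerMTok: number };

const PRICE_TABLE: Record<string, Price> = {
  "example-small": { inputPerMTok: 0.5, outputPerMTok: 1.5 },
  "example-frontier": { inputPerMTok: 5, outputPerMTok: 15 },
};

// Returns null for unknown models so a missing price is visible instead of counted as zero.
function estimateCostUsd(
  model: string,
  usage: { promptTokens: number; completionTokens: number },
): number | null {
  const price = PRICE_TABLE[model];
  if (price === undefined) return null;
  return (
    (usage.promptTokens / 1_000_000) * price.inputPerMTok +
    (usage.completionTokens / 1_000_000) * price.outputPerMTok
  );
}
```

Treat the result as an estimate for dashboards and alerts; the invoice remains the source of truth.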

Distributed tracing

Propagate W3C trace context into provider calls where supported. Span attributes should include llm.model, llm.attempt, llm.finish_reason, and llm.tokens.total. When a user reports a bad answer, you need to reconstruct the full chain across attempts.
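For orientation, a sketch of the two pieces involved: building a W3C `traceparent` header value and assembling the span attributes named above. A tracing SDK such as OpenTelemetry would normally handle propagation for you; the helper names and input shape here are hypothetical.

```typescript
// traceparent layout: version "00", 32 hex trace-id, 16 hex parent-id, 2 hex flags.
function formatTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  if (!/^[0-9a-f]{32}$/.test(traceId) || /^0+$/.test(traceId)) {
    throw new Error("trace id must be 32 lowercase hex chars and not all zeros");
  }
  if (!/^[0-9a-f]{16}$/.test(spanId) || /^0+$/.test(spanId)) {
    throw new Error("span id must be 16 lowercase hex chars and not all zeros");
  }
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

type AttemptInfo = {
  model: string;
  attempt: number;
  finishReason: string;
  promptTokens: number;
  completionTokens: number;
};

// Attribute keys follow the names used in this article, not an external convention.
function attemptSpanAttributes(a: AttemptInfo): Record<string, string | number> {
  return {
    "llm.model": a.model,
    "llm.attempt": a.attempt,
    "llm.finish_reason": a.finishReason,
    "llm.tokens.total": a.promptTokens + a.completionTokens,
  };
}
```

One span per attempt, all children of the request span, makes a fallback chain visible as a sequence in the trace view.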

Quality hooks (not just uptime)

Uptime metrics miss soft failures: a valid HTTP 200 with wrong JSON or a hallucinated policy. Sample completions for offline LLM-as-judge or rule-based checks, and feed the results back into routing (e.g. demote a model whose sampled quality scores drop, or promote a cheaper one whose scores hold up). This closes the loop between gateway telemetry and product quality.
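A sketch of the rule-based half of that loop. The `Reply` shape matches the schema used in the gateway example later in this article; the individual rules are examples only, and a routing job would decide what to do with the resulting pass rates.

```typescript
type Reply = { answer: string; confidence: "low" | "medium" | "high"; escalate: boolean };

// Checks that catch "HTTP 200 but wrong" replies. An empty list means the reply passed.
function ruleFindings(reply: Reply): string[] {
  const findings: string[] = [];
  if (reply.answer.trim().length === 0) findings.push("empty_answer");
  if (reply.confidence === "high" && reply.escalate) findings.push("high_confidence_but_escalated");
  if (/as an ai language model/i.test(reply.answer)) findings.push("boilerplate_refusal");
  return findings;
}

// Pass rate per model over a batch of sampled completions.
function passRateByModel(samples: Array<{ model: string; reply: Reply }>): Record<string, number> {
  const seen: Record<string, { pass: number; total: number }> = {};
  for (const s of samples) {
    const entry = seen[s.model] ?? { pass: 0, total: 0 };
    entry.total += 1;
    if (ruleFindings(s.reply).length === 0) entry.pass += 1;
    seen[s.model] = entry;
  }
  const rates: Record<string, number> = {};
  for (const model of Object.keys(seen)) {
    const e = seen[model]!;
    rates[model] = e.pass / e.total;
  }
  return rates;
}
```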

Configuration and rollout: treat routing as deployable policy

Hardcoding chains in TypeScript works until product changes tiers weekly. Prefer versioned routing profiles stored in configuration (database, feature flag service, or structured YAML checked into git):

# routing-profiles.yaml (illustrative)
profiles:
  support_chat_standard:
    primary:
      provider: openai
      model: gpt-4.1-mini
      max_output_tokens: 1024
    fallbacks:
      - provider: anthropic
        model: claude-sonnet-4-20250514
        max_output_tokens: 1024
    budget:
      max_input_tokens: 12000
    validation:
      response_schema: support_reply_v2

  support_chat_premium:
    primary:
      provider: openai
      model: gpt-4.1
      max_output_tokens: 2048
    fallbacks:
      - provider: openai
        model: gpt-4.1-mini
        on: [rate_limit, timeout]

Roll out profile changes with feature flags or percentage traffic splits—the same discipline as server-side feature flags elsewhere in the stack. Keep old profiles addressable for replay debugging when a conversation from last Tuesday looks wrong.
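A percentage split can be made sticky per tenant by hashing a stable key into a bucket. The sketch below uses a 32-bit FNV-1a hash; the `Rollout` shape and profile version names are hypothetical.

```typescript
// 32-bit FNV-1a over UTF-16 code units; good enough for bucketing, not for security.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

type Rollout = { stable: string; candidate: string; candidatePercent: number };

// The same tenant always lands in the same bucket (0..99), so its profile does not
// flip between requests while the percentage stays fixed.
function chooseProfileVersion(tenantId: string, rollout: Rollout): string {
  const bucket = fnv1a32(tenantId) % 100;
  return bucket < rollout.candidatePercent ? rollout.candidate : rollout.stable;
}
```

Logging the chosen version on every request keeps old conversations replayable against the profile that actually served them.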

Provider abstraction: thin adapters, thick policy

Resist the urge to normalize every provider quirk into one mega-interface on day one. A practical split:

Thin adapter per provider: auth, HTTP/SSE transport, request/response mapping, streaming iterators
Thick gateway: routing, retries, budgets, validation, logging

Adapters implement a small interface:

type CompletionRequest = {
  messages: Message[];
  maxOutputTokens: number;
  temperature?: number;
  responseSchema?: JsonSchema;
  tools?: ToolDefinition[];
};

type CompletionResult = {
  text: string;
  usage: { promptTokens: number; completionTokens: number };
  providerRequestId?: string;
  finishReason: "stop" | "length" | "tool_calls" | "content_filter" | "error";
};

interface LlmProviderAdapter {
  readonly name: string;
  complete(req: CompletionRequest, signal: AbortSignal): Promise<CompletionResult>;
  stream?(req: CompletionRequest, signal: AbortSignal): AsyncIterable<StreamChunk>;
}

Provider-specific tool-call formats stay inside adapters; the gateway works with your domain validation layer (Zod, JSON Schema) after the adapter returns.

On teams that skipped this split, the recurring pain is testing: without adapters, unit-testing routing policy means mocking fetch differently for every vendor.
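To make the testing benefit concrete, here is a sketch of a scripted fake adapter and a tiny policy function exercised against it. The types are trimmed copies of the interface above; `fakeAdapter` and `firstSuccess` are hypothetical helpers, not part of any library.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };
type CompletionRequest = { messages: Message[]; maxOutputTokens: number };
type CompletionResult = {
  text: string;
  usage: { promptTokens: number; completionTokens: number };
  finishReason: "stop" | "length" | "error";
};

interface LlmProviderAdapter {
  readonly name: string;
  complete(req: CompletionRequest, signal: AbortSignal): Promise<CompletionResult>;
}

// Each call consumes the next scripted step; the last step repeats once the script runs out.
function fakeAdapter(
  name: string,
  script: Array<CompletionResult | Error>,
): LlmProviderAdapter & { calls: CompletionRequest[] } {
  const calls: CompletionRequest[] = [];
  let next = 0;
  return {
    name,
    calls,
    async complete(req: CompletionRequest): Promise<CompletionResult> {
      calls.push(req);
      const step = script[Math.min(next++, script.length - 1)];
      if (step === undefined) throw new Error("fake adapter has an empty script");
      if (step instanceof Error) throw step;
      return step;
    },
  };
}

// Policy under test: try adapters in order and return the first success.
async function firstSuccess(
  adapters: LlmProviderAdapter[],
  req: CompletionRequest,
  signal: AbortSignal,
): Promise<{ result: CompletionResult; servedBy: string }> {
  let lastError: unknown = new Error("no adapters configured");
  for (const adapter of adapters) {
    try {
      return { result: await adapter.complete(req, signal), servedBy: adapter.name };
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

No HTTP mocking is involved: the policy only sees the adapter interface, so the same test works regardless of which vendors sit behind it.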

Practical example: a minimal gateway module

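The following TypeScript sketch wires routing profile, budget check, fallback chain, and output validation. It reuses the LlmProviderAdapter interface from the previous section and is intentionally compact: the Message and CompletionMeta types, the error classes (BudgetError, ValidationError, GatewayExhaustedError), replyJsonSchema, and the helpers recordUsage, isRateLimit, isTimeout, and backoff are assumed to be defined elsewhere. Production code would add streaming, circuit breakers, and persistent metering.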

import { z } from "zod";

const ReplySchema = z.object({
  answer: z.string(),
  confidence: z.enum(["low", "medium", "high"]),
  escalate: z.boolean(),
});

type RoutingProfile = {
  name: string;
  attempts: Array<{
    provider: LlmProviderAdapter;
    model: string;
    maxOutputTokens: number;
    retryOn?: Array<"rate_limit" | "timeout" | "validation_fail">;
  }>;
  maxInputTokens: number;
};

type GatewayContext = {
  tenantId: string;
  feature: string;
  profile: RoutingProfile;
  budget: { remainingTokens: number };
  signal: AbortSignal;
};

function estimateInputTokens(messages: Message[]): number {
  // Replace with tiktoken or provider tokenizer in production
  return messages.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0);
}

export async function gatewayComplete(
  ctx: GatewayContext,
  messages: Message[],
): Promise<{ data: z.infer<typeof ReplySchema>; meta: CompletionMeta }> {
  const inputTokens = estimateInputTokens(messages);
  if (inputTokens > ctx.profile.maxInputTokens) {
    throw new BudgetError("input_too_large", { inputTokens });
  }
  if (ctx.budget.remainingTokens <= 0) {
    throw new BudgetError("tenant_budget_exhausted");
  }

  let lastError: unknown;

  for (let i = 0; i < ctx.profile.attempts.length; i++) {
    const attempt = ctx.profile.attempts[i]!;
    const maxRetries = 2;

    for (let retry = 0; retry <= maxRetries; retry++) {
      try {
        const raw = await attempt.provider.complete(
          {
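            messages,
            // The adapter needs the model id; this assumes CompletionRequest
            // is extended with a `model: string` field.
            model: attempt.model,
            maxOutputTokens: attempt.maxOutputTokens,
            responseSchema: replyJsonSchema,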
          },
          ctx.signal,
        );

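        // JSON.parse throws SyntaxError on malformed output. Map that case to a
        // validation failure so it is retried and can fall back, instead of
        // ending the whole chain as a non-retriable error.
        let candidate: unknown;
        try {
          candidate = JSON.parse(raw.text);
        } catch {
          candidate = undefined; // fails the schema check below
        }
        const parsed = ReplySchema.safeParse(candidate);
        if (!parsed.success) {
          throw new ValidationError(parsed.error);
        }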

        recordUsage({
          tenantId: ctx.tenantId,
          feature: ctx.feature,
          profile: ctx.profile.name,
          provider: attempt.provider.name,
          model: attempt.model,
          attemptIndex: i,
          retryIndex: retry,
          usage: raw.usage,
        });

        return {
          data: parsed.data,
          meta: {
            provider: attempt.provider.name,
            model: attempt.model,
            attemptIndex: i,
          },
        };
      } catch (err) {
        lastError = err;
        const retriable =
          isRateLimit(err) || isTimeout(err) || err instanceof ValidationError;
        if (!retriable || retry === maxRetries) break;
        await backoff(retry, ctx.signal);
      }
    }

    // Move to next model in fallback chain
    if (!(lastError instanceof ValidationError) && !isRateLimit(lastError) && !isTimeout(lastError)) {
      break; // Non-retriable at gateway level
    }
  }

  throw new GatewayExhaustedError("all_attempts_failed", { cause: lastError });
}

Key properties this example encodes:

Budget and size gates run before any provider call.
Validation failure can trigger retry, then fallback—product choice whether validation errors should escalate models.
Structured metering fires on success with attempt metadata.
Domain code receives typed data, not raw provider strings.
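The sketch calls backoff(retry, ctx.signal) without defining it. One possible shape, assuming exponential growth with "full jitter" (a uniformly random delay between zero and the capped ceiling) and cancellation through the same AbortSignal; the base and cap values are arbitrary.

```typescript
// Delay drawn uniformly from [0, min(capMs, baseMs * 2^retry)).
// `rand` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  retry: number,
  baseMs = 200,
  capMs = 5_000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** retry);
  return Math.floor(rand() * ceiling);
}

// Resolves after the delay, or rejects early if the request is aborted.
function backoff(retry: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, backoffDelayMs(retry));
    signal.addEventListener("abort", onAbort, { once: true });
  });
}
```

Jitter spreads retries from many concurrent requests so they do not hit a rate-limited provider in lockstep.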

In freelance and consulting engagements, the teams that get the most mileage from a gateway like this are the ones that define routing profiles per product surface early—even if v1 only has one model—because adding a second model later does not require touching every handler.

Common mistakes and pitfalls

Fallback to a weaker model without UX or quality guardrails. Users notice when answers suddenly get worse; either surface the degradation as an explicit "simplified mode" or reserve fallbacks for non-user-visible tasks.
Treating all providers as drop-in replacements. Tool-call JSON shapes, streaming event formats, and content-filter behavior differ. Adapters need contract tests, not assumptions.
Unbounded fallback chains. Each extra attempt adds latency and cost. Cap chain depth and total wall-clock time with a request deadline propagated from the edge.
No per-attempt billing visibility. Finance and debugging both suffer when logs only show the winning attempt.
Routing logic in controllers. Policy belongs in the gateway module or configuration; HTTP handlers should pass tenantId, feature, and messages.
Ignoring streaming semantics. Failing over mid-stream is harder than buffered completion; design streaming gateways to commit to an attempt or reset the client session explicitly.
Hardcoded API keys per route. Centralize credentials in the gateway layer with rotation support; adapters receive clients injected at startup.
Skipping shadow traffic when changing profiles. Run a percentage of requests through a candidate profile, compare quality metrics offline, then promote—same mindset as feature-flag rollouts.
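For the unbounded-chain pitfall in particular, the cap on depth and wall-clock time can be one small function consulted before every attempt. This is a sketch; the parameter names and default values are assumptions.

```typescript
type AttemptPlan =
  | { proceed: true; timeoutMs: number }
  | { proceed: false; reason: "max_attempts" | "deadline" };

// deadlineAtMs is an absolute timestamp propagated from the edge of the request.
function planAttempt(
  attemptsSoFar: number,
  maxAttempts: number,
  deadlineAtMs: number,
  nowMs: number,
  perAttemptCapMs = 10_000,
  minUsefulMs = 500,
): AttemptPlan {
  if (attemptsSoFar >= maxAttempts) return { proceed: false, reason: "max_attempts" };
  const remainingMs = deadlineAtMs - nowMs;
  // Starting an attempt that cannot finish in time only adds cost.
  if (remainingMs < minUsefulMs) return { proceed: false, reason: "deadline" };
  return { proceed: true, timeoutMs: Math.min(remainingMs, perAttemptCapMs) };
}
```

The returned timeout can feed an AbortController per attempt, so a slow primary cannot consume the whole request budget.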

Conclusion

As LLM features move from prototype to product, inference gateways become as important as API gateways were for REST microservices: they concentrate routing, resilience, cost control, and observability where provider details would otherwise leak across the codebase. Define routing profiles with explicit fallback chains, enforce budgets and admission before expensive calls, instrument every attempt, and keep adapters thin so policy remains testable.

The goal is not indirection for its own sake—it is operational leverage: ship new models, survive provider outages, and explain spend per tenant without rewriting every feature. If you are designing or hardening LLM-backed APIs and want routing, metering, and resilience designed as a coherent layer, see About for background on how this work is approached in practice; for inquiries, Contact.
