LLM model routing and fallback chains: latency budgets, quality tiers, and controlled degradation

Route LLM traffic across primary and fallback models with explicit latency budgets, cost tiers, and circuit breakers—so outages and tail latency degrade gracefully instead of taking down your API.

Author: Matheus Palma · June 6, 2026 · 11 min read

Software engineering, Artificial intelligence, Backend, API design, TypeScript, Resilience

Your primary model provider returns 503 for twenty minutes during a regional incident. Your API has no fallback: every chat request fails, support tickets spike, and the product team asks whether you should “just add another vendor.” You add a second provider over a weekend, wire it as a catch-all backup, and discover a new problem—fallback traffic is three times slower and twice as expensive, but users cannot tell which model answered, so quality regressions hide until someone files a bug with a screenshot of a confidently wrong refund policy.

Model routing is how you decide which model answers a request before you spend tokens. Fallback chains are what happens when that choice fails: timeout, rate limit, schema violation, or provider outage. Together they are the difference between an LLM feature that survives production and one that treats a single endpoint like an infinite, always-correct function. The patterns show up in every serious LLM-backed API I have helped teams harden: the goal is not vendor diversity for its own sake, but predictable behavior under dependency failure.

What model routing is (and what it is not)

Model routing selects a (provider, model, parameters) tuple for each request based on policy, not on whatever string happened to be hard-coded in the route handler.

It is:

A product and reliability control—you choose speed vs. quality vs. cost per route, tenant, or feature flag.
A dependency isolation layer—when the premium model is unhealthy, traffic shifts without redeploying application code.
An observability surface—every decision should emit model_selected, fallback_reason, and latency_ms so you can audit outcomes.

It is not:

Prompt routing alone (classifying intent to pick a tool). That can inform model choice, but routing is about which inference endpoint serves the work.
A replacement for caching, RAG, or admission control. Those layers sit adjacent; routing decides who generates after you have already admitted the request.
Automatic quality equivalence. Smaller models are cheaper and faster; they are not interchangeable without measuring task-specific regression.

Designing a routing policy: dimensions that must be explicit

Before writing fallback logic, write down the axes your router can see. Ambiguity here produces subtle bugs—especially in multi-tenant SaaS.

Route and task type

Different endpoints have different SLOs. A real-time copilot may cap wait time at 2.5 seconds; a nightly batch summarizer can tolerate minutes. Encode taskClass (interactive, batch, extraction) in the routing context so policies do not leak across routes.

Tenant tier and budget

Enterprise tenants paying for premium support may always get the primary model. Free-tier traffic might default to a smaller model with token ceilings. Budget exhaustion should fail visibly (HTTP 429 with a clear code) rather than silently downgrading to a model that hallucinates legal text.

Latency budget

Define a remaining deadline per request—derived from the client timeout minus gateway overhead—and pass it to the router. A fallback that would exceed the budget is worse than a fast error with a retry hint.
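
A minimal sketch of that derivation, assuming the gateway knows the client timeout and reserves a fixed overhead. The helper names and the 150 ms overhead are illustrative assumptions, not values from any particular gateway.

```typescript
// Hypothetical helper: the time the router may spend, derived from the
// client's timeout minus a reserve for gateway and serialization overhead.
function deriveDeadlineMs(clientTimeoutMs: number, gatewayOverheadMs: number): number {
  return Math.max(0, clientTimeoutMs - gatewayOverheadMs);
}

// Time left on the request clock; nowMs is injected so the function is easy to test.
function remainingBudgetMs(deadlineMs: number, startedAtMs: number, nowMs: number): number {
  return Math.max(0, deadlineMs - (nowMs - startedAtMs));
}

// A fallback is only worth starting if its expected latency fits in what is left.
function canAttempt(expectedLatencyMs: number, budgetLeftMs: number): boolean {
  return expectedLatencyMs <= budgetLeftMs;
}

const routerDeadlineMs = deriveDeadlineMs(2500, 150); // 2350
const budgetLeftMs = remainingBudgetMs(routerDeadlineMs, 1000, 2900); // 2350 - 1900 = 450
const shouldTryFallback = canAttempt(600, budgetLeftMs); // false: fail fast with a retry hint
```

Injecting the clock keeps the budget arithmetic deterministic in tests, which matters because off-by-one-timeout bugs here only show up under load.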

Output contract strictness

Structured extraction (JSON schema, tool calls) needs models and decoding settings that reliably emit parseable output. A fallback model that frequently breaks schema is not a fallback; it is a new failure mode. Tag routes as strictSchema: true and restrict fallbacks to models that pass offline eval on your golden set.

Data residency and compliance

Some tenants require inference in specific regions or prohibit certain subprocessors. Routing constraints are hard filters, not soft preferences—never fall back across a compliance boundary.
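
One way to express hard filters is to prune the chain before any ordering or fallback logic runs. The sketch below is an assumption-laden illustration: the `PolicyCandidate` and `Constraints` shapes, the `passedSchemaEval` flag, and the regions attached to each model are all hypothetical.

```typescript
type Region = "eu" | "us";

type PolicyCandidate = {
  provider: string;
  model: string;
  region: Region; // hypothetical deployment region for this candidate
  passedSchemaEval: boolean; // assumed to come from an offline golden-set run
};

type Constraints = {
  allowedRegions: Region[];
  blockedProviders: string[];
  strictSchema: boolean;
};

// Hard filters: a candidate that violates a constraint is removed from the
// chain entirely, never merely ranked lower.
function eligibleCandidates(chain: PolicyCandidate[], c: Constraints): PolicyCandidate[] {
  return chain.filter(
    (cand) =>
      c.allowedRegions.includes(cand.region) &&
      !c.blockedProviders.includes(cand.provider) &&
      (!c.strictSchema || cand.passedSchemaEval),
  );
}

const exampleChain: PolicyCandidate[] = [
  { provider: "openai", model: "gpt-4.1", region: "us", passedSchemaEval: true },
  { provider: "openai", model: "gpt-4.1-mini", region: "eu", passedSchemaEval: false },
  { provider: "anthropic", model: "claude-sonnet-4", region: "eu", passedSchemaEval: true },
];

// An EU-only tenant on a strict-schema route keeps a single candidate.
const euStrictChain = eligibleCandidates(exampleChain, {
  allowedRegions: ["eu"],
  blockedProviders: [],
  strictSchema: true,
});
```

If the filtered chain is empty, the request should fail with a clear error rather than widen the constraints.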

Fallback chains: ordering, triggers, and stop conditions

A fallback chain is an ordered list of candidates, each with entry conditions and exit triggers. Think of it like DNS resolution: try the preferred answer, but stop when you have a good enough result or you run out of time.

Common fallback triggers

| Trigger | Example | Typical response |
| --- | --- | --- |
| Hard HTTP error | `502`, `503`, `529` | Try next provider/model |
| Rate limit | `429` with `Retry-After` | Next candidate or bounded wait |
| Timeout | No first token within budget | Next candidate |
| Soft failure | Empty completion, schema validation error | Retry once on same model, then next candidate |
| Circuit open | Breaker on provider A | Skip A entirely for cooldown window |
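
The "bounded wait" option for rate limits can be sketched as a small decision function. This is an illustration under assumptions: `onRateLimit`, `maxWaitMs`, and the rule that a wait must still leave room for the attempt itself are choices made here, not behavior of any specific provider SDK. `Retry-After` may carry either a number of seconds or an HTTP date, so both are handled.

```typescript
type RateLimitAction = { kind: "wait"; waitMs: number } | { kind: "next_candidate" };

// Parses a Retry-After value, which may be delta-seconds or an HTTP-date.
function parseRetryAfterMs(value: string | null, nowMs: number): number | null {
  if (value === null) return null;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Wait on the same candidate only if the wait is short and still leaves room
// for the attempt itself; otherwise move down the chain.
function onRateLimit(
  retryAfter: string | null,
  budgetLeftMs: number,
  attemptBudgetMs: number,
  maxWaitMs: number,
  nowMs: number,
): RateLimitAction {
  const waitMs = parseRetryAfterMs(retryAfter, nowMs);
  if (waitMs !== null && waitMs <= maxWaitMs && waitMs + attemptBudgetMs <= budgetLeftMs) {
    return { kind: "wait", waitMs };
  }
  return { kind: "next_candidate" };
}
```

A missing or unparseable header falls through to the next candidate, which keeps the router from guessing at a wait time.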

Chain shape: degrade quality before crossing vendors

Two patterns work well in production:

Quality degradation (same vendor).
gpt-4.1 → gpt-4.1-mini → gpt-4.1-nano
Keeps SDK integration, auth, and logging uniform. Best when the outage is model-specific, not platform-wide.

Provider failover (cross vendor).
anthropic/claude-sonnet → openai/gpt-4.1 → self-hosted/llama
Higher integration cost, but survives regional or vendor incidents. Requires normalized internal contracts so the rest of your app never sees provider-specific response shapes.

Avoid chains longer than three hops for interactive traffic. Each hop consumes deadline budget and multiplies cost attribution complexity.

When not to fallback

Some failures should surface to the user instead of silently switching models:

Prompt injection or policy block — retrying on another model may bypass safety filters inconsistently.
Context length exceeded — falling back to a smaller context window truncates silently; return 413 or ask the client to summarize.
Authentication or billing errors — not transient; fix credentials, do not rotate models.
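
The triggers and stop conditions above can be collapsed into one classifier that the router consults after each failed attempt. The sketch below is illustrative: the `AttemptFailure` variants and the choice of 401, 402, and 403 as the auth and billing statuses are assumptions, and real providers signal these conditions in their own ways.

```typescript
type FailureAction = "next_candidate" | "retry_same_then_next" | "surface_to_caller";

// Hypothetical normalized failure shape produced by the provider adapters.
type AttemptFailure =
  | { kind: "http"; status: number }
  | { kind: "timeout" }
  | { kind: "schema_invalid" }
  | { kind: "empty_completion" }
  | { kind: "policy_block" }
  | { kind: "context_length_exceeded" };

function classifyFailure(f: AttemptFailure): FailureAction {
  switch (f.kind) {
    case "timeout":
      return "next_candidate";
    case "schema_invalid":
    case "empty_completion":
      // Soft failures: one more try on the same model, then move on.
      return "retry_same_then_next";
    case "policy_block":
    case "context_length_exceeded":
      // Another model would not fix these; it would only hide them.
      return "surface_to_caller";
    case "http":
      // Auth and billing problems are not transient (assumed status codes).
      if (f.status === 401 || f.status === 402 || f.status === 403) return "surface_to_caller";
      if (f.status === 429 || f.status >= 500) return "next_candidate";
      // Any other 4xx is treated as a caller error in this sketch.
      return "surface_to_caller";
  }
}
```

Keeping the mapping in one function makes the "never fall back on this" cases reviewable in a single place.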

Latency budgets and “time-boxed” inference

Tail latency is the main reason users perceive LLM features as broken. Routing should treat time as a first-class input.

First-token deadline vs. total deadline

For streaming routes, measure two clocks:

Time to first token (TTFT) — if the primary model has not started streaming within, say, 800 ms, consider a faster fallback before the user sees a frozen UI.
Total completion deadline — hard cap on wall time; cancel upstream generators when exceeded.

Pass AbortSignal through your stack so a late fallback does not leave orphan streams billing tokens in the background.
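
The two clocks can be enforced around any chunked stream. The following is a sketch under assumptions: `StreamFn` stands in for a provider client that yields text chunks and honors `AbortSignal`, and `collectWithDeadlines` and `DeadlineError` are names invented for this example.

```typescript
// Assumed shape: a provider client that streams text chunks and honors AbortSignal.
type StreamFn = (signal: AbortSignal) => AsyncIterable<string>;

class DeadlineError extends Error {
  readonly clock: "ttft" | "total";
  constructor(clock: "ttft" | "total") {
    super(`${clock}_deadline_exceeded`);
    this.clock = clock;
  }
}

const TIMED_OUT = Symbol("timed_out");

// Resolves with TIMED_OUT if p has not settled within ms.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T | typeof TIMED_OUT> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve(TIMED_OUT), Math.max(0, ms));
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function collectWithDeadlines(stream: StreamFn, ttftMs: number, totalMs: number): Promise<string> {
  const controller = new AbortController();
  const startedAt = Date.now();
  const iterator = stream(controller.signal)[Symbol.asyncIterator]();
  let text = "";
  let sawFirstToken = false;
  try {
    while (true) {
      const elapsed = Date.now() - startedAt;
      // Before the first token the tighter clock applies; afterwards only the total one.
      const clock: "ttft" | "total" = !sawFirstToken && ttftMs < totalMs ? "ttft" : "total";
      const limitMs = (clock === "ttft" ? ttftMs : totalMs) - elapsed;
      const next = await withTimeout(iterator.next(), limitMs);
      if (next === TIMED_OUT) throw new DeadlineError(clock);
      if (next.done) return text;
      sawFirstToken = true;
      text += next.value;
    }
  } finally {
    // Cancel upstream on every exit path so no orphan stream keeps generating tokens.
    controller.abort();
  }
}
```

A caller can catch `DeadlineError`, read `clock`, and log `ttft` and `total` misses as separate fallback reasons, since they usually point at different problems.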

Hedged requests (use sparingly)

Hedging fires a second request to an alternate model if the first has not responded by a threshold, then takes whichever completes first and cancels the loser. It cuts tail latency but always increases cost on slow paths and can amplify rate-limit pressure during incidents—when everyone hedges, everyone loses. If you hedge, do it only on idempotent, read-like routes with strict spend caps and feature-flag kill switches.
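
A minimal hedging sketch, assuming both calls accept an `AbortSignal`. `hedgedCall` and `ModelCall` are names invented here, and spend caps and kill switches are deliberately left out. It uses `Promise.race`, so a primary failure rejects immediately and is left to the fallback chain rather than to the hedge.

```typescript
type ModelCall<T> = (signal: AbortSignal) => Promise<T>;
type HedgeResult<T> = { value: T; winner: "primary" | "backup"; hedgeFired: boolean };

async function hedgedCall<T>(
  primary: ModelCall<T>,
  backup: ModelCall<T>,
  hedgeAfterMs: number,
): Promise<HedgeResult<T>> {
  const primaryCtl = new AbortController();
  const backupCtl = new AbortController();
  let hedgeFired = false;
  let hedgeTimer: ReturnType<typeof setTimeout> | undefined;

  const primaryRun = primary(primaryCtl.signal).then((value) => ({ value, winner: "primary" as const }));

  // The backup request is only sent if the primary is still pending at the threshold.
  const backupRun = new Promise<{ value: T; winner: "backup" }>((resolve, reject) => {
    hedgeTimer = setTimeout(() => {
      hedgeFired = true;
      backup(backupCtl.signal).then((value) => resolve({ value, winner: "backup" }), reject);
    }, hedgeAfterMs);
  });

  try {
    const first = await Promise.race([primaryRun, backupRun]);
    return { value: first.value, winner: first.winner, hedgeFired };
  } finally {
    // Stop the pending hedge and cancel whichever call lost the race.
    clearTimeout(hedgeTimer);
    primaryCtl.abort();
    backupCtl.abort();
  }
}
```

Emitting `hedgeFired` with each result lets you measure how often the slow path is actually paid for before widening the rollout.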

Integrating circuit breakers and health signals

Model routing composes naturally with circuit breakers (see also circuit breakers, bulkheads, and timeouts): track failure rate and latency per (provider, model) tuple.

Practical signals:

Error rate over a sliding window (5xx, timeouts, schema hard-fails).
Saturation — sustained 429 responses.
p95 latency vs. historical baseline.

When a breaker opens, the router skips that candidate entirely and logs breaker_open as the selection reason. Half-open probes should use synthetic canary prompts on a background schedule, not production user traffic—users should not pay for your health checks.

In consulting engagements, the recurring mistake is a global breaker on “LLM” as one blob. Vendor A can be healthy while model X is not; granular breakers prevent unnecessary full failover.
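
A granular breaker can be as small as a sliding window keyed by provider and model. This sketch is illustrative only: the `ModelBreaker` class, its thresholds, and the `probeSucceeded` hook for background canaries are assumptions, and a production breaker would also need shared state across instances.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class ModelBreaker {
  private outcomes: { atMs: number; ok: boolean }[] = [];
  private openedAtMs: number | null = null;
  private readonly windowMs: number;
  private readonly minSamples: number;
  private readonly failureThreshold: number; // e.g. 0.5 means 50% failures
  private readonly cooldownMs: number;

  constructor(windowMs: number, minSamples: number, failureThreshold: number, cooldownMs: number) {
    this.windowMs = windowMs;
    this.minSamples = minSamples;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Records one attempt and opens the breaker if the windowed failure rate is too high.
  record(ok: boolean, nowMs: number): void {
    this.outcomes.push({ atMs: nowMs, ok });
    this.outcomes = this.outcomes.filter((o) => nowMs - o.atMs <= this.windowMs);
    const failures = this.outcomes.filter((o) => !o.ok).length;
    const enough = this.outcomes.length >= this.minSamples;
    if (enough && failures / this.outcomes.length >= this.failureThreshold) {
      this.openedAtMs = nowMs;
    }
  }

  // Called by a background canary probe, not by user traffic.
  probeSucceeded(): void {
    this.openedAtMs = null;
    this.outcomes = [];
  }

  state(nowMs: number): BreakerState {
    if (this.openedAtMs === null) return "closed";
    return nowMs - this.openedAtMs >= this.cooldownMs ? "half_open" : "open";
  }
}

// One breaker per (provider, model) tuple, never one for "the LLM" as a whole.
const breakers = new Map<string, ModelBreaker>();

function breakerFor(provider: string, model: string): ModelBreaker {
  const key = `${provider}/${model}`;
  let breaker = breakers.get(key);
  if (!breaker) {
    breaker = new ModelBreaker(60_000, 5, 0.5, 30_000); // placeholder thresholds
    breakers.set(key, breaker);
  }
  return breaker;
}
```

With this keying, five consecutive failures on one model open only that model's breaker, and a sibling model from the same vendor keeps serving traffic.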

Cost tiers and observability: make degradation auditable

Every routing decision should emit structured fields:

{
  "route": "support.copilot",
  "selected": "openai/gpt-4.1-mini",
  "chain_position": 2,
  "reason": "primary_timeout",
  "primary_candidate": "openai/gpt-4.1",
  "latency_ms": 1240,
  "input_tokens": 1820,
  "output_tokens": 340,
  "tenant_id": "t_abc",
  "task_class": "interactive"
}

Chart the fallback rate by route and tenant, and alert when it crosses a baseline; a rising rate is often the first sign of provider degradation or a too-tight timeout. Correlate it with human feedback (thumbs-down, escalation to a human agent) to detect quality cliffs that latency metrics miss.

Cost attribution belongs in the same event: finance will ask whether failover doubled spend during an incident; you should answer without a week of log archaeology.
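
As a rough illustration of what those events make possible, the sketch below folds a batch of them into a fallback rate and an estimated spend per route. It assumes `chain_position` is 1 for the primary, as in the event above. The `Price` table is a placeholder: its numbers are invented for the example and are not real provider prices.

```typescript
type DecisionEvent = {
  route: string;
  selected: string;
  chain_position: number; // 1 = primary, as in the event example
  input_tokens: number;
  output_tokens: number;
};

// USD per million tokens. Placeholder values only, not real provider prices.
type Price = { inputPerMTok: number; outputPerMTok: number };

type RouteSummary = { total: number; fallbacks: number; costUsd: number };

function summarizeByRoute(events: DecisionEvent[], prices: Record<string, Price>): Map<string, RouteSummary> {
  const byRoute = new Map<string, RouteSummary>();
  for (const e of events) {
    const summary = byRoute.get(e.route) ?? { total: 0, fallbacks: 0, costUsd: 0 };
    summary.total += 1;
    if (e.chain_position > 1) summary.fallbacks += 1;
    const price = prices[e.selected];
    if (price) {
      summary.costUsd += (e.input_tokens * price.inputPerMTok + e.output_tokens * price.outputPerMTok) / 1_000_000;
    }
    byRoute.set(e.route, summary);
  }
  return byRoute;
}

function fallbackRate(summary: RouteSummary): number {
  return summary.total === 0 ? 0 : summary.fallbacks / summary.total;
}
```

Running the same aggregation per tenant, by adding `tenant_id` to the grouping key, answers the finance question about incident spend directly from the event stream.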

Practical example: policy-driven router with deadlines and schema guard

The following TypeScript sketch shows a declarative chain, per-candidate timeouts, schema validation, and circuit-breaker skipping. It is illustrative—adapt storage, breakers, and provider clients to your stack.

import { z } from "zod";

type TaskClass = "interactive" | "batch" | "extraction";

type RoutingContext = {
  route: string;
  taskClass: TaskClass;
  tenantId: string;
  tier: "free" | "pro" | "enterprise";
  strictSchema: boolean;
  deadlineMs: number;
  startedAt: number;
};

type Candidate = {
  id: string;
  provider: string;
  model: string;
  maxTokens: number;
  ttftBudgetMs: number;
};

type CompletionResult = {
  text: string;
  inputTokens: number;
  outputTokens: number;
};

type BreakerState = "closed" | "open" | "half_open";

const SupportAnswerSchema = z.object({
  answer: z.string(),
  citations: z.array(z.string()).optional(),
});

const CHAINS: Record<string, Candidate[]> = {
  "support.copilot": [
    { id: "primary", provider: "openai", model: "gpt-4.1", maxTokens: 800, ttftBudgetMs: 900 },
    { id: "fast", provider: "openai", model: "gpt-4.1-mini", maxTokens: 800, ttftBudgetMs: 600 },
    { id: "failover", provider: "anthropic", model: "claude-sonnet-4", maxTokens: 800, ttftBudgetMs: 900 },
  ],
  "docs.extraction": [
    { id: "primary", provider: "openai", model: "gpt-4.1", maxTokens: 1200, ttftBudgetMs: 2000 },
    { id: "strict-fallback", provider: "openai", model: "gpt-4.1-mini", maxTokens: 1200, ttftBudgetMs: 1500 },
  ],
};

function remainingMs(ctx: RoutingContext): number {
  return Math.max(0, ctx.deadlineMs - (Date.now() - ctx.startedAt));
}

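function breakerAllows(key: string, getState: (id: string) => BreakerState): boolean {
  // Only a closed breaker admits user traffic. Half-open probing is left to
  // background canaries, so "half_open" is skipped here as well as "open".
  return getState(key) === "closed";
}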

// Shape of the injected provider client; routeAndComplete receives one of these.
type ProviderCall = (args: {
  provider: string;
  model: string;
  prompt: string;
  maxTokens: number;
  signal: AbortSignal;
}) => Promise<CompletionResult>;

// Runs a single attempt. The attempt budget and the caller's signal feed one
// AbortController, so whichever fires first cancels the upstream call.
async function completeWithDeadline(
  candidate: Candidate,
  prompt: string,
  budgetMs: number,
  signal: AbortSignal,
  callProvider: ProviderCall,
): Promise<CompletionResult> {
  const controller = new AbortController();
  const onAbort = () => controller.abort();
  // An "abort" listener never fires on a signal that is already aborted, so check first.
  if (signal.aborted) controller.abort();
  else signal.addEventListener("abort", onAbort, { once: true });

  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    return await callProvider({
      provider: candidate.provider,
      model: candidate.model,
      prompt,
      maxTokens: candidate.maxTokens,
      signal: controller.signal,
    });
  } finally {
    clearTimeout(timer);
    signal.removeEventListener("abort", onAbort);
  }
}

export async function routeAndComplete(params: {
  ctx: RoutingContext;
  prompt: string;
  parentSignal: AbortSignal;
  getBreakerState: (breakerKey: string) => BreakerState;
  callProvider: ProviderCall;
  onDecision: (event: Record<string, unknown>) => void;
}): Promise<{ text: string; selected: string }> {
  const chain = CHAINS[params.ctx.route];
  if (!chain) throw new Error(`No chain for route ${params.ctx.route}`);

  let lastError: unknown;

  for (let i = 0; i < chain.length; i++) {
    const candidate = chain[i];
    // Stop if the caller has gone away or too little time is left for a useful attempt.
    if (params.parentSignal.aborted) break;
    const left = remainingMs(params.ctx);
    if (left < 250) break;

    // Breakers are tracked per (provider, model), not per chain slot.
    const breakerKey = `${candidate.provider}/${candidate.model}`;
    if (!breakerAllows(breakerKey, params.getBreakerState)) {
      params.onDecision({ skipped: candidate.id, reason: "breaker_open" });
      continue;
    }

    // ttftBudgetMs doubles as the whole-attempt timeout in this non-streaming sketch.
    const attemptBudget = Math.min(candidate.ttftBudgetMs, left);
    try {
      const result = await completeWithDeadline(
        candidate,
        params.prompt,
        attemptBudget,
        params.parentSignal,
        params.callProvider,
      );

      if (params.ctx.strictSchema) {
        // Malformed JSON counts as a schema failure, not a transport error.
        // One schema serves every strict route here to keep the sketch short.
        let payload: unknown;
        try {
          payload = JSON.parse(result.text);
        } catch {
          payload = undefined;
        }
        if (!SupportAnswerSchema.safeParse(payload).success) {
          lastError = new Error("schema_validation_failed");
          params.onDecision({ selected: candidate.id, reason: "schema_fail", chain_position: i + 1 });
          continue;
        }
      }

      // chain_position is 1-based (1 = primary), matching the event example.
      params.onDecision({
        selected: candidate.id,
        chain_position: i + 1,
        reason: i === 0 ? "primary_ok" : "fallback_ok",
        input_tokens: result.inputTokens,
        output_tokens: result.outputTokens,
      });
      return { text: result.text, selected: candidate.id };
    } catch (err) {
      lastError = err;
      params.onDecision({
        selected: candidate.id,
        chain_position: i + 1,
        reason: "attempt_failed",
        error: err instanceof Error ? err.message : "unknown",
      });
    }
  }

  throw lastError ?? new Error("all_candidates_exhausted");
}

Notice four production details that are easy to skip in a demo:

remainingMs — fallbacks get whatever time is left, not a fresh full timeout.
Breaker skip — unhealthy candidates are not retried on every request during an outage.
Schema gate — soft failure triggers the next hop instead of returning garbage to the client.
onDecision telemetry — selection is observable for SLO and cost dashboards.

Wire this behind your existing admission control so you do not amplify provider incidents by accepting unbounded concurrent failover storms.

Common mistakes and pitfalls

Silent cross-tenant failover: routing policies that ignore tenantId and tier can downgrade enterprise SLAs or route regulated data to the wrong region. Encode constraints in the router, not in tribal knowledge.
Fallback without eval: swapping models without task-specific golden tests is how “the backup model” becomes a production incident. Measure extraction F1, citation accuracy, or human-rated quality per candidate.
Unbounded chain retries: three models × two attempts each × a 30 s timeout can hold a request for up to 180 s, long enough to exhaust workers. Cap attempts and respect the parent AbortSignal.
Ignoring streaming cancellation: client disconnect must abort upstream provider streams; otherwise failover during incidents increases cost exactly when you are trying to save capacity.
Global default model env var: LLM_MODEL=gpt-4.1 in twelve microservices means routing changes require coordinated deploys. Centralize policy in configuration with versioned rollout (feature flags help).
Treating all errors as transient: a provider's 400 for context length and 401 for bad keys should not trigger model rotation. The first needs a client-facing error and the second a credentials fix.
No user-visible degradation mode: when all candidates fail, return a structured error (503 with retry_after) and a UI fallback (“try again” or human handoff), not an empty message.
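
The retry arithmetic is worth making explicit. The sketch below compares the unbounded worst case with the wall time once every attempt is clipped to a shared request deadline; the function names are illustrative, and the 2.5 s deadline reuses the interactive example from earlier in the article.

```typescript
// Worst case when every attempt gets its own fresh timeout.
function worstCaseWallTimeMs(candidates: number, attemptsPerCandidate: number, attemptTimeoutMs: number): number {
  return candidates * attemptsPerCandidate * attemptTimeoutMs;
}

// With a shared deadline, each attempt gets min(its own timeout, what is left),
// so total wall time can never exceed the deadline.
function boundedWallTimeMs(
  candidates: number,
  attemptsPerCandidate: number,
  attemptTimeoutMs: number,
  requestDeadlineMs: number,
): number {
  let spentMs = 0;
  for (let attempt = 0; attempt < candidates * attemptsPerCandidate; attempt++) {
    const leftMs = requestDeadlineMs - spentMs;
    if (leftMs <= 0) break;
    spentMs += Math.min(attemptTimeoutMs, leftMs);
  }
  return spentMs;
}

const unboundedMs = worstCaseWallTimeMs(3, 2, 30_000); // 180000 ms, i.e. 3 minutes
const boundedMs = boundedWallTimeMs(3, 2, 30_000, 2_500); // 2500 ms
```

The gap between the two numbers is the time a worker would otherwise spend holding a connection for a request the client has already abandoned.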

Conclusion

LLM model routing and fallback chains turn a fragile single-vendor integration into a dependency-aware inference layer. The implementation is straightforward; the discipline is in policy—latency budgets, schema strictness, tenant tiers, breaker granularity—and in telemetry that makes every degradation path visible.

Teams that treat routing as an afterthought usually discover its importance during the first provider outage. Designing chains, deadlines, and eval gates up front is how LLM features stay fast, affordable, and explainable when the primary model is slow, expensive, or down. If you are building or hardening production LLM APIs and want routing, caching, and observability designed as one system, see About for background on this work; for collaboration or inquiries, Contact.
