LLM provider routing: fallback chains, hedging, and cost-aware model selection

Route production LLM traffic across models and vendors with explicit fallback chains, hedged requests, circuit breakers, and tier-aware selection—without silent quality regressions.

Author: Matheus Palma | June 10, 2026 | 12 min read

Software engineering, Artificial intelligence, Backend, API design, Reliability, TypeScript

Your assistant ships on a single hosted model. Tuesday morning the provider returns 503 for twenty minutes; your product is down because there is no second path. Thursday afternoon you add a cheaper model for “simple” questions; p95 latency improves, but support tickets spike because the router sends nuanced policy questions to a model that confidently hallucinates. Friday, finance asks why spend doubled—two teams each added a retry loop that fans out duplicate completions on timeout. None of these failures require exotic infrastructure. They come from treating model choice as a compile-time constant instead of a routing policy you can observe, test, and change.

This article explains how to build LLM provider routing as a production control plane: explicit fallback chains, optional hedging, circuit breaking, and cost-aware selection tied to product tiers. The patterns apply whether you call one vendor with multiple models or several OpenAI-compatible endpoints. In consulting work on copilots and internal tools, the teams that survive provider incidents are not the ones with the best prompt—they are the ones that can shift traffic without redeploying the UI.

What LLM routing is (and what it is not)

LLM routing decides, per request, which provider, model, and decoding profile will handle a completion—possibly with retries across alternatives when the primary path fails or exceeds a budget.

It is not:

Prompt engineering — routing chooses who answers; prompts define how they answer.
Semantic caching — caching reuses prior outputs; routing picks a fresh inference path.
Load balancing across identical replicas — that is ordinary HTTP balancing. Routing here means different models with different cost, latency, and capability profiles.

The routing layer sits between your application domain and the transport client:

HTTP handler → policy (tier, task type, budget) → router → provider client(s) → completion
                     ↑                                    ↓
              feature flags / kill switches         metrics + circuit state

Keeping this boundary explicit lets you change models in staging, shadow traffic to a candidate model, or drain a degraded provider without touching business logic.

Dimensions of a routing decision

A production router should evaluate structured inputs, not a single global default.

Task type and capability requirements

Not every route needs the largest model. Classify work before you call anything:

Task signal	Typical routing bias
Short classification, extraction, routing label	Small / fast model
Long-form drafting, multi-step reasoning	Larger model
Tool calls with strict JSON schema	Model with reliable structured output
PII-heavy or regulated content	Approved model list per tenant

Classification can be rules-based (route metadata, content length, locale), a lightweight classifier model, or a hybrid. The important part is that capability requirements are explicit in code or configuration, not implied by whichever model was hard-coded last quarter.
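A rules-based gate along these lines could look like the sketch below. The task labels, the `TaskSignals` fields, the `label/` route prefix, and the 500-character threshold are all illustrative assumptions, not part of any library or of the author's system.

```typescript
// Hypothetical task labels and tiers; adapt them to the routes your product exposes.
type TaskType = "classification" | "drafting" | "tool_call" | "regulated";
type ModelTier = "small" | "large" | "structured" | "approved_list";

type TaskSignals = {
  routeId: string; // assumed route metadata set by the HTTP handler
  promptChars: number; // rough size signal, not a token count
  requiresJsonSchema: boolean;
  containsPii: boolean; // assumed to come from an upstream PII detector
};

// Explicit mapping from task type to routing bias, mirroring the table above.
const TIER_FOR_TASK: Record<TaskType, ModelTier> = {
  classification: "small",
  drafting: "large",
  tool_call: "structured",
  regulated: "approved_list",
};

function classifyTask(s: TaskSignals): TaskType {
  // Order matters: compliance constraints win over cost or latency preferences.
  if (s.containsPii) return "regulated";
  if (s.requiresJsonSchema) return "tool_call";
  if (s.promptChars < 500 && s.routeId.startsWith("label/")) return "classification";
  // When no rule matches, default to the more capable tier rather than the cheaper one.
  return "drafting";
}
```

Defaulting to the larger tier when no rule matches trades cost for safety: a misclassified request is overpaid for rather than under-answered.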

Tenant tier and budget

Multi-tenant products need per-tenant ceilings: max tokens, allowed model families, and optional “premium override” flags. A startup on a free tier and an enterprise customer paying for an SLA should not share one undifferentiated gpt-4o default unless that is a deliberate choice.

Store budgets in durable counters (Redis, Postgres rollups) or provider-side limits where available. Routing should fail closed when a tenant exceeds budget—return a clear error or queue for async processing—rather than silently downgrading to a model that cannot meet contractual quality.
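The fail-closed rule can be sketched as a pure decision function. `TenantBudget`, `checkTenantBudget`, and the monthly-limit shape are hypothetical names for illustration; in practice `spentUsd` would be read from whichever durable counter you use.

```typescript
type TenantBudget = {
  monthlyLimitUsd: number;
  spentUsd: number; // assumed to be read from a durable counter (Redis, Postgres rollup)
};

type BudgetDecision =
  | { allowed: true }
  | { allowed: false; reason: "tenant_budget_exceeded"; action: "reject" | "queue_async" };

function checkTenantBudget(
  budget: TenantBudget,
  estimatedCostUsd: number,
  canQueue: boolean,
): BudgetDecision {
  if (budget.spentUsd + estimatedCostUsd <= budget.monthlyLimitUsd) {
    return { allowed: true };
  }
  // Fail closed: this function never picks a cheaper model.
  // The caller either rejects with a clear error or queues the work.
  return {
    allowed: false,
    reason: "tenant_budget_exceeded",
    action: canQueue ? "queue_async" : "reject",
  };
}
```

Keeping the decision separate from model selection makes it harder for a later change to turn "over budget" into a silent downgrade.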

Latency and cost budgets

Attach SLA classes to routes:

Interactive — optimize for time-to-first-token; prefer streaming; tight idle timeouts.
Background — optimize for cost; allow batch APIs or cheaper models; longer deadlines.

Express budgets as maximum prompt + completion tokens, maximum wall-clock time, and maximum USD per request (estimated from token price tables). The router rejects or reroutes before you burn the month's inference budget on a single abusive loop.

Provider health and circuit state

Treat each (provider, model, region) tuple as a dependency with its own error rate, latency distribution, and circuit breaker. A model that is "up" globally may be degraded for your API key or region. Routing must read live health, not yesterday's dashboard.

Fallback chains: ordered degradation with contracts

A fallback chain is an ordered list of candidates: try A, on specific failures try B, then C. This is the minimum viable multi-provider strategy.

Which failures should trigger fallback?

Be precise. Not every error should cascade.

Condition	Fallback?	Why
HTTP 429 / rate limit	Often yes	Transient capacity; try alternate key or provider
HTTP 503 / 502	Often yes	Provider or regional outage
Timeout / idle stream stall	Careful	May duplicate work; see idempotency below
HTTP 400 validation	Usually no	Your request is wrong; another model will likely fail too
Schema validation failure on 200	Sometimes	Retry once on same model with repair prompt; then fallback
Content policy violation	Rarely	Fallback may violate the same policy

Define fallback eligibility in configuration so operators can disable a bad secondary path without redeploying.
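One way to express the table above as data is a policy map keyed by failure class. The `FailureClass` names, `FALLBACK_POLICY`, and `nextAction` are assumptions made for this sketch; in a real system the map would be loaded from configuration rather than inlined.

```typescript
// Hypothetical failure classes; map provider errors onto these before consulting policy.
type FailureClass =
  | "rate_limited" // HTTP 429
  | "upstream_unavailable" // HTTP 502 / 503
  | "timeout"
  | "invalid_request" // HTTP 400
  | "schema_invalid" // HTTP 200 whose body fails schema validation
  | "content_policy";

type FallbackAction = "fallback" | "retry_same_then_fallback" | "fail";

// Inlined for illustration; operators would edit this in config, not in code.
const FALLBACK_POLICY: Record<FailureClass, FallbackAction> = {
  rate_limited: "fallback",
  upstream_unavailable: "fallback",
  timeout: "fallback", // only safe when the call has no side effects yet
  invalid_request: "fail",
  schema_invalid: "retry_same_then_fallback",
  content_policy: "fail",
};

function classifyStatus(status: number): FailureClass | undefined {
  if (status === 429) return "rate_limited";
  if (status === 502 || status === 503) return "upstream_unavailable";
  if (status === 400) return "invalid_request";
  return undefined; // unknown statuses are left for the caller to decide
}

function nextAction(
  failure: FailureClass,
  policy: Record<FailureClass, FallbackAction>,
  sameModelRetries: number,
): FallbackAction {
  const action = policy[failure];
  // One repair attempt on the same model, then move down the chain.
  if (action === "retry_same_then_fallback" && sameModelRetries >= 1) return "fallback";
  return action;
}
```

Disabling a bad secondary path then becomes a data change, for example setting `rate_limited` to `"fail"` for one route.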

Quality guardrails on downgrade

When you fall back from a large model to a small one, you are making a product trade-off, not just an infrastructure one. Mitigations:

Route only eligible tasks to cheaper models (classification gate).
Attach a quality flag to responses (model_tier: "economy") so the UI can set expectations or offer “upgrade to premium.”
Block downgrade for high-stakes routes (medical, legal, billing) regardless of provider errors—queue or surface an error instead.

In production systems I have worked on, the painful incidents were silent downgrades: users did not know they received a budget answer until trust was already damaged.
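A guardrail of this kind might be sketched as follows. `RoutePolicy`, `decideOnPrimaryFailure`, and `tagResponse` are hypothetical names; only the `model_tier` field comes from the text above.

```typescript
type Tier = "premium" | "economy";

type RoutePolicy = {
  highStakes: boolean; // e.g. medical, legal, billing routes
  economyEligible: boolean; // set by the classification gate
};

type DowngradeDecision =
  | { kind: "serve"; modelTier: Tier }
  | { kind: "refuse"; reason: "downgrade_blocked" };

// Called when the premium candidate has failed and only cheaper ones remain.
function decideOnPrimaryFailure(policy: RoutePolicy): DowngradeDecision {
  // High-stakes or ineligible routes never receive an economy answer.
  if (policy.highStakes || !policy.economyEligible) {
    return { kind: "refuse", reason: "downgrade_blocked" };
  }
  return { kind: "serve", modelTier: "economy" };
}

// The tier travels with the response so the UI can label it.
function tagResponse(text: string, modelTier: Tier): { text: string; model_tier: Tier } {
  return { text, model_tier: modelTier };
}
```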

Example chain configuration

type RouteClass = "interactive" | "batch";

type ModelCandidate = {
  provider: "openai" | "anthropic" | "self-hosted";
  model: string;
  maxOutputTokens: number;
  /** Estimated USD per 1M input tokens — used for budget checks only */
  inputPricePerM: number;
  outputPricePerM: number;
};

type FallbackChain = {
  routeClass: RouteClass;
  candidates: ModelCandidate[];
  /** Stop after this wall-clock budget across all attempts */
  totalDeadlineMs: number;
};

const SUPPORT_COPILOT_CHAIN: FallbackChain = {
  routeClass: "interactive",
  candidates: [
    {
      provider: "openai",
      model: "gpt-4.1-mini",
      maxOutputTokens: 1_024,
      inputPricePerM: 0.4,
      outputPricePerM: 1.6,
    },
    {
      provider: "anthropic",
      model: "claude-sonnet-4-20250514",
      maxOutputTokens: 1_024,
      inputPricePerM: 3.0,
      outputPricePerM: 15.0,
    },
    {
      provider: "self-hosted",
      model: "llama-3.3-70b-instruct",
      maxOutputTokens: 512,
      inputPricePerM: 0,
      outputPricePerM: 0,
    },
  ],
  totalDeadlineMs: 12_000,
};

Chains should be versioned and loaded from configuration (database, feature flag payload, or Git-managed YAML) so incident response can reorder candidates in minutes.
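Because the chain arrives from outside the codebase, it is worth validating before it reaches the router. The sketch below assumes a JSON payload with a hypothetical `version` field and a cap of three candidates; both are illustrative choices, not a known schema.

```typescript
type ChainConfig = {
  version: number; // hypothetical monotonically increasing config version
  candidates: Array<{ provider: string; model: string; maxOutputTokens: number }>;
  totalDeadlineMs: number;
};

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Throws on malformed input so a bad config push fails loudly at load time.
function parseChainConfig(raw: unknown, maxCandidates = 3): ChainConfig {
  if (!isRecord(raw)) throw new Error("chain config must be an object");
  const { version, candidates, totalDeadlineMs } = raw;
  if (typeof version !== "number" || !Number.isInteger(version)) {
    throw new Error("version must be an integer");
  }
  if (typeof totalDeadlineMs !== "number" || totalDeadlineMs <= 0) {
    throw new Error("totalDeadlineMs must be positive");
  }
  if (!Array.isArray(candidates) || candidates.length === 0 || candidates.length > maxCandidates) {
    throw new Error(`candidates must contain 1 to ${maxCandidates} entries`);
  }
  const parsed = candidates.map((c: unknown, i: number) => {
    if (!isRecord(c)) throw new Error(`candidate ${i} is malformed`);
    const { provider, model, maxOutputTokens } = c;
    if (typeof provider !== "string" || typeof model !== "string" || typeof maxOutputTokens !== "number") {
      throw new Error(`candidate ${i} is malformed`);
    }
    return { provider, model, maxOutputTokens };
  });
  return { version, candidates: parsed, totalDeadlineMs };
}
```

A router that keeps serving the last valid version when parsing fails avoids turning a config typo into an outage.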

Hedged requests: when duplicate work is worth it

Hedging starts a secondary request after a delay without canceling the first—whichever completes first wins; you cancel the loser. It cuts tail latency when providers have noisy p99 behavior.

When hedging helps

Interactive UIs where p99 latency dominates satisfaction.
Providers with spiky queue times but good median performance.
Read-like completions where duplicate output is discarded (no side effects).

When hedging hurts

Tool-calling loops — doubling calls may double side effects unless tools are idempotent.
Tight budgets — you pay for two completions on hedged wins.
Rate-limited keys — hedging can trigger 429s across the board.

A conservative pattern: start the hedge only after the primary has been outstanding for the route's p95 latency plus a small epsilon, hedge at most once per request, and only on routes marked hedgeEligible: true.

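async function hedgedCompletion<T>(
  // Each callback must pass the signal to its HTTP client so the loser can be cancelled.
  primary: (signal: AbortSignal) => Promise<T>,
  secondary: (signal: AbortSignal) => Promise<T>,
  hedgeAfterMs: number,
): Promise<T> {
  const primaryCtl = new AbortController();
  const secondaryCtl = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const primaryRun = primary(primaryCtl.signal);
  const hedgeTimer = new Promise<"hedge">((resolve) => {
    timer = setTimeout(() => resolve("hedge"), hedgeAfterMs);
  });

  try {
    // If the primary settles before the hedge delay, no second request is sent.
    // A primary failure here propagates to the caller (for example, the fallback chain).
    const first = await Promise.race([
      primaryRun.then((value) => ({ value })),
      hedgeTimer,
    ]);
    if (first !== "hedge") return first.value;

    // Promise.any (ES2021) resolves with the first success and rejects only if
    // both requests fail, so an early primary error cannot mask a secondary success.
    return await Promise.any([primaryRun, secondary(secondaryCtl.signal)]);
  } finally {
    clearTimeout(timer);
    // Cancel whatever is still in flight; aborting a settled request is a no-op.
    primaryCtl.abort();
    secondaryCtl.abort();
  }
}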

Instrument hedges: track hedge_started, hedge_won_by_secondary, and wasted_tokens so finance and engineering see the real cost.
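The hedge delay and the counters can both be derived from data you already collect. The sketch below assumes you keep a window of recent latency samples per route; `percentile`, `hedgeDelayMs`, and `recordHedge` are hypothetical helpers, and nearest-rank is just one of several percentile definitions.

```typescript
// Nearest-rank percentile over recent latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Hedge only once the primary has been outstanding longer than p95 plus a margin.
function hedgeDelayMs(samples: number[], epsilonMs: number): number {
  return percentile(samples, 95) + epsilonMs;
}

type HedgeStats = {
  hedge_started: number;
  hedge_won_by_secondary: number;
  wasted_tokens: number;
};

// Returns updated counters; loserTokens is whatever the cancelled request had consumed.
function recordHedge(stats: HedgeStats, wonBySecondary: boolean, loserTokens: number): HedgeStats {
  return {
    hedge_started: stats.hedge_started + 1,
    hedge_won_by_secondary: stats.hedge_won_by_secondary + (wonBySecondary ? 1 : 0),
    wasted_tokens: stats.wasted_tokens + loserTokens,
  };
}
```

If `hedge_won_by_secondary` stays close to zero, the hedge is mostly buying wasted tokens and the delay or eligibility rules deserve another look.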

Cost-aware selection without a race to the bottom

Cost-aware routing picks the cheapest candidate that still satisfies capability constraints—not the cheapest model globally.

Estimation before the call

Before invoking the model, estimate:

function estimateCostUsd(
  candidate: ModelCandidate,
  promptTokens: number,
  expectedOutputTokens: number,
): number {
  return (
    (promptTokens / 1_000_000) * candidate.inputPricePerM +
    (expectedOutputTokens / 1_000_000) * candidate.outputPricePerM
  );
}

Use tokenizer counts from the provider or a compatible local tokenizer. If the estimate exceeds the per-request or per-tenant budget, downshift to a cheaper candidate in the same chain or return 402 Payment Required / a product-specific quota error.
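Selecting the cheapest candidate that still meets capability requirements can be sketched as a filter followed by a sort. `PricedCandidate`, its `capabilities` tags, and `cheapestCapable` are assumptions for this example; the price fields follow the same USD-per-million-tokens convention as the chain configuration.

```typescript
type PricedCandidate = {
  model: string;
  capabilities: string[]; // hypothetical capability tags, e.g. "json_schema"
  inputPricePerM: number; // USD per 1M input tokens
  outputPricePerM: number; // USD per 1M output tokens
};

function estimateUsd(c: PricedCandidate, promptTokens: number, expectedOutputTokens: number): number {
  return (
    (promptTokens / 1_000_000) * c.inputPricePerM +
    (expectedOutputTokens / 1_000_000) * c.outputPricePerM
  );
}

// Returns undefined when nothing fits, so the caller can surface a quota error.
function cheapestCapable(
  candidates: PricedCandidate[],
  required: string[],
  promptTokens: number,
  expectedOutputTokens: number,
  maxUsd: number,
): PricedCandidate | undefined {
  const viable = candidates
    // Capability constraints are applied first; price never overrides them.
    .filter((c) => required.every((r) => c.capabilities.includes(r)))
    .map((c) => ({ c, usd: estimateUsd(c, promptTokens, expectedOutputTokens) }))
    .filter((x) => x.usd <= maxUsd)
    .sort((a, b) => a.usd - b.usd);
  return viable.length > 0 ? viable[0].c : undefined;
}
```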

Dynamic routing from observed quality

Offline benchmarks are not enough. Log task type, model, latency, token usage, and downstream outcomes (user thumbs, support escalations, automated eval scores). Periodically promote or demote models in chains based on quality per dollar, not cost alone.

This is where teams building scalable assistant platforms separate themselves from demos: routing becomes a feedback loop, not a one-time spreadsheet.
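A periodic promotion job could rank models by quality per dollar after applying a quality floor. In this sketch `evalScore` is assumed to be a normalized score between 0 and 1 from automated evals or user feedback, and `rankByQualityPerDollar` is a hypothetical helper rather than an established metric.

```typescript
type ModelOutcome = { model: string; evalScore: number; costUsd: number };

type ModelScore = {
  model: string;
  meanScore: number;
  meanCostUsd: number;
  qualityPerDollar: number;
};

function rankByQualityPerDollar(outcomes: ModelOutcome[], minMeanScore: number): ModelScore[] {
  const byModel = new Map<string, { score: number; cost: number; n: number }>();
  for (const o of outcomes) {
    const agg = byModel.get(o.model) ?? { score: 0, cost: 0, n: 0 };
    agg.score += o.evalScore;
    agg.cost += o.costUsd;
    agg.n += 1;
    byModel.set(o.model, agg);
  }
  return Array.from(byModel.entries())
    .map(([model, a]) => {
      const meanScore = a.score / a.n;
      const meanCostUsd = a.cost / a.n;
      // A small floor avoids dividing by zero for models priced at 0 per token.
      const qualityPerDollar = meanScore / Math.max(meanCostUsd, 1e-9);
      return { model, meanScore, meanCostUsd, qualityPerDollar };
    })
    // Quality floor first: a cheap model that fails the bar is never promoted.
    .filter((s) => s.meanScore >= minMeanScore)
    .sort((a, b) => b.qualityPerDollar - a.qualityPerDollar);
}
```

Applying the floor before ranking is what keeps the loop from converging on the cheapest model regardless of outcomes.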

Circuit breakers and provider drain

Reuse the same circuit breaker semantics you apply to payment gateways:

Closed — normal traffic.
Open — skip provider after error threshold; fast-fail to next candidate.
Half-open — probe with a trickle of traffic.

Track breakers per (provider, model, region) and expose metrics to dashboards. During an incident, operators should be able to drain a provider—force all traffic to skip it—via feature flag without waiting for automatic breaker thresholds.

Combine breakers with concurrency limits (bulkheads) per provider so one runaway batch job cannot exhaust your shared quota for interactive users.
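The three states can be sketched as a small class. This is a simplified illustration: it counts consecutive failures instead of a rolling error rate, and in half-open it admits every request until the first result arrives, whereas a production breaker would cap concurrent probes. The thresholds and the `breakerFor` registry are assumptions.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True when a request may be sent. After the cooldown, an open breaker becomes half-open.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half_open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    // A failed probe reopens immediately; otherwise wait for the threshold.
    if (this.state === "half_open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}

// One breaker per (provider, model, region) tuple.
const breakers = new Map<string, CircuitBreaker>();

function breakerFor(provider: string, model: string, region: string): CircuitBreaker {
  const key = `${provider}:${model}:${region}`;
  let breaker = breakers.get(key);
  if (!breaker) {
    breaker = new CircuitBreaker(5, 30_000); // illustrative defaults
    breakers.set(key, breaker);
  }
  return breaker;
}
```

An operator drain can be layered on top by checking a feature flag before `allowRequest()`, so a manual override never depends on breaker thresholds.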

Practical example: a small routing executor

The following executor ties together chains, deadlines, circuit skipping, and structured telemetry. It is intentionally boring: production routers should be easy to reason about under stress.

type CompletionRequest = {
  routeClass: RouteClass;
  messages: Array<{ role: string; content: string }>;
  tenantId: string;
  correlationId: string;
};

type CompletionResult = {
  text: string;
  provider: string;
  model: string;
  attempt: number;
  cache?: "bypass";
};

type ProviderClient = {
  complete(params: {
    model: string;
    messages: CompletionRequest["messages"];
    maxOutputTokens: number;
    signal: AbortSignal;
  }): Promise<{ text: string }>;
};

type CircuitSnapshot = {
  isOpen(provider: string, model: string): boolean;
  recordSuccess(provider: string, model: string): void;
  recordFailure(provider: string, model: string, err: unknown): void;
};

type RouterDeps = {
  chains: Record<RouteClass, FallbackChain>;
  clients: Record<string, ProviderClient>;
  circuits: CircuitSnapshot;
  now: () => number;
};

function isRetryable(err: unknown): boolean {
  if (err instanceof Error && "status" in err) {
    const status = (err as Error & { status: number }).status;
    return status === 429 || status === 502 || status === 503;
  }
  return err instanceof DOMException && err.name === "AbortError";
}

export async function routeCompletion(
  deps: RouterDeps,
  req: CompletionRequest,
): Promise<CompletionResult> {
  const chain = deps.chains[req.routeClass];
  const deadline = deps.now() + chain.totalDeadlineMs;

  for (let attempt = 0; attempt < chain.candidates.length; attempt++) {
    const candidate = chain.candidates[attempt];
    const client = deps.clients[candidate.provider];
    if (!client) continue;

    if (deps.circuits.isOpen(candidate.provider, candidate.model)) {
      continue;
    }

    const remainingMs = deadline - deps.now();
    if (remainingMs <= 0) break;

    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), remainingMs);

    try {
      const { text } = await client.complete({
        model: candidate.model,
        messages: req.messages,
        maxOutputTokens: candidate.maxOutputTokens,
        signal: controller.signal,
      });

      deps.circuits.recordSuccess(candidate.provider, candidate.model);

      return {
        text,
        provider: candidate.provider,
        model: candidate.model,
        attempt,
      };
    } catch (err) {
      deps.circuits.recordFailure(candidate.provider, candidate.model, err);
      if (!isRetryable(err) || attempt === chain.candidates.length - 1) {
        throw err;
      }
    } finally {
      clearTimeout(timer);
    }
  }

  throw new Error("all_candidates_exhausted");
}

Wire structured logs at each attempt: correlationId, tenantId, provider, model, attempt, latencyMs, errorClass. When a provider pages you at 2 a.m., these fields are the difference between a five-minute flag flip and a two-hour log archaeology session.

Shadow traffic and staged rollouts

Before promoting a new model to position one in a chain, shadow it: execute the call asynchronously, discard the result for the user, and compare latency, token usage, and eval scores against the primary. Feature flags can route 1% of production traffic to shadow mode for a week—enough signal without betting the product.
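A shadow call might be wired as in the sketch below. The hash-based sampler, `completeWithShadow`, and the `ShadowComparison` shape are assumptions for illustration; `report` stands in for whatever metrics or logging sink you already have.

```typescript
type Completion = { text: string; latencyMs: number; totalTokens: number };

type ShadowComparison = {
  correlationId: string;
  primaryLatencyMs: number;
  shadowLatencyMs?: number;
  shadowTokens?: number;
  shadowError?: string;
};

// Deterministic sampling so a given request id is always in or out of the sample.
function inShadowSample(correlationId: string, percent: number): boolean {
  let h = 0;
  for (let i = 0; i < correlationId.length; i++) {
    h = (h * 31 + correlationId.charCodeAt(i)) >>> 0;
  }
  return h % 100 < percent;
}

async function completeWithShadow(
  correlationId: string,
  primary: () => Promise<Completion>,
  shadow: () => Promise<Completion>,
  percent: number,
  report: (c: ShadowComparison) => void,
): Promise<Completion> {
  // Start the shadow call alongside the primary and capture its outcome
  // so a shadow failure can never reject into the user-facing path.
  const shadowSettled = inShadowSample(correlationId, percent)
    ? shadow().then(
        (s) => ({ ok: true as const, s }),
        (err: unknown) => ({ ok: false as const, err }),
      )
    : undefined;

  const result = await primary();

  if (shadowSettled) {
    // Not awaited: the user response does not wait for the shadow model.
    void shadowSettled.then((r) =>
      report(
        r.ok
          ? { correlationId, primaryLatencyMs: result.latencyMs, shadowLatencyMs: r.s.latencyMs, shadowTokens: r.s.totalTokens }
          : { correlationId, primaryLatencyMs: result.latencyMs, shadowError: String(r.err) },
      ),
    );
  }
  return result;
}
```

Shadow calls still consume tokens and quota, so the sample percentage belongs in the same budget conversation as hedging.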

Common mistakes and pitfalls

Fallback without idempotency on timeouts. If the first request actually completed but your client timed out, a blind retry can duplicate assistant messages, tool side effects, or billing events. Use idempotency keys on persistence and idempotent tools, or reserve fallback for pre-side-effect stages only.
Same API key, multiple “providers.” Routing to two model names on one vendor does not protect you from regional outages or account-level rate limits. True resilience needs independent quotas—separate keys, vendors, or self-hosted capacity.
Unbounded chains. Each hop adds latency and failure modes. Cap chain length (typically two or three), enforce a total deadline, and surface a clear error when exhausted.
Ignoring output contract differences. Models format tool calls and JSON differently. Fallback across vendors requires a normalization layer and schema validation—not a raw passthrough of provider payloads.
Routing decisions only in the client. Mobile and web clients must not choose models; users will tamper with requests, and you cannot enforce budget or compliance. Route on the server, always.
No kill switch. Incidents need a single flag to force a known-good model or disable LLM features entirely. Without it, you edit code while production burns.
Chasing cheapest models for every route. Cost optimization that erodes answer quality increases human support cost—often more than you saved on tokens.
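For the first pitfall, an idempotency key on persistence can be sketched as below. The in-memory `IdempotentStore` is a stand-in for a durable table with a unique constraint, and the `conversationId:turnIndex` key format is an assumption; the point is only that every provider attempt for the same logical turn writes under the same key.

```typescript
// In-memory stand-in for a durable store with a unique key constraint.
class IdempotentStore<T> {
  private readonly results = new Map<string, T>();

  // Stores the value only if the key is new; returns whichever value is persisted.
  putIfAbsent(key: string, value: T): { stored: boolean; value: T } {
    if (this.results.has(key)) {
      return { stored: false, value: this.results.get(key) as T };
    }
    this.results.set(key, value);
    return { stored: true, value };
  }
}

// Same logical turn, same key, regardless of which attempt produced the answer.
function idempotencyKey(conversationId: string, turnIndex: number): string {
  return `${conversationId}:${turnIndex}`;
}
```

With this in place, a late-arriving primary response and a fallback response for the same turn resolve to a single stored message instead of two.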

Conclusion

LLM provider routing turns model choice into policy: explicit chains, measurable budgets, circuit-aware execution, and optional hedging for tail latency. The implementation is mostly disciplined software engineering—timeouts, breakers, structured logs, feature flags—not a new category of magic. The teams that ship reliable copilots and AI-backed APIs treat routing as part of the platform: versioned, observable, and changeable without a frontend deploy.

If you are designing or hardening multi-model LLM backends and want them to stay available when vendors wobble—and predictable when finance asks about spend—it pays to invest in routing early rather than bolting on a second API key after the first outage. For more on how this work fits into broader engineering practice, see About; for collaboration on production-ready systems, Contact.
