Context engineering for production LLM applications

How to assemble, budget, and evaluate the information you send to models. Layered context, truncation policies, versioning, and regression tests for reliable LLM backends.

Author: Matheus Palma · 10 min read
Artificial intelligence · Software engineering · Backend · API design · TypeScript · RAG

Your retrieval pipeline returns the right documents. Your system prompt is carefully written. Users still get wrong answers because the assembled context—the exact bytes the model sees—is wrong in subtle ways: stale tool output buried under chat history, a truncated JSON schema, or a RAG chunk that pushed the user's question out of the window. In production LLM backends, context engineering is the discipline of deciding what enters the model, in what order, and under what budget, every time you call inference.

This article treats context as a first-class API surface: versioned, measured, and tested like any other contract. The patterns show up constantly when hardening assistants, copilots, and agentic workflows for clients who need predictable behavior under load—not demos that work until the conversation gets long.

Context engineering vs prompt engineering

Prompt engineering focuses on wording: instructions, few-shot examples, tone, and output format. Context engineering focuses on assembly: which slices of information are included, how they are ordered, how much space each slice receives, and how the bundle changes across turns, tenants, and product versions.

Concern | Prompt engineering | Context engineering
Primary artifact | System message text | Full request payload (system + tools + RAG + history + metadata)
Failure mode | Model ignores or misreads instructions | Model never sees the evidence or tool result it needed
Testing | Subjective quality review | Token budgets, regression suites, diffable context snapshots
Versioning | prompt_v3.md in a repo | context_policy_id tied to deployments and eval baselines

Both matter. A perfect system prompt cannot compensate for missing retrieval results or a history trim that dropped the user's constraint from three turns ago.

The context budget: tokens are a hard constraint

Every model has a context window—input plus output capacity. Production systems rarely use the theoretical maximum:

  • Latency and cost scale with input tokens; long contexts inflate both.
  • Attention quality degrades on some models when critical facts sit in the middle of very long prompts ("lost in the middle" effects).
  • Output headroom must be reserved for tool calls, JSON, or streaming completions.

Define an explicit context budget per route:

total_window     = 128_000   # model limit
reserved_output  = 4_096     # max completion + tool JSON
safety_margin    = 512       # tokenizer drift, special tokens
available_input  = total_window - reserved_output - safety_margin

Allocate available_input across layers (see below). When a layer exceeds its allocation, apply a truncation policy—never silently drop the user message or active tool definitions.
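As a minimal sketch, the budget arithmetic above can live in one small function that also rejects a misconfigured route. The WindowBudget type and availableInput function are hypothetical names chosen for illustration; the numbers are the ones from the example.

```typescript
// Hypothetical shape; field names mirror the budget formula above.
type WindowBudget = {
  totalWindow: number;    // model context limit, in tokens
  reservedOutput: number; // max completion + tool JSON
  safetyMargin: number;   // tokenizer drift, special tokens
};

// Returns the tokens left for input after output headroom and margin.
function availableInput(b: WindowBudget): number {
  const available = b.totalWindow - b.reservedOutput - b.safetyMargin;
  if (available <= 0) {
    throw new Error("BUDGET_MISCONFIGURED: reservations exceed the model window");
  }
  return available;
}

const chatRoute: WindowBudget = {
  totalWindow: 128_000,
  reservedOutput: 4_096,
  safetyMargin: 512,
};

console.log(availableInput(chatRoute)); // 123392
```

Keeping the calculation in one place makes it easy to log the resulting number per route and to see when a model swap changes it.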

Layered allocation (example policy)

Layer | Share of input budget | Truncation strategy
System + policy | Fixed cap (e.g. 2–4k tokens) | Fail deploy if over cap; do not truncate at runtime
Tool schemas | Fixed cap | Drop optional tools by priority list; never drop required tools mid-flight
RAG / knowledge | Flexible (largest slice) | Rerank, then trim chunks by score; prefer diverse sources
Session history | Flexible | Summarize old turns; keep recent verbatim
Current user message | Minimum guarantee | Never truncate; reject request if it alone exceeds budget

Treat these numbers as product configuration, not constants in code without observability.
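One way to express such a policy as configuration is sketched below. LayerPolicy, allocateLayers, and assertSystemPromptFits are hypothetical names, and splitting the flexible remainder with a single ragShare fraction is an assumption made for illustration, not the only reasonable scheme.

```typescript
// Hypothetical policy shape: fixed caps plus a split of whatever is left.
type LayerPolicy = {
  systemCap: number;     // fixed cap, enforced at deploy time
  toolsCap: number;      // fixed cap for tool schemas
  minUserTokens: number; // room guaranteed to the current user message
  ragShare: number;      // fraction (0..1) of the flexible remainder given to RAG
};

type LayerBudgets = {
  system: number;
  tools: number;
  user: number;
  rag: number;
  history: number;
};

function allocateLayers(availableInput: number, p: LayerPolicy): LayerBudgets {
  if (p.ragShare < 0 || p.ragShare > 1) throw new Error("INVALID_RAG_SHARE");
  const fixed = p.systemCap + p.toolsCap + p.minUserTokens;
  const flexible = availableInput - fixed;
  if (flexible < 0) throw new Error("FIXED_LAYERS_EXCEED_BUDGET");
  const rag = Math.floor(flexible * p.ragShare);
  // History receives the rest of the flexible remainder.
  return { system: p.systemCap, tools: p.toolsCap, user: p.minUserTokens, rag, history: flexible - rag };
}

// Deploy-time check: fail the build rather than truncate the system prompt at runtime.
function assertSystemPromptFits(systemTokens: number, p: LayerPolicy): void {
  if (systemTokens > p.systemCap) {
    throw new Error(`SYSTEM_PROMPT_OVER_CAP: ${systemTokens} > ${p.systemCap}`);
  }
}

const policy: LayerPolicy = { systemCap: 3_000, toolsCap: 2_000, minUserTokens: 1_000, ragShare: 0.75 };
console.log(allocateLayers(16_000, policy));
```

Because the policy is plain data, it can be stored per context_policy_id and emitted alongside traces.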

Designing context layers

Think of each request as a stack of typed blocks. The model reads top to bottom; order affects salience.

System and policy block

Instructions, safety rules, locale, and output format. Keep this block stable within a prompt_version. Store the version on the session or trace so you can correlate quality regressions with deploys.

Tool and schema block

Function definitions, JSON schemas, and tool-choice policy. Large tool sets are a common cause of silent budget overrun. Mitigations:

  • Route-specific tool subsets — a billing route does not need calendar tools.
  • Schema compression — shorter property descriptions; $ref consolidation where the provider allows.
  • Lazy tool exposure — expose a "search tools" meta-tool only when the planner needs breadth.
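A rough sketch of fitting tool schemas under a fixed cap follows. ToolDef, its required and priority fields, and fitTools are hypothetical; countTokens stands in for whatever tokenizer your stack provides.

```typescript
// Hypothetical tool descriptor; `priority` orders optional tools (lower = kept first).
type ToolDef = {
  name: string;
  schemaJson: string;
  required: boolean; // required tools are never dropped
  priority: number;
};

function fitTools(
  tools: ToolDef[],
  cap: number,
  countTokens: (text: string) => number,
): { kept: ToolDef[]; dropped: string[] } {
  const required = tools.filter((t) => t.required);
  let used = required.reduce((sum, t) => sum + countTokens(t.schemaJson), 0);
  if (used > cap) throw new Error("REQUIRED_TOOLS_EXCEED_CAP");

  // Deterministic order: priority first, then name as a tie-breaker.
  const optional = tools
    .filter((t) => !t.required)
    .sort((a, b) => a.priority - b.priority || (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));

  const kept = [...required];
  const dropped: string[] = [];
  for (const t of optional) {
    const cost = countTokens(t.schemaJson);
    if (used + cost <= cap) {
      kept.push(t);
      used += cost;
    } else {
      dropped.push(t.name);
    }
  }
  return { kept, dropped };
}
```

Returning the dropped names lets the trace record which tools a request ran without.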

Retrieval block

Chunks, citations, and structured facts from search or SQL. Tie retrieval to corpus version metadata (see RAG pipelines for chunking and evaluation). In context engineering, the extra requirement is provenance in the prompt: chunk ids, source URLs, or doc_version strings so the model (and your eval harness) can reason about freshness.
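To illustrate provenance in the prompt, here is a small sketch that prefixes each chunk with its id, source, and document version. RetrievedChunk and the bracketed header layout are assumptions for illustration; any stable, parseable format would serve.

```typescript
// Hypothetical chunk shape carrying provenance metadata.
type RetrievedChunk = {
  id: string;
  sourceUrl: string;
  docVersion: string;
  text: string;
};

// One header line per chunk so both the model and an eval harness can cite it.
function renderChunk(c: RetrievedChunk): string {
  return `[chunk:${c.id} source:${c.sourceUrl} doc_version:${c.docVersion}]\n${c.text}`;
}

function renderRetrievalBlock(chunks: RetrievedChunk[]): string {
  if (chunks.length === 0) return "";
  return `## Retrieved context\n${chunks.map(renderChunk).join("\n\n")}`;
}

console.log(
  renderRetrievalBlock([
    {
      id: "kb-17",
      sourceUrl: "https://example.com/refunds",
      docVersion: "2024-06",
      text: "Refunds are processed within 5 business days.",
    },
  ]),
);
```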

Session history block

Prior user and assistant turns, tool calls, and tool results. This layer grows fastest. Strategies:

  • Sliding window — keep the last N turns verbatim; cheap but loses early constraints.
  • Summarization — compress older turns into a rolling summary stored on the session; cheaper tokens, risk of summary drift.
  • Structured state — extract slots (destination, budget, ticket_id) into a JSON "working memory" block instead of replaying full prose.

For multi-turn backends, session modeling is its own topic; the key context-engineering lesson is to never let history crowd out retrieval or the current user message.
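As a sketch of combining a sliding window with structured state, the function below keeps the last few turns verbatim and, when older turns were cut, prepends a compact working-memory message. Turn, WorkingMemory, buildHistoryBlock, and the "[Working memory]" prefix are hypothetical; extracting the slots themselves is assumed to happen elsewhere.

```typescript
type Turn = { role: "user" | "assistant"; content: string };
type WorkingMemory = Record<string, string>; // e.g. destination, budget, ticket_id

function buildHistoryBlock(turns: Turn[], memory: WorkingMemory, keepLast: number): Turn[] {
  // slice(-0) would return everything, so handle keepLast <= 0 explicitly.
  const recent = keepLast > 0 ? turns.slice(-keepLast) : [];
  const trimmed = turns.length > recent.length;
  const keys = Object.keys(memory).sort(); // sorted keys keep the output deterministic
  if (!trimmed || keys.length === 0) return recent;

  const sorted: WorkingMemory = {};
  for (const k of keys) sorted[k] = memory[k];

  // Early constraints survive as slots even though their turns were dropped.
  const memoryTurn: Turn = {
    role: "assistant",
    content: `[Working memory] ${JSON.stringify(sorted)}`,
  };
  return [memoryTurn, ...recent];
}
```

The slots cost a handful of tokens, while replaying the turns that established them could cost hundreds.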

Ephemeral runtime block

Live data: current UTC time, feature flags, user tier, request id. Small but easy to forget in regression tests—snapshot it in eval fixtures.
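A minimal sketch of making this block snapshot-friendly: pass the runtime facts in as data so an eval fixture can freeze them. RuntimeFacts and renderRuntimeBlock are hypothetical names, and the key=value layout is only one possible rendering.

```typescript
// Hypothetical runtime facts; production fills these per request, fixtures freeze them.
type RuntimeFacts = {
  nowUtc: string; // ISO 8601 timestamp
  userTier: string;
  requestId: string;
  featureFlags: string[];
};

function renderRuntimeBlock(f: RuntimeFacts): string {
  // Sort flags so the rendered block does not depend on lookup order.
  const flags = [...f.featureFlags].sort().join(",");
  return [
    `now_utc=${f.nowUtc}`,
    `user_tier=${f.userTier}`,
    `request_id=${f.requestId}`,
    `feature_flags=${flags}`,
  ].join("\n");
}

// In production: nowUtc would come from new Date().toISOString().
// In an eval fixture: every field is a frozen literal.
const fixture: RuntimeFacts = {
  nowUtc: "2024-01-15T09:00:00.000Z",
  userTier: "pro",
  requestId: "req-fixture-1",
  featureFlags: ["new_billing", "beta_search"],
};
console.log(renderRuntimeBlock(fixture));
```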

Assembly pipeline: deterministic, testable, observable

Avoid ad-hoc string concatenation in route handlers. Use a context builder that:

  1. Loads configuration (context_policy_id, model, tenant).
  2. Fetches each layer from its source (DB, vector index, session store).
  3. Measures token counts per layer (use the same tokenizer family as the provider when possible).
  4. Applies truncation policies in a fixed order (trim RAG before history, never system).
  5. Emits a structured trace: layer sizes, dropped chunk ids, prompt_version, kb_snapshot_id.

Log or export the trace to your observability stack. When users report "it forgot what I said," the trace shows whether history was summarized, RAG was empty, or tools were dropped.
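To make that triage concrete, here is a hedged sketch of a trace shape and a helper that turns it into human-readable findings. ContextTrace, its field names, and triageTrace are assumptions for illustration; adapt them to whatever your builder actually records.

```typescript
// Hypothetical trace emitted by a context builder for each request.
type ContextTrace = {
  promptVersion: string;
  kbSnapshotId: string;
  layerTokens: { system: number; tools: number; rag: number; history: number; user: number };
  droppedChunkIds: string[];
  droppedTools: string[];
  summarizedHistory: boolean;
};

// Lists the assembly decisions most likely to explain a "it forgot" report.
function triageTrace(trace: ContextTrace): string[] {
  const findings: string[] = [];
  if (trace.summarizedHistory) findings.push("history was summarized");
  if (trace.layerTokens.rag === 0) findings.push("RAG block was empty");
  if (trace.droppedChunkIds.length > 0) {
    findings.push(`chunks dropped: ${trace.droppedChunkIds.join(", ")}`);
  }
  if (trace.droppedTools.length > 0) {
    findings.push(`tools dropped: ${trace.droppedTools.join(", ")}`);
  }
  return findings;
}
```

An empty result suggests the context was assembled in full and the problem lies elsewhere.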

Practical example: a budget-aware context builder

The following TypeScript sketch shows layered assembly with explicit budgets and a hard guarantee for the user message. Replace countTokens and fetchChunks with your stack; the control flow is what matters.

type ContextLayer =
  | { kind: "system"; text: string; tokens: number }
  | { kind: "tools"; json: string; tokens: number }
  | { kind: "rag"; chunks: { id: string; text: string; tokens: number }[] }
  | { kind: "history"; messages: { role: string; content: string; tokens: number }[] }
  | { kind: "user"; text: string; tokens: number };

type BuildResult = {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  trace: {
    policyId: string;
    droppedChunkIds: string[];
    summarizedHistory: boolean;
    totalInputTokens: number;
  };
};

export async function buildContext(params: {
  policyId: string;
  systemPrompt: string;
  toolSchemasJson: string;
  query: string;
  sessionId: string;
  maxInputTokens: number;
  minUserTokens: number;
  ragBudget: number;
  historyBudget: number;
  countTokens: (text: string) => number;
  fetchChunks: (query: string) => Promise<{ id: string; text: string }[]>;
  loadHistory: (sessionId: string) => Promise<{ role: string; content: string }[]>;
}): Promise<BuildResult> {
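  // minUserTokens is the allocation guaranteed to the current user message.
  // A message larger than that allocation is rejected, never truncated.
  const userTokens = params.countTokens(params.query);
  if (userTokens > params.minUserTokens) {
    throw new Error("USER_MESSAGE_EXCEEDS_BUDGET");
  }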

  const systemTokens = params.countTokens(params.systemPrompt);
  const toolTokens = params.countTokens(params.toolSchemasJson);
  const fixedOverhead = systemTokens + toolTokens + userTokens;

  if (fixedOverhead > params.maxInputTokens) {
    throw new Error("FIXED_LAYERS_EXCEED_BUDGET");
  }

  let remaining = params.maxInputTokens - fixedOverhead;
  const ragBudget = Math.min(params.ragBudget, remaining);
  remaining -= ragBudget;

  const historyBudget = Math.min(params.historyBudget, remaining);

  const rawChunks = await params.fetchChunks(params.query);
  const chunksWithTokens = rawChunks.map((c) => ({
    ...c,
    tokens: params.countTokens(c.text),
  }));

  const { kept, droppedIds } = fitChunksByScore(chunksWithTokens, ragBudget);

  const history = await params.loadHistory(params.sessionId);
  const { messages: historyMessages, summarized } = fitHistory(
    history,
    historyBudget,
    params.countTokens,
  );

  const ragBlock = kept.length
    ? `## Retrieved context\n${kept.map((c) => `[${c.id}]\n${c.text}`).join("\n\n")}`
    : "";

  const messages: BuildResult["messages"] = [
    { role: "system", content: params.systemPrompt },
    ...(ragBlock ? [{ role: "system" as const, content: ragBlock }] : []),
    ...historyMessages,
    { role: "user", content: params.query },
  ];

  const totalInputTokens =
    systemTokens +
    toolTokens +
    kept.reduce((s, c) => s + c.tokens, 0) +
    historyMessages.reduce((s, m) => s + params.countTokens(m.content), 0) +
    userTokens;

  return {
    messages,
    trace: {
      policyId: params.policyId,
      droppedChunkIds: droppedIds,
      summarizedHistory: summarized,
      totalInputTokens,
    },
  };
}

function fitChunksByScore(
  chunks: { id: string; text: string; tokens: number }[],
  budget: number,
): { kept: typeof chunks; droppedIds: string[] } {
  // Placeholder ordering: largest chunks first. In production, sort by reranker score instead.
  const sorted = [...chunks].sort((a, b) => b.tokens - a.tokens);
  const kept: typeof chunks = [];
  const droppedIds: string[] = [];
  let used = 0;
  for (const c of sorted) {
    if (used + c.tokens <= budget) {
      kept.push(c);
      used += c.tokens;
    } else {
      droppedIds.push(c.id);
    }
  }
  return { kept, droppedIds };
}

function fitHistory(
  history: { role: string; content: string }[],
  budget: number,
  countTokens: (t: string) => number,
): {
  messages: { role: "user" | "assistant"; content: string }[];
  summarized: boolean;
} {
  const out: { role: "user" | "assistant"; content: string }[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const m = history[i];
    const t = countTokens(m.content);
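    if (used + t > budget) {
      // Placeholder: a real implementation would insert the rolling summary
      // stored on the session and count its tokens against the budget.
      return {
        messages: [
          { role: "assistant", content: "[Earlier conversation summarized: user goals unchanged.]" },
          ...out,
        ],
        summarized: true,
      };
    }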
    out.unshift({ role: m.role as "user" | "assistant", content: m.content });
    used += t;
  }
  return { messages: out, summarized: false };
}

Production code adds: reranking before fitChunksByScore, tenant isolation on retrieval, redaction, and persistence of trace on the inference span.

Evaluation: context changes are breaking changes

When you change retrieval top-K, history summarization, or system prompt length, you change the input distribution to the model. Treat that like an API breaking change:

  • Golden sessions: fixed multi-turn transcripts with expected tool calls and citation behavior.
  • Context snapshots: store PII-scrubbed assembled prompts together with a content hash; CI flags a changed hash and reviewers diff the stored text.
  • Layer metrics: alert when median droppedChunkIds.length spikes after a deploy.
  • Offline replay: re-run the same user queries against old and new context_policy_id; compare answer faithfulness with an LLM-as-judge rubric or human spot checks.
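The layer-metrics check could be sketched as a comparison of median dropped-chunk counts between a baseline replay and a candidate replay. ReplayTrace, droppedChunkRegression, and the maxIncrease threshold are hypothetical; the right threshold depends on your traffic.

```typescript
// Hypothetical minimal trace recorded during an offline replay.
type ReplayTrace = { policyId: string; droppedChunkIds: string[] };

function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Flags a regression when the candidate policy drops noticeably more chunks.
function droppedChunkRegression(
  baseline: ReplayTrace[],
  candidate: ReplayTrace[],
  maxIncrease: number,
): { baselineMedian: number; candidateMedian: number; regressed: boolean } {
  const baselineMedian = median(baseline.map((t) => t.droppedChunkIds.length));
  const candidateMedian = median(candidate.map((t) => t.droppedChunkIds.length));
  return {
    baselineMedian,
    candidateMedian,
    regressed: candidateMedian - baselineMedian > maxIncrease,
  };
}
```

Run in CI against golden sessions, a check like this fails the build before a budget cut reaches users.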

Teams I work with on production assistants often skip this step and blame the model, when the real regression was a RAG budget cut from 8k to 2k tokens to save cost.

Interaction with caching and streaming

Context engineering intersects semantic caching: cache keys must fingerprint the effective context, not only the user string. Two requests with identical questions but different kb_snapshot_id or prompt_version must not share completions.
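A possible way to fingerprint the effective context is sketched below, using SHA-256 from Node's built-in crypto module. EffectiveContext and contextFingerprint are hypothetical names, and the choice of fields is an assumption; include whatever actually changes what the model sees.

```typescript
import { createHash } from "node:crypto";

// Hypothetical set of identifiers that define the effective context.
type EffectiveContext = {
  model: string;
  contextPolicyId: string;
  promptVersion: string;
  kbSnapshotId: string;
};

// Hashes a canonical array so field order and separators cannot collide.
function contextFingerprint(ctx: EffectiveContext): string {
  const canonical = JSON.stringify([
    ctx.model,
    ctx.contextPolicyId,
    ctx.promptVersion,
    ctx.kbSnapshotId,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

// For an exact-match cache, combine the fingerprint with the user string.
function exactCacheKey(ctx: EffectiveContext, userMessage: string): string {
  return createHash("sha256")
    .update(contextFingerprint(ctx))
    .update("\n")
    .update(userMessage)
    .digest("hex");
}
```

For a semantic cache that matches on embedding similarity, the fingerprint can serve as a namespace so lookups never cross context versions.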

For streaming routes, assemble context before opening the stream so token counts and traces are complete in logs even if the client disconnects mid-generation.

Common mistakes and pitfalls

  • Unbounded chat history appended until the provider returns a context-length error. Fail gracefully with a structured error and optional "start new session" path.
  • RAG without budget — dumping twenty chunks because the reranker returned them; the model attends poorly and cost explodes.
  • Tool schema bloat — dozens of tools "just in case," leaving no room for evidence.
  • Non-deterministic assembly — parallel fetches that reorder chunks between requests; hurts reproducibility and caching.
  • No prompt_version on traces — impossible to correlate incidents with deploys.
  • Truncating the user message to fit history — never acceptable; reject or summarize history instead.
  • Ignoring tokenizer mismatch — local countTokens underestimates vs provider billing; leave margin or call the provider's token API in CI.
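For the non-deterministic assembly pitfall, one simple mitigation is to impose a total order on chunks after the parallel fetches resolve. The sketch below assumes each chunk carries a numeric score and a unique id; ScoredChunk and stableOrder are hypothetical names.

```typescript
type ScoredChunk = { id: string; score: number; text: string };

// Orders by score (highest first) and breaks ties by id, so arrival order never matters.
function stableOrder(chunks: ScoredChunk[]): ScoredChunk[] {
  return [...chunks].sort(
    (a, b) => b.score - a.score || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
  );
}

// Two arrival orders of the same chunks produce the same sequence.
const first = stableOrder([
  { id: "b", score: 0.8, text: "..." },
  { id: "a", score: 0.8, text: "..." },
  { id: "c", score: 0.9, text: "..." },
]);
console.log(first.map((c) => c.id).join(","));
```

With a fixed order, identical inputs yield identical prompts, which keeps snapshots diffable and cache keys stable.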

Conclusion

Production LLM applications succeed or fail on what the model actually sees. Context engineering turns that bundle into an explicit, budgeted, versioned contract: layered assembly, truncation policies with clear priority, and evaluation that treats context policy changes as first-class releases. Prompt wording still matters, but it sits inside a larger system you can measure and harden.

If you are building assistants, agent backends, or RAG products that must stay correct as sessions grow and corpora change, designing the context pipeline early avoids expensive retrofitting. For background on related engineering work, see About; for collaboration on scalable, production-ready systems, Contact.
