Semantic caching for LLM APIs: cost, latency, and correctness

Reuse model outputs when prompts are paraphrases, not byte-identical strings. Embedding-based cache keys, TTLs, invalidation, and guardrails for production LLM backends.

Author: Matheus Palma · 7 min read
Software engineering · Artificial intelligence · Backend · API design · Redis · Embeddings

You roll out a support copilot. Traffic doubles; so does your invoice. Worse, p95 latency climbs because every question—however similar to one asked five minutes ago—hits the model again. A classic string-keyed cache (hash(prompt)) barely helps: users rephrase, add typos, or paste different whitespace, so the cache miss rate stays high. Semantic caching treats “close enough” prompts as the same logical query, trading a bit of engineering complexity for predictable cost and faster responses.

This article explains how semantic caches work, where they fit in an LLM stack, and the correctness traps that turn a cost optimization into a privacy or compliance incident. The patterns come up often in production APIs and in consulting work on RAG-style assistants: the goal is not clever math for its own sake, but controlled reuse of expensive, slow inference.

What semantic caching is (and is not)

Semantic caching stores (key → completion) entries where the key is derived from the meaning of the prompt (and optionally structured context), not from the raw text bytes.

  • Not deduplication of identical HTTP bodies—that is ordinary HTTP or application caching.
  • Not a replacement for RAG retrieval: retrieval grounds answers in fresh documents; caching reuses prior model outputs (or distilled summaries) when inputs are semantically near prior ones.
  • Not free: you pay for embedding calls, storage, and the operational work of tuning similarity thresholds and invalidation.

The core loop:

  1. Normalize the request (strip boilerplate, canonicalize locale, redact secrets).
  2. Embed the normalized prompt (and relevant metadata) into a vector.
  3. Search a vector index (often Redis with vector search, pgvector, Pinecone, etc.) for neighbors above a similarity threshold.
  4. On hit, return the cached completion (optionally with a quick consistency check).
  5. On miss, call the model, then insert the new vector and completion.

Steps 3–5 are where latency, cost, and correctness trade-offs live.

Designing the cache key: what goes into the embedding?

The embedding should represent everything that must match for a cached answer to remain valid. Common ingredients:

User-visible prompt text

Embed the user message after light normalization: collapse repeated whitespace, trim, maybe lowercasing for case-insensitive locales. Avoid aggressive rewriting that changes meaning.
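A minimal normalization sketch along those lines, where the optional lowercasing is a per-locale assumption rather than a universal default:

```typescript
/**
 * Collapse runs of whitespace and trim before embedding. Lowercasing is
 * opt-in: safe for case-insensitive locales, lossy elsewhere.
 */
export function normalizePrompt(text: string, lowercase = false): string {
  const collapsed = text.replace(/\s+/g, " ").trim();
  return lowercase ? collapsed.toLowerCase() : collapsed;
}
```

Keeping normalization this light preserves meaning; heavier rewriting (stemming, stop-word removal) risks merging prompts that should stay distinct.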

System instructions and tools

If your route concatenates a system prompt or tool definitions into the effective prompt, those belong in the key. Two requests with the same user text but different system instructions are not interchangeable.

Retrieved context (RAG)

When answers depend on retrieved chunks, the cache key should reflect which documents and versions influenced the answer—not necessarily every token of the chunks.

Practical approaches:

  • Hash of chunk IDs + content versions concatenated with the user embedding or stored as metadata filters.
  • Cache at the retrieval layer separately (embedding of query → chunk set) and cache at the completion layer (prompt + chunk hashes → answer). Layered caching is easier to reason about than one giant blob.
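The first approach can be sketched as an order-insensitive fingerprint over chunk IDs and their content versions (the `ChunkRef` shape here is an assumption, not a standard type):

```typescript
import { createHash } from "crypto";

type ChunkRef = { id: string; version: string };

/**
 * Order-insensitive fingerprint of the evidence set. Sorting a copy makes
 * the result stable regardless of retrieval order, and bumping any chunk's
 * version changes the fingerprint, invalidating dependent cache entries.
 */
export function evidenceFingerprint(chunks: ChunkRef[]): string {
  const canonical = chunks
    .map((c) => `${c.id}@${c.version}`)
    .sort()
    .join("|");
  return createHash("sha256").update(canonical).digest("hex");
}
```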

Model and decoding parameters

Include model name, temperature, top_p, max_tokens (or buckets), and response format (JSON schema id, tool choice). A greedy answer and a creative one are different products even for the “same” prompt.
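One way to fold decoding parameters into the key is a flat string, with max_tokens bucketed so near-identical limits still share entries (the bucket size of 256 is an illustrative assumption):

```typescript
type DecodingParams = {
  model: string;
  temperature: number;
  topP: number;
  maxTokens: number;
  responseFormat?: string; // e.g. a JSON schema id, or "text"
};

/** Deterministic key component for decoding parameters, with bucketed maxTokens. */
export function decodingKey(p: DecodingParams, bucket = 256): string {
  const maxBucket = Math.ceil(p.maxTokens / bucket) * bucket;
  return [p.model, p.temperature, p.topP, maxBucket, p.responseFormat ?? "text"].join("|");
}
```

Bucketing is a judgment call: exact max_tokens in the key fragments the cache, while ignoring it entirely can serve a truncated answer where a long one was requested.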

Tenant and authorization scope

For multi-tenant SaaS, tenant id (and sometimes role or data partition) must be part of the lookup constraints. Never let tenant A receive a completion cached from tenant B’s semantically similar prompt—that is both a security defect and a reputational risk.

Similarity thresholds: precision, recall, and user harm

Choosing a cosine similarity cutoff is not a one-time constant; it is a product decision.

  • Too loose → wrong answers served from cache; users see confident nonsense or outdated policy.
  • Too tight → near-zero hit rate; you pay embedding and index costs without savings.

Treat the threshold like a classifier: measure offline with labeled pairs (prompt_a, prompt_b, should_match) sampled from real logs (with PII scrubbed). In production, shadow mode helps: compute would-hit / would-miss without serving cached text, log outcomes, and compare to fresh completions until you trust the curve.
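Treating the cutoff as a classifier makes the offline evaluation mechanical. A sketch, assuming labeled pairs have already been scored with your embedding model:

```typescript
type LabeledPair = { similarity: number; shouldMatch: boolean };

/** Precision/recall of "serve from cache" at a given similarity cutoff. */
export function evaluateThreshold(
  pairs: LabeledPair[],
  cutoff: number
): { precision: number; recall: number } {
  let tp = 0, fp = 0, fn = 0;
  for (const p of pairs) {
    const predictedHit = p.similarity >= cutoff;
    if (predictedHit && p.shouldMatch) tp++;
    else if (predictedHit && !p.shouldMatch) fp++;
    else if (!predictedHit && p.shouldMatch) fn++;
  }
  return {
    precision: tp + fp === 0 ? 1 : tp / (tp + fp), // of served hits, how many were valid
    recall: tp + fn === 0 ? 1 : tp / (tp + fn),    // of valid reuses, how many we caught
  };
}
```

Sweeping the cutoff over this function gives the precision/recall curve; for high-stakes routes you pick for precision first and accept the lower hit rate.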

When consulting teams ship this for the first time, the failure mode is almost always over-broad matching on short prompts—“What is the refund policy?” across different SKUs, regions, or contract tiers needs metadata gates, not just vector proximity.

TTL, invalidation, and “stale but plausible”

LLM outputs go stale when:

  • Ground truth changes (pricing, legal text, product behavior).
  • Your RAG corpus updates without bumping version metadata.
  • Model upgrades change phrasing or factual tendencies even when prompts match.

Mitigations:

  • Time-to-live (TTL) per route: short for policy-heavy bots, longer for purely stylistic or creative tasks.
  • Version stamps in the key: policy_doc_version, kb_snapshot_id, model_revision.
  • Explicit busting when editors publish: webhook → delete keys by tag or namespace.
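The first two mitigations combine into a single freshness gate at read time. A sketch, assuming entries store a version stamp and creation timestamp as in the example below:

```typescript
type StoredEntry = { kbVersion: string; createdAt: number };

/**
 * An entry is servable only if both gates hold: the version stamp matches
 * the current corpus/policy version AND the TTL has not elapsed.
 */
export function isServable(
  entry: StoredEntry,
  currentKbVersion: string,
  ttlMs: number,
  now: number = Date.now()
): boolean {
  return entry.kbVersion === currentKbVersion && now - entry.createdAt <= ttlMs;
}
```

Checking the version at read time covers the gap between an editor publishing and the webhook-driven delete actually completing.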

Semantic cache hits feel more dangerous than HTTP cache hits because the text reads fluently. Pair aggressive TTLs with telemetry: track hit rate, downstream correction rate (human thumbs-down, support tickets), and drift checks (optionally re-embed stored prompts and compare against the current policy hash).

Where to place the cache in the request path

Two common architectures:

Edge or API gateway (before retrieval)

Useful when many requests are duplicates before you spend on retrieval and reranking. Risk: you may cache without grounding if your key omits corpus version—avoid that for RAG-heavy flows.

After retrieval, before the LLM

The key includes retrieved evidence fingerprints. You only skip the model when both the question and the evidence set match prior work. This preserves grounding alignment better than caching solely on the user question.

After the LLM (response cache)

Conceptually the simplest: cache the final assistant text. It still requires the full effective-input fingerprint described above.

For streaming endpoints, decide whether the cache returns one shot (buffered replay) or re-simulated SSE chunks; replay is simpler and preserves byte-identical transcripts for auditing.
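Buffered replay can be as simple as slicing the cached text into chunks and emitting them as if they were streamed (the chunk size here is an arbitrary assumption; real SSE framing and pacing are omitted):

```typescript
/** Replay a cached completion as stream-style chunks of fixed size. */
export function* replayAsChunks(text: string, chunkSize = 16): Generator<string> {
  for (let i = 0; i < text.length; i += chunkSize) {
    yield text.slice(i, i + chunkSize);
  }
}
```

Because the chunks concatenate back to the exact cached string, the replayed transcript stays byte-identical to the original for auditing.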

Practical example: Redis-style flow with explicit namespaces

The following sketch shows a namespace per tenant and model, a similarity search, and a metadata guard for a kb_version. It is illustrative—not a drop-in library.

import { createHash } from "crypto";

type CacheEntry = {
  embedding: number[];
  completion: string;
  kbVersion: string;
  createdAt: number;
};

function hashChunks(ids: string[]): string {
  // Sort a copy so the caller's array is not mutated; sorting makes the
  // fingerprint independent of retrieval order.
  return createHash("sha256").update([...ids].sort().join("|")).digest("hex");
}

/**
 * Pseudo-code: replace embed(), vectorSearch(), and Redis calls with your stack.
 */
export async function getCachedOrGenerate(params: {
  tenantId: string;
  model: string;
  userPrompt: string;
  systemPrompt: string;
  kbVersion: string;
  chunkIds: string[];
  embed: (text: string) => Promise<number[]>;
  vectorSearch: (ns: string, vector: number[], k: number) => Promise<
    { score: number; entry: CacheEntry }[]
  >;
  generate: () => Promise<string>;
  similarityMin: number;
}): Promise<{ text: string; cache: "hit" | "miss" }> {
  const chunkFp = hashChunks(params.chunkIds);
  const normalizedUser = params.userPrompt.replace(/\s+/g, " ").trim();
  const keyMaterial = `${params.systemPrompt}\n---\n${normalizedUser}`;
  const vector = await params.embed(keyMaterial);

  const ns = `llm:${params.tenantId}:${params.model}:${chunkFp}`;
  const neighbors = await params.vectorSearch(ns, vector, 3);
  const best = neighbors[0];

  if (
    best &&
    best.score >= params.similarityMin &&
    best.entry.kbVersion === params.kbVersion
  ) {
    return { text: best.entry.completion, cache: "hit" };
  }

  const text = await params.generate();
  await persistEntry(ns, {
    embedding: vector,
    completion: text,
    kbVersion: params.kbVersion,
    createdAt: Date.now(),
  });

  return { text, cache: "miss" };
}

declare function persistEntry(ns: string, entry: CacheEntry): Promise<void>;

Notice the triple gate: similarity score, tenant namespace, and kbVersion. Skipping any of them is how subtle bugs reach production.

Common mistakes and pitfalls

  • Caching personally identifiable or regulated content without retention policies. A semantic hit can amplify leakage if prompts echo user data in completions. Prefer not to cache routes that handle sensitive payloads, or store redacted completions only.
  • Ignoring tool calls and function results. If the model invoked a payment API last time, a cached replay must not replay side effects. Cache final assistant-visible text only after external effects are idempotent or moved out of the cached path.
  • Threshold tuning on short prompts. Brief questions sit close together in embedding space; require longer prompts, clarifying questions, or metadata filters for high-stakes domains.
  • No observability. Without hit rate, latency percentiles, and “cache corrected” signals, you cannot tell whether the cache saves money or erodes trust.
  • Stale RAG: embedding the user question alone while documents drift yields confident wrong answers. Tie cache entries to corpus versions.

Conclusion

Semantic caching is a production control for LLM-backed APIs: it reduces cost and tail latency when prompts cluster in meaning space, but only if you treat it like any other distributed cache with security and freshness requirements. Invest in key design (tenant, model, tools, corpus version), threshold validation, and TTL discipline—the same ingredients you would expect when hardening a multi-tenant data layer.

If you are building or hardening LLM features and want them to behave reliably under load, it helps to design caching, retrieval, and observability together rather than bolting on embeddings after launch. For background on how this site approaches engineering work, see About; for collaboration or inquiries, Contact.
