RAG pipelines in production: chunking, retrieval, and evaluation

How to design retrieval-augmented generation for real systems: text splitting, embedding strategy, reranking, guardrails, and how to measure quality before users do.

Author: Matheus Palma · 8 min read
Software engineering · Artificial intelligence · Architecture · Backend · Machine learning · Reliability

A support bot answers a pricing question using last year’s PDF—and cites a paragraph that was superseded three releases ago. The retrieval layer found something relevant-looking; the language model confidently wrapped it in fluent prose. In consulting and product work, that failure mode is familiar: RAG (retrieval-augmented generation) shifts the problem from “can the model write text?” to “did we fetch the right evidence, and can we prove it?” This article is about building RAG pipelines that hold up when documentation drifts, queries are messy, and latency budgets are tight.

You will see how chunking shapes recall, why embedding choice is only part of the story, how evaluation differs from demo-quality retrieval, and which operational choices keep a system maintainable.

What RAG is optimizing for

RAG combines two ideas:

  1. Retrieval — given a user query, select a small set of passages from a corpus (often vector search over chunks, sometimes hybrid with keyword search).
  2. Generation — condition a language model on those passages (and usually system instructions) to produce an answer grounded in retrieved text.

The goal is not merely factual chatbots. It is controlled use of private or frequently updated knowledge without fine-tuning the base model for every document change. The trade-off is architectural: you introduce indexing jobs, embedding APIs or local models, a vector store, and failure modes where retrieval is wrong but the generator sounds right.

Chunking: the lever everyone touches first

Chunks are the units you embed and retrieve. Their size and boundaries determine whether a hit contains enough context to answer—or so much noise that similarity scores mislead you.

Size and overlap

  • Small chunks (e.g. 200–400 tokens) improve precision for pinpoint facts; they risk splitting tables, code blocks, or legal clauses across boundaries.
  • Large chunks preserve local structure and reduce fragmentation; they dilute embeddings—similarity to a short user question may drop because the vector averages over unrelated paragraphs.

Overlap (repeating the last n characters or tokens at the start of the next chunk) reduces the chance that a critical sentence sits on a cut. It increases storage and duplicate hits; tune overlap against your document set rather than copying defaults from a tutorial.

Structure-aware splitting

Plain recursive character splitting works for prose. For Markdown, HTML, or code, splitting on headings, fenced code blocks, or function boundaries often beats fixed token windows: you keep semantic units intact and make citations easier (“Section 3.2 — Rate limits”).

In engagements focused on production-ready knowledge bases, I treat chunking as a schema decision: the same PDF ingested with naive splits versus heading-aware splits can shift answer accuracy more than swapping embedding models.
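As a minimal sketch of heading-aware splitting for Markdown (simplified: flat sections only; real documents also need nested headings and fenced code blocks handled):

```typescript
// Split a Markdown document into sections at heading lines,
// keeping each heading together with the prose under it.
export type Section = { heading: string; text: string };

export function splitMarkdownByHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: "", text: "" };
  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      // New heading: close the previous section if it has content.
      if (current.heading || current.text.trim()) sections.push(current);
      current = { heading: line.replace(/^#+\s*/, ""), text: "" };
    } else {
      current.text += line + "\n";
    }
  }
  if (current.heading || current.text.trim()) sections.push(current);
  return sections;
}
```

Each section can then be chunked further only if it exceeds the size budget, which keeps small sections intact and attaches a citable heading to every chunk.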

Embeddings and vector search: necessary, not sufficient

Embeddings map chunks and queries into a vector space where “closeness” approximates semantic similarity. Practical choices include:

  • Model family — general text embeddings versus code-specific models for repositories; multilingual models if content spans languages.
  • Dimensionality and cost — higher dimensions can help discrimination; they increase index size and query latency.
  • Normalization — many pipelines L2-normalize vectors and use cosine similarity; keep preprocessing consistent between indexing and query time.
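The normalization point is easy to get wrong between indexing and query paths. A minimal sketch: once vectors are L2-normalized, cosine similarity reduces to a plain dot product, so the same helper can serve both sides.

```typescript
// L2-normalize a vector so cosine similarity reduces to a plain dot product.
// Apply the SAME normalization at indexing time and at query time.
export function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  if (norm === 0) return v.slice(); // avoid dividing by zero for all-zero vectors
  return v.map((x) => x / norm);
}

export function dot(a: number[], b: number[]): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
```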

Vector search returns candidates. It does not guarantee best candidates when:

  • The user’s question uses different terminology than the docs (synonymy).
  • The answer requires composing multiple distant sections (no single chunk scores high).
  • The corpus contains near-duplicate or conflicting versions (old and new pricing).

That is why mature systems add hybrid retrieval (BM25 or other lexical scores fused with vectors), metadata filters (product, version, locale), and sometimes rerankers (cross-encoder or a small model that scores query–passage pairs more accurately than a single embedding dot product).
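One common way to fuse lexical and vector results is reciprocal rank fusion (RRF), which merges ranked lists without comparing incompatible raw scores. A sketch, assuming each retriever returns chunk ids in ranked order; `k = 60` is a conventional default, not a tuned value:

```typescript
// Reciprocal rank fusion: merge ranked id lists from multiple retrievers
// (e.g. BM25 and vector search). Each id scores 1 / (k + rank) per list,
// so items ranked well by several retrievers rise to the top.
export function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```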

Grounding, citations, and prompt contracts

Retrieved text should flow into the model through a clear contract:

  • System message: answer only from provided context; if insufficient, say so.
  • User message: include labeled passages ([1], [2]) so the model can cite sources.

Citations are not decoration. They are how humans audit the system and how you build automated checks (does the cited chunk actually contain the claim?). For customer-facing assistants, skipping citations makes debugging almost impossible.
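A cheap first automated check, assuming the bracketed-number citation format above: verify that every cited passage number actually maps to a passage you provided. This is a sketch that only catches hallucinated citation numbers; checking that the cited passage supports the claim needs an NLI model or an LLM judge.

```typescript
// Extract [n] citations from an answer and flag any that do not map
// to a provided passage (passages are numbered 1..passageCount).
export function findInvalidCitations(answer: string, passageCount: number): number[] {
  const cited = new Set<number>();
  for (const m of answer.matchAll(/\[(\d+)\]/g)) {
    cited.add(Number(m[1]));
  }
  return [...cited].filter((n) => n < 1 || n > passageCount);
}
```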

Evaluation: offline metrics and golden sets

Demo retrieval “feels fine” until you measure it. A practical approach:

  1. Curate a golden set — realistic questions with expected supporting chunk ids or passage spans (dozens to hundreds for a serious iteration, more for high-stakes domains).
  2. Retrieval metrics — recall@k (whether any of the top k chunks is correct), mean reciprocal rank, and nDCG if you have graded relevance.
  3. End-to-end metrics — exact match or LLM-as-judge scores on final answers (hide citations from the judge to reduce bias), plus human rubrics for tone and safety where they matter.
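The retrieval metrics above reduce to a few lines once the golden set maps each question to its relevant chunk ids. A sketch (nDCG omitted; the shape of `GoldenExample` is an assumption, not a standard format):

```typescript
export type GoldenExample = { relevantIds: string[]; retrievedIds: string[] };

// recall@k: fraction of examples where at least one relevant chunk
// appears among the top k retrieved results.
export function recallAtK(examples: GoldenExample[], k: number): number {
  const hits = examples.filter((e) =>
    e.retrievedIds.slice(0, k).some((id) => e.relevantIds.includes(id)),
  ).length;
  return hits / examples.length;
}

// Mean reciprocal rank: 1 / (rank of the first relevant result), averaged;
// examples with no relevant result in the list contribute 0.
export function meanReciprocalRank(examples: GoldenExample[]): number {
  const sum = examples.reduce((s, e) => {
    const rank = e.retrievedIds.findIndex((id) => e.relevantIds.includes(id));
    return s + (rank === -1 ? 0 : 1 / (rank + 1));
  }, 0);
  return sum / examples.length;
}
```

Run these before and after every chunking or model change; a drop in recall@k tells you retrieval regressed before any user notices the answers did.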

Without a golden set, you optimize vibes. In freelance and team settings, the cost of a small labeled evaluation set is usually smaller than a week of reactive firefighting after a bad deploy.

Operational concerns: freshness, PII, and cost

  • Freshness — reindex when sources change; version chunks so you can purge stale embeddings. Tie ingestion to CI or CMS webhooks where possible.
  • PII — scrub or block sensitive fields before embedding and storage; retrieval can leak what you indexed.
  • Cost — batch embedding jobs, cache query embeddings for repeated questions, and cap topK and context window sent to the generator.
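The query-embedding cache can start as a memoized wrapper around the embedding call. A sketch under stated assumptions: queries are normalized by trimming and lowercasing, and a production cache would add an LRU bound plus invalidation tied to the embedding model version.

```typescript
// Memoize query embeddings so repeated questions skip the embedding API.
// Caching the Promise (not the resolved value) also deduplicates
// concurrent requests for the same query.
export function cachedEmbedder(
  embed: (q: string) => Promise<number[]>,
): (q: string) => Promise<number[]> {
  const cache = new Map<string, Promise<number[]>>();
  return (q: string) => {
    const key = q.trim().toLowerCase();
    let hit = cache.get(key);
    if (!hit) {
      hit = embed(key);
      cache.set(key, hit);
    }
    return hit;
  };
}
```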

Practical example: minimal indexing and query flow

The following TypeScript sketches a batch indexing path and a query path. It omits vendor-specific SDK details; substitute your vector store client and embedding API.

// --- types ---
export type Chunk = {
  id: string;
  docId: string;
  text: string;
  metadata: Record<string, string | number | boolean>;
};

export type IndexedChunk = Chunk & {
  embedding: number[];
};

// --- chunking (simplified: fixed windows; prefer structure-aware in production) ---
export function chunkText(docId: string, body: string, maxChars = 1200, overlap = 200): Chunk[] {
  const chunks: Chunk[] = [];
  let i = 0;
  let part = 0;
  while (i < body.length) {
    const end = Math.min(i + maxChars, body.length);
    const text = body.slice(i, end).trim();
    if (text.length > 0) {
      chunks.push({
        id: `${docId}#${part}`,
        docId,
        text,
        metadata: { part },
      });
      part += 1;
    }
    if (end >= body.length) break;
    i = Math.max(end - overlap, i + 1); // always advance, even if overlap >= maxChars
  }
  return chunks;
}

// Cosine similarity; the epsilon in the denominator guards against zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
}

// --- query ---
export async function answerQuestion(
  question: string,
  index: IndexedChunk[],
  embedQuery: (q: string) => Promise<number[]>,
  generate: (prompt: string) => Promise<string>,
  topK = 5,
): Promise<{ answer: string; citations: string[] }> {
  const qv = await embedQuery(question);
  const scored = index
    .map((c) => ({ c, score: cosineSimilarity(qv, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  const contextBlock = scored
    .map(({ c }, i) => `[${i + 1}] (doc ${c.docId})\n${c.text}`)
    .join("\n\n");

  const prompt = [
    "You are a careful assistant. Use ONLY the numbered passages below.",
    "If the passages do not contain the answer, say you cannot find it in the documentation.",
    "Cite passage numbers like [1], [2] when you use them.",
    "",
    "Passages:",
    contextBlock,
    "",
    `Question: ${question}`,
  ].join("\n");

  const answer = await generate(prompt);
  const citations = scored.map(({ c }) => c.id);
  return { answer, citations };
}

This example is intentionally minimal: production code adds hybrid search, reranking, token budgeting, redaction, and structured logging with trace ids—patterns that belong in any scalable deployment.

Common mistakes and pitfalls

Trusting vector similarity alone. High score ≠ correct answer; validate with hybrid retrieval, reranking, or metadata filters.

Chunks that break structure. Tables and APIs split mid-row produce misleading embeddings; use structure-aware splitting for those formats.

No evaluation set. You will ship regressions when changing models or chunk sizes without measuring recall@k and answer quality.

Unbounded context. Stuffing the top 20 chunks into the prompt increases latency, cost, and hallucination risk; constrain tokens and enforce citations.

Stale indexes. Documentation updates without reindexing are a silent source of wrong answers; automate ingestion and monitor index age per document.

Ignoring permissions. Retrieval over mixed access levels can surface text the user should not see; filter by tenant or ACL before generation.

Conclusion

RAG moves complexity from the model weights into data engineering: chunking, indexing, retrieval quality, and evaluation. The systems that survive contact with real users are those that treat retrieval as a ranking and governance problem, not a single similaritySearch call.

Key takeaways:

  • Design chunks around semantic units in your source formats; overlap to reduce bad cuts.
  • Combine vector search with lexical or metadata signals when terminology or versioning matters.
  • Ground answers with citations and measure quality with a golden set before and after changes.
  • Operate with fresh indexes, PII discipline, and clear prompt contracts.

Teams building production-ready assistants and internal search benefit from investing in evaluation and ingestion early—before the model’s fluency masks retrieval gaps. For architecture questions or collaboration, the contact page is the right place to reach out.
