Hybrid search in production RAG: BM25, vectors, and rank fusion that actually ships

Pure vector search misses exact tokens; pure keyword search misses paraphrases. Design hybrid retrieval with BM25, embeddings, metadata filters, and reciprocal rank fusion for production RAG.

Author: Matheus Palma · June 14, 2026 · 10 min read

Software engineering, Artificial intelligence, Backend, RAG, PostgreSQL, TypeScript, Search

A support engineer searches your internal docs for ERR_PAYMENT_4029. The vector index returns three paragraphs about “billing issues” and “payment failures,” all plausible and well written—and all wrong for that ticket. The exact error code lives in a runbook chunk that never surfaces because the user’s wording and the document’s wording diverged just enough in embedding space.

Flip the stack to keyword-only search and you fix the code lookup, then lose on paraphrases: “How do I cancel my subscription?” misses a page titled “Ending your plan.” Production RAG systems live in that gap. Hybrid search runs lexical and semantic retrieval in parallel, then fuses ranked lists so you keep precision on identifiers and recall on natural language—without pretending one scoring function fits every query.

This article explains how to design hybrid retrieval for real backends: when each leg helps, how to fuse scores without fooling yourself in evaluation, and where metadata and reranking belong. The patterns show up constantly in consulting work on knowledge bases and copilots; the goal is retrieval you can reason about under load, not a demo that looks good on five hand-picked questions.

Why vector-only retrieval fails in production

Dense retrieval (embeddings + approximate nearest neighbors) is strong when:

Users paraphrase instead of quoting documentation titles.
Concepts are described with different vocabulary (“auto-renew” vs “subscription renewal”).
You need cross-lingual or lightly normalized matching.

It is weak when:

Exact tokens matter: error codes, SKUs, legal clause numbers, config keys, UUID fragments.
Rare strings dominate intent: a six-character product code has little semantic signal in a 768-dimensional vector.
Corpus skew pushes generic chunks to the top—every “payment” article looks similar.
Chunking split the critical sentence across boundaries, so neither chunk embeds the full fact.

These are not edge cases in enterprise knowledge bases. They are the default when docs mix narrative prose with operational tables, JSON samples, and incident runbooks.

Lexical search—BM25 over inverted indexes, PostgreSQL tsvector, OpenSearch match, etc.—excels at token overlap and rare-term weighting. It struggles with synonymy, typos beyond edit distance, and conceptual questions that never share tokens with the source.

Hybrid retrieval accepts that no single representation of “relevance” is sufficient and combines orthogonal signals at ranking time.

Architecture: two indexes, one query plan

A practical hybrid stack has three stages: scoped retrieval on two parallel legs, rank fusion, and an optional reranker:

User query
    │
    ├─► Lexical leg (BM25 / full-text) ──► ranked list A
    │
    ├─► Vector leg (embedding ANN)     ──► ranked list B
    │
    └─► Metadata filters (tenant, product, version, ACL)
              │
              ▼
        Rank fusion (RRF, weighted, or learned)
              │
              ▼
        Optional reranker (cross-encoder)
              │
              ▼
        topK chunks → LLM context

Keep indexes aligned on the same chunk identity

Both legs must return the same chunk primary key (chunk_id). If lexical search indexes whole pages but vectors index 512-token windows, fusion compares unrelated units and silently degrades quality.

In migrations and reindex jobs, treat chunk_id, document_version, and content_hash as a versioned triple. When a document changes, invalidate or supersede chunks atomically in both indexes.
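One way to make the triple checkable is to hash the exact chunk text and compare what each index holds. The sketch below assumes Node's built-in crypto module; ChunkIdentity, contentHash, and findMisaligned are hypothetical names chosen for illustration, not part of any library.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape for the versioned triple described above.
type ChunkIdentity = {
  chunkId: string;
  documentVersion: number;
  contentHash: string;
};

// Hash the exact text that both indexes store for the chunk.
export function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Return chunk ids whose version or hash differs between the two indexes,
// followed by ids that exist only in the vector index.
export function findMisaligned(
  lexical: ChunkIdentity[],
  vector: ChunkIdentity[]
): string[] {
  const byId = new Map(vector.map((c) => [c.chunkId, c] as const));
  const misaligned: string[] = [];
  for (const lex of lexical) {
    const vec = byId.get(lex.chunkId);
    if (
      !vec ||
      vec.documentVersion !== lex.documentVersion ||
      vec.contentHash !== lex.contentHash
    ) {
      misaligned.push(lex.chunkId);
    }
    byId.delete(lex.chunkId);
  }
  // Anything left was never seen on the lexical side.
  return [...misaligned, ...byId.keys()];
}
```

A check like this could run at the end of a reindex job, with a non-empty result blocking the cutover to the new index generation.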

Metadata filters before fusion, not after

For multi-tenant SaaS, apply tenant id, visibility, and product/version constraints in both legs before fusion. Filtering after fusion wastes work and can leak chunks across boundaries if one index lags during rollout.

PostgreSQL with pgvector often implements this as WHERE tenant_id = $1 AND product = $2 on a unified table with both tsvector and embedding columns, or as parallel queries merged in application code.
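A simple guard against the two legs drifting apart is to build the scope predicate in one place and reuse it. The sketch below is illustrative only: scopePredicate, lexicalQuery, and vectorQuery are hypothetical helpers, the table and column names follow this article's SQL shapes, and the { text, values } result is assumed to be passed to a client such as node-postgres.

```typescript
// Hypothetical scope object; fields mirror the predicates in the text.
type Scope = { tenantId: string; product: string };

type ScopeSql = { where: string; params: unknown[]; nextIndex: number };

// Build the tenant/product predicate once so both legs share it.
// Placeholders start at $1; leg-specific parameters continue from nextIndex.
export function scopePredicate(scope: Scope): ScopeSql {
  return {
    where: "tenant_id = $1 AND product = $2",
    params: [scope.tenantId, scope.product],
    nextIndex: 3,
  };
}

export function lexicalQuery(scope: Scope, query: string, limit: number) {
  const s = scopePredicate(scope);
  const q = `$${s.nextIndex}`;
  const lim = `$${s.nextIndex + 1}`;
  return {
    text:
      `SELECT chunk_id FROM knowledge_chunks WHERE ${s.where} ` +
      `AND search_vector @@ websearch_to_tsquery('english', ${q}) ` +
      `ORDER BY ts_rank_cd(search_vector, websearch_to_tsquery('english', ${q})) DESC ` +
      `LIMIT ${lim}`,
    values: [...s.params, query, limit],
  };
}

export function vectorQuery(scope: Scope, vector: number[], limit: number) {
  const s = scopePredicate(scope);
  return {
    text:
      `SELECT chunk_id FROM knowledge_chunks WHERE ${s.where} ` +
      `ORDER BY embedding <=> $${s.nextIndex}::vector ` +
      `LIMIT $${s.nextIndex + 1}`,
    // pgvector accepts the '[1,2,3]' text form cast to vector.
    values: [...s.params, JSON.stringify(vector), limit],
  };
}
```

Because neither leg writes its own WHERE clause, adding a new constraint (for example a visibility flag) changes both queries at once.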

Lexical leg: BM25 and full-text in practice

BM25 (Best Matching 25) scores documents by term frequency with saturation and length normalization—so repeating “payment” ten times does not linearly dominate, and short runbooks are not permanently penalized against long glossaries.

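If you already run PostgreSQL for your app, tsvector + ts_rank_cd (or extensions like pg_bm25 where available) avoids operating a second search cluster for modest corpora. Note that ts_rank_cd is a cover-density ranking, not BM25 itself, so its scores behave differently from an OpenSearch BM25 score even when the resulting order is similar. For larger scale, OpenSearch/Elasticsearch BM25 with analyzers tuned per locale is common.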

Lexical tuning that matters in production:

Analyzers: English stemmer helps “refund/refunded”; do not stem SKU fields. Use multi-fields (stemmed + keyword) for mixed content.
Stop words: removing “the” helps prose; keep them for code-like queries or offer a raw search path when the query matches /[A-Z0-9_]{4,}/.
Boosts: title and heading fields weighted above body text often improve FAQ-style corpora without touching vectors.

Detect query mode early. Short alphanumeric-heavy strings route lexical-first; conversational questions route balanced hybrid. A simple heuristic—token count, regex on error-code shapes, presence of quoted strings—beats a monolithic fusion weight for every request.
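The heuristic can be as small as the following sketch. classifyQuery and QueryMode are hypothetical names, and the thresholds (three tokens, four-character code shapes) are illustrative assumptions rather than tuned values.

```typescript
type QueryMode = "identifier" | "conversational";

// Heuristics named in the text: error-code shapes, quoted strings, token count.
export function classifyQuery(query: string): QueryMode {
  const trimmed = query.trim();
  const tokens = trimmed.split(/\s+/).filter(Boolean);
  const hasCodeShape = /[A-Z0-9_]{4,}/.test(trimmed);
  const hasQuoted = /"[^"]+"/.test(trimmed);
  const isShort = tokens.length <= 3;

  // Quoted phrases and short code-like strings want exact matching.
  if (hasQuoted || (hasCodeShape && isShort)) return "identifier";
  // A longer question that embeds an error code still leans lexical.
  if (/\b[A-Z]{2,}_[A-Z0-9_]+\b/.test(trimmed)) return "identifier";
  return "conversational";
}
```

Logging the chosen mode next to each request makes it possible to report retrieval metrics per route later.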

Vector leg: embeddings without false confidence

The vector leg should use the same embedding model at indexing and query time, with the model version documented in chunk metadata. Silent upgrades (new model, same column) produce incomparable vectors until you reindex.

Operational choices:

Normalize embeddings when your distance metric expects unit vectors (cosine via dot product).
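ANN parameters trade recall for latency. In pgvector, for example, hnsw.ef_search and ivfflat.probes are set at query time, while lists (IVFFlat) and m and ef_construction (HNSW) are fixed when the index is built. Hybrid search tolerates slightly lower ANN recall on the vector leg because BM25 rescues many misses, so measure end-to-end rather than per-leg recall@K in isolation.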
Cache query embeddings for repeated questions; they are pure functions of (model_version, normalized_query).
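Two of these choices fit in a few lines. The sketch below is a minimal in-memory version under stated assumptions: l2Normalize and cachedEmbedder are hypothetical helpers, query normalization is limited to trimming and collapsing whitespace, and the cache is unbounded, so a real service would add eviction.

```typescript
// Scale a vector to unit length so dot product equals cosine similarity.
export function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return v.slice(); // leave the zero vector unchanged
  return v.map((x) => x / norm);
}

// Cache keyed on (modelVersion, normalized query), as described above.
export function cachedEmbedder(
  modelVersion: string,
  embed: (text: string) => Promise<number[]>
): (query: string) => Promise<number[]> {
  const cache = new Map<string, Promise<number[]>>();
  return (query) => {
    const normalizedQuery = query.trim().replace(/\s+/g, " ");
    const key = `${modelVersion}\u0000${normalizedQuery}`;
    let hit = cache.get(key);
    if (!hit) {
      hit = embed(normalizedQuery).then(l2Normalize);
      // Do not keep failed lookups in the cache.
      hit.catch(() => cache.delete(key));
      cache.set(key, hit);
    }
    return hit;
  };
}
```

Storing the promise rather than the resolved vector means concurrent requests for the same query share a single embedding call.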

Rank fusion: combining lists without comparable scores

BM25 scores and cosine similarity live on incommensurate scales. You cannot add 0.82 cosine to 14.7 BM25 and expect stable rankings across queries.

Three fusion approaches appear repeatedly in production:

Reciprocal Rank Fusion (RRF)

RRF ignores raw scores and uses ranks only. For each document d, sum contributions from every retrieval leg:

RRF(d) = sum over legs i of: 1 / (k + rank_i(d))

with constant k (often 60). Documents that appear high in either list rise; documents both legs agree on rise further.

RRF is robust, cheap, and needs no score calibration—ideal when teams lack labeled data for learning-to-rank. The main knob is k: higher k dampens head-heavy dominance from one leg.
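To see what k actually does, it helps to compare how much a rank-1 hit is worth relative to a rank-10 hit at different values. The snippet below is a small illustration; rrfTerm and headAdvantage are hypothetical helper names.

```typescript
// Contribution of a single leg to the RRF score for a given rank.
const rrfTerm = (k: number, rank: number): number => 1 / (k + rank);

// How much more a rank-1 hit is worth than a rank-10 hit, for a given k.
export function headAdvantage(k: number): number {
  return rrfTerm(k, 1) / rrfTerm(k, 10);
}

for (const k of [1, 10, 60]) {
  console.log(`k=${k}: rank 1 is worth ${headAdvantage(k).toFixed(2)}x rank 10`);
}
// k=1: rank 1 is worth 5.50x rank 10
// k=10: rank 1 is worth 1.82x rank 10
// k=60: rank 1 is worth 1.15x rank 10
```

With k = 60 the curve is flat enough that agreement outweighs a single top position: a chunk ranked 5th in both legs scores 2/65 ≈ 0.0308, while a chunk ranked 1st in only one leg scores 1/61 ≈ 0.0164.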

Weighted normalized scores

Min-max or z-score normalize each leg’s top-N scores per query, then combine α * lexical + (1-α) * vector. This works when you have offline labels to tune α per route or query class. It fails when score distributions shift after analyzer changes—retune or fall back to RRF.
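A minimal sketch of the min-max variant follows. ScoredHit, minMax, and weightedFusion are hypothetical names, and the vector leg is assumed to supply a similarity (higher is better), not a distance.

```typescript
type ScoredHit = { chunkId: string; score: number };

// Min-max scale one leg's scores into [0, 1] for a single query.
function minMax(hits: ScoredHit[]): Map<string, number> {
  const scores = hits.map((h) => h.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;
  // When every score is equal, treat all hits as equally strong.
  return new Map(
    hits.map((h) => [h.chunkId, range === 0 ? 1 : (h.score - min) / range] as const)
  );
}

// alpha weights the lexical leg; a chunk missing from a leg contributes 0 there.
export function weightedFusion(
  lexical: ScoredHit[],
  vector: ScoredHit[],
  alpha: number
): ScoredHit[] {
  const lex = minMax(lexical);
  const vec = minMax(vector);
  const ids = new Set([...lex.keys(), ...vec.keys()]);
  return [...ids]
    .map((chunkId) => ({
      chunkId,
      score: alpha * (lex.get(chunkId) ?? 0) + (1 - alpha) * (vec.get(chunkId) ?? 0),
    }))
    .sort((a, b) => b.score - a.score);
}
```

Note that min-max pins the weakest hit in each leg to 0, so the size of the per-leg candidate list changes the normalized scores; that sensitivity is one reason this approach needs retuning after retrieval changes.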

Learned fusion (LTR)

When you log clicks, thumbs-up/down, or downstream answer correctness, a small model on features (lexical_rank, vector_rank, chunk_age, click_rate) can outperform hand weights. Cost: labeling pipeline, drift monitoring, and explainability for compliance-sensitive domains.

For most teams shipping v1 hybrid retrieval, start with RRF, log ranks from both legs, and only move to weighted or learned fusion when evaluation sets justify the complexity.

Evaluation: measure retrieval, not vibes

Hybrid search fails quietly. Answers look fluent while citations are wrong.

Build a golden set of (query, expected_chunk_ids) from:

Support ticket exports (scrub PII).
Search logs with human adjudication on the top miss.
Regression cases for every production incident where retrieval failed.

Report recall@K and MRR (mean reciprocal rank) after fusion, not per leg. Track slices:

Identifier-heavy queries (codes, IDs).
Paraphrase questions.
Long multi-intent questions.
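Both metrics, and the per-slice breakdown, are short enough to own in your codebase. The sketch below is illustrative: GoldenCase, recallAtK, reciprocalRank, and evaluateBySlice are hypothetical names, and the results map is assumed to hold fused chunk ids in best-first order.

```typescript
type GoldenCase = {
  query: string;
  expectedChunkIds: string[];
  queryClass: string; // e.g. "identifier", "paraphrase", "multi-intent"
};

type SliceMetrics = { recall: number; mrr: number; n: number };

// Fraction of expected chunks found in the top K fused results.
export function recallAtK(expected: string[], retrieved: string[], k: number): number {
  if (expected.length === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  return expected.filter((id) => topK.has(id)).length / expected.length;
}

// 1 / rank of the first relevant result, or 0 when none is retrieved.
export function reciprocalRank(expected: string[], retrieved: string[]): number {
  const relevant = new Set(expected);
  const index = retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Average both metrics per query class so one slice cannot hide another.
export function evaluateBySlice(
  cases: GoldenCase[],
  results: Map<string, string[]>, // query -> fused chunk ids, best first
  k: number
): Map<string, SliceMetrics> {
  const sums = new Map<string, SliceMetrics>();
  for (const c of cases) {
    const retrieved = results.get(c.query) ?? [];
    const s = sums.get(c.queryClass) ?? { recall: 0, mrr: 0, n: 0 };
    s.recall += recallAtK(c.expectedChunkIds, retrieved, k);
    s.mrr += reciprocalRank(c.expectedChunkIds, retrieved);
    s.n += 1;
    sums.set(c.queryClass, s);
  }
  for (const s of sums.values()) {
    s.recall /= s.n;
    s.mrr /= s.n;
  }
  return sums;
}
```

Running this in CI against the golden set turns a fusion or analyzer change into a per-slice diff instead of a debate.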

When consulting teams adopt hybrid search, the recurring mistake is tuning on 30 demo questions while production traffic is bimodal. Split metrics by query class or you will over-weight the vector leg and reintroduce the error-code failure mode.

Optional reranking: when fusion is not enough

Cross-encoder rerankers score (query, passage) pairs jointly and are more accurate than bi-encoder dot products. They are also orders of magnitude slower.

Typical pattern: hybrid fusion → top 20–50 → rerank → top 5 to the LLM. Cap reranker batch size and set timeouts; on timeout, fall back to fused order rather than failing the request.
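The timeout fallback can be sketched with Promise.race. Everything here is an assumption for illustration: rerankWithFallback is a hypothetical wrapper, rerank stands in for whatever cross-encoder client you use and is expected to return candidates best first, and the default limits are placeholders.

```typescript
type Candidate = { chunkId: string; text: string };

// Rerank the fused head; on timeout or error, keep the fused order.
export async function rerankWithFallback(
  query: string,
  fused: Candidate[],
  rerank: (query: string, candidates: Candidate[]) => Promise<Candidate[]>,
  opts: { maxCandidates?: number; timeoutMs?: number; topN?: number } = {}
): Promise<Candidate[]> {
  const { maxCandidates = 30, timeoutMs = 300, topN = 5 } = opts;
  const head = fused.slice(0, maxCandidates); // cap the reranker batch

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });

  try {
    // Whichever settles first wins; null means the timer fired.
    const reranked = await Promise.race([rerank(query, head), timeout]);
    return (reranked ?? head).slice(0, topN);
  } catch {
    return head.slice(0, topN); // reranker failed: serve fused order
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

This sketch does not cancel the in-flight reranker call after a timeout; if your client supports cancellation, wiring it in avoids paying for work whose result is discarded.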

Reranking pays off when:

Chunks are long and noisy.
Stakes are high (medical, financial, internal security policies).
You already pay for a strong generator—cheap retrieval errors waste expensive tokens downstream.

Practical example: parallel retrieval and RRF in TypeScript

The following example uses PostgreSQL-style functions as boundaries; swap in OpenSearch or a managed vector DB while keeping the same fusion layer.

type ChunkHit = {
  chunkId: string;
  rank: number; // 1-based per leg
};

type FusedHit = {
  chunkId: string;
  score: number;
};

const RRF_K = 60;

export function reciprocalRankFusion(
  lists: ChunkHit[][]
): FusedHit[] {
  const scores = new Map<string, number>();

  for (const list of lists) {
    for (const { chunkId, rank } of list) {
      const inc = 1 / (RRF_K + rank);
      scores.set(chunkId, (scores.get(chunkId) ?? 0) + inc);
    }
  }

  return [...scores.entries()]
    .map(([chunkId, score]) => ({ chunkId, score }))
    .sort((a, b) => b.score - a.score);
}

export async function hybridRetrieve(params: {
  tenantId: string;
  product: string;
  query: string;
  embed: (text: string) => Promise<number[]>;
  lexicalSearch: (q: {
    tenantId: string;
    product: string;
    query: string;
    limit: number;
  }) => Promise<string[]>;
  vectorSearch: (q: {
    tenantId: string;
    product: string;
    vector: number[];
    limit: number;
  }) => Promise<string[]>;
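  // One shared limit, or per-leg limits in the shape pickLegLimits returns.
  limitPerLeg?: number | { lexical: number; vector: number };
  finalLimit?: number;
}): Promise<FusedHit[]> {
  const perLeg = params.limitPerLeg ?? 30;
  const limits =
    typeof perLeg === "number" ? { lexical: perLeg, vector: perLeg } : perLeg;
  const finalLimit = params.finalLimit ?? 8;

  // Run both legs concurrently. The lexical leg starts immediately instead of
  // waiting for the embedding call, which only the vector leg depends on.
  const [lexicalIds, vectorIds] = await Promise.all([
    params.lexicalSearch({
      tenantId: params.tenantId,
      product: params.product,
      query: params.query,
      limit: limits.lexical,
    }),
    params.embed(params.query).then((vector) =>
      params.vectorSearch({
        tenantId: params.tenantId,
        product: params.product,
        vector,
        limit: limits.vector,
      })
    ),
  ]);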

  const lexicalHits: ChunkHit[] = lexicalIds.map((chunkId, i) => ({
    chunkId,
    rank: i + 1,
  }));
  const vectorHits: ChunkHit[] = vectorIds.map((chunkId, i) => ({
    chunkId,
    rank: i + 1,
  }));

  return reciprocalRankFusion([lexicalHits, vectorHits]).slice(0, finalLimit);
}

/** Route identifier-heavy queries toward lexical-first limits. */
export function pickLegLimits(query: string): {
  lexical: number;
  vector: number;
} {
  const looksLikeCode = /\b[A-Z]{2,}_[A-Z0-9_]+\b/.test(query);
  const hasQuoted = /"[^"]+"|'[^']+'/.test(query);
  if (looksLikeCode || hasQuoted) {
    return { lexical: 40, vector: 15 };
  }
  return { lexical: 20, vector: 30 };
}

Example lexical SQL shape (simplified):

SELECT chunk_id
FROM knowledge_chunks
WHERE tenant_id = $1
  AND product = $2
  AND search_vector @@ websearch_to_tsquery('english', $3)
ORDER BY ts_rank_cd(search_vector, websearch_to_tsquery('english', $3)) DESC
LIMIT $4;

Example vector SQL shape with pgvector:

SELECT chunk_id
FROM knowledge_chunks
WHERE tenant_id = $1
  AND product = $2
ORDER BY embedding <=> $3::vector
LIMIT $4;

Wire pickLegLimits into limitPerLeg per request. Log both raw lists and fused order with a trace_id so you can replay bad answers without guessing which leg failed.
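A trace record does not need much structure to be useful. In the sketch below, RetrievalTrace and explainHit are hypothetical names; the idea is simply to keep both raw lists next to the fused order so any chunk can be explained after the fact.

```typescript
type RetrievalTrace = {
  traceId: string;
  query: string;
  lexicalIds: string[]; // best first, as returned by the lexical leg
  vectorIds: string[]; // best first, as returned by the vector leg
  fusedIds: string[]; // best first, after fusion
};

// Explain one chunk: its 1-based rank in each list, or null if absent.
export function explainHit(trace: RetrievalTrace, chunkId: string) {
  const rankIn = (ids: string[]): number | null => {
    const i = ids.indexOf(chunkId);
    return i === -1 ? null : i + 1;
  };
  return {
    traceId: trace.traceId,
    chunkId,
    lexicalRank: rankIn(trace.lexicalIds),
    vectorRank: rankIn(trace.vectorIds),
    fusedRank: rankIn(trace.fusedIds),
  };
}
```

When an expected chunk shows a lexical rank but a null vector rank (or the reverse), you know which leg to investigate before touching fusion.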

Common mistakes and pitfalls

Fusing incompatible chunk granularities. Page-level BM25 fused with paragraph-level vectors duplicates content and rewards boilerplate. Unify chunking first.
Ignoring index lag. If lexical reindex completes before vector reindex (or vice versa), fusion surfaces orphan or stale chunks. Version documents and block queries on mismatched index generations.
One global fusion weight. Error-code queries and conceptual questions need different leg emphasis. Use lightweight query routing instead of a single α.
Skipping ACL on one leg. A vector query that omits tenant_id is a data breach waiting for a semantic near-neighbor. Enforce the same predicates everywhere.
Tuning on answer quality only. A stronger LLM masks bad retrieval. Evaluate chunk recall directly; generator metrics are downstream.
Oversized context after fusion. Fused top-20 chunks blow token budgets and dilute attention. Fusion selects candidates; token budgeting (dedupe, MMR diversity, reranker trim) still applies before the model call.
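For the last pitfall, a budgeting pass over the fused order is a reasonable starting point. The sketch below is deliberately simple and rests on stated assumptions: fitToBudget is a hypothetical helper, duplicates are detected by exact text match only, and the token estimate (about 4 characters per token) is a rough stand-in for the generator's real tokenizer.

```typescript
type ContextChunk = { chunkId: string; text: string };

// Rough token estimate; replace with the model's tokenizer in production.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Walk the fused order, skip exact duplicate texts, and keep what fits.
export function fitToBudget(chunks: ContextChunk[], maxTokens: number): ContextChunk[] {
  const seen = new Set<string>();
  const kept: ContextChunk[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const key = chunk.text.trim();
    if (seen.has(key)) continue; // same text under another chunk id
    const cost = estimateTokens(chunk.text);
    if (used + cost > maxTokens) continue; // a smaller later chunk may still fit
    seen.add(key);
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

This keeps the fused ranking intact and only drops what cannot fit; diversity methods such as MMR or a reranker trim would replace the simple walk.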

Conclusion

Hybrid search is the default architecture for production RAG over mixed corpora: operational identifiers, narrative docs, and paraphrased user language all show up in the same traffic. Run lexical and vector retrieval in parallel, align on stable chunk identities, filter metadata and ACLs before fusion, and combine rankings with RRF until labeled data supports something smarter. Add reranking when fusion plateaus, and measure recall after fusion on query classes that mirror real tickets—not on paraphrase-only demos.

Getting retrieval right is most of the battle in grounded assistants. Teams that treat search as a first-class subsystem (versioned indexes, eval sets, traced fusion) ship copilots that stay trustworthy as the corpus grows. If you are designing or hardening RAG for a multi-tenant product, it helps to treat lexical and vector legs as one coordinated pipeline rather than bolting embeddings onto an existing FAQ search.
