HOW-TO · RAG

How to Build BM25 and Vector Hybrid Retrieval

Intermediate · 25 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Prerequisites

BM25 library (rank_bm25), vector store, sample documents indexed

What this does

Hybrid retrieval combines BM25 keyword retrieval with dense vector search to capture both exact term matches and semantic meaning. BM25 is synonym-blind: it scores documents on exact term overlap, so it handles keyword-specific queries (e.g., "deploy kubernetes pod") with precision, while vector search recovers semantically related documents that use different vocabulary. The two ranked lists are then merged with Reciprocal Rank Fusion (RRF) or a similar rank-fusion technique to produce a single ranked list.

Steps

  1. Tokenize your document chunks for BM25 using the same tokenization scheme you will apply to queries (typically lowercasing plus whitespace or simple regex-based splitting). Build a BM25Okapi index from the tokenized corpus.
  2. For a given query, run BM25 retrieval: tokenize the query, then call bm25_index.get_scores() (one score per document, in corpus order) or bm25_index.get_top_n() (the top-N documents themselves) to get the top-N keyword matches.
  3. Run vector retrieval by encoding the query with your embedding model and performing a similarity search against the vector store to get the top-N semantically similar documents.
  4. Combine the two result sets using Reciprocal Rank Fusion: for each document, compute RRF_score = 1 / (k + bm25_rank) + 1 / (k + vector_rank), where k is a constant (commonly 60).
  5. A document returned by only one retriever is assigned a rank of infinity in the other list, so that term contributes 0 and the document is still scored on the ranking where it does appear.
  6. Sort all documents by their combined RRF score in descending order and return the top-K results.
  7. Tune the parameter k experimentally: lower values give more weight to each retriever's top-ranked results, while higher values flatten the differences between ranks and favor documents that rank consistently in both lists.
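
Steps 4 to 6 can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name rrf_fuse and the chunk IDs are hypothetical, and it assumes each retriever has already returned its results as a best-first list of shared chunk IDs.

```python
from collections import defaultdict


def rrf_fuse(bm25_ids, vector_ids, k=60, top_k=5):
    """Fuse two best-first lists of chunk IDs with Reciprocal Rank Fusion.

    A chunk missing from one list is treated as having infinite rank there,
    so it simply receives no contribution from that retriever.
    """
    scores = defaultdict(float)
    for ranked in (bm25_ids, vector_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk ID for stable output.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical chunk IDs, best match first in each list.
bm25_ids = ["pg-timeout", "pg-pool", "pg-install"]
vector_ids = ["db-conn-failure", "pg-timeout", "pg-pool"]

for chunk_id, score in rrf_fuse(bm25_ids, vector_ids):
    print(f"{chunk_id}: {score:.4f}")
# pg-timeout: 0.0325
# pg-pool: 0.0320
# db-conn-failure: 0.0164
# pg-install: 0.0159
```

Chunks found by both retrievers score roughly twice as high as chunks found by only one, which is why "pg-pool" outranks the vector search's own top hit here.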

Verification

Query your hybrid system with a keyword-heavy question (e.g., "PostgreSQL connection timeout error"). Verify that BM25 returns relevant matches using exact term overlap, and that vector search returns semantically related matches even when terminology differs.

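Expected output (illustrative; exact scores depend on your corpus, embedding model, and k): BM25 top doc: "PostgreSQL timeout troubleshooting guide" (score 8.4). Vector top doc: "Database connection failure resolution" (score 0.91). Fused top doc: "PostgreSQL timeout troubleshooting guide" (RRF 0.049). BM25 and vector retrieved 8 common documents, and fusion reordered 3 documents from the lower half of each list into the top 5. Note that with two retrievers the RRF score cannot exceed 2 / (k + 1), which is about 0.033 at k = 60, so a fused score near 0.049 implies a k of about 40 or lower.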
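
The checks above can be automated with a small report. The sketch below is one possible reading of them: the function name fusion_report and the single-letter chunk IDs are hypothetical, and "lower half" is assumed to mean a position at or past the midpoint of a retriever's list.

```python
def fusion_report(bm25_ids, vector_ids, fused_ids, top=5):
    """Summarize how fusion changed the two input rankings."""

    def in_lower_half(chunk_id, ranked):
        # Positions at or past the midpoint count as the lower half.
        return chunk_id in ranked and ranked.index(chunk_id) >= len(ranked) // 2

    # Chunks in the fused top results that sat in the lower half of BOTH lists.
    promoted = [
        cid for cid in fused_ids[:top]
        if in_lower_half(cid, bm25_ids) and in_lower_half(cid, vector_ids)
    ]
    return {
        "common": len(set(bm25_ids) & set(vector_ids)),
        "bm25_top": bm25_ids[0],
        "vector_top": vector_ids[0],
        "fused_top": fused_ids[0],
        "promoted": promoted,
    }


# Hypothetical best-first results for one query.
bm25_ids = ["a", "b", "c", "d", "e", "f"]
vector_ids = ["u", "v", "w", "e", "x", "d"]
# Order these two lists produce under RRF with k = 60 (ties broken by ID).
fused_ids = ["e", "d", "a", "u", "b", "v", "c", "w", "x", "f"]

report = fusion_report(bm25_ids, vector_ids, fused_ids)
print(report)
```

Here "e" and "d" are the only chunks both retrievers returned, and fusion lifts them from the lower half of each list to the top two positions.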

Common failures

  1. Tokenization mismatch between indexing and query: If your BM25 index was built with a different tokenizer than the one applied to your queries, term matching fails. Ensure both the indexing and query phases use identical tokenization rules (lowercasing, punctuation stripping, whitespace splitting).
  2. Document ID alignment between systems: BM25 and vector stores use different ID schemes. rank_bm25 identifies documents only by their position in the corpus list, while vector stores typically assign their own IDs. You must map both retriever outputs to a common document identifier before fusion. Use a shared unique ID field (e.g., chunk_id), kept alongside the BM25 corpus and stored as metadata in the vector store.
  3. Fusion parameter k produces poor ranking: A k value that is too small lets each retriever's top-ranked results dominate the fused list, while a k that is too large flattens the differences between ranks and erodes the benefit of fusion. Test k values of 20, 40, 60, and 80 against your labeled evaluation set to find the best value.
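
Each of the three failures has a small guard, sketched below in plain Python. All names here (tokenize, to_chunk_ids, rrf_order, mean_reciprocal_rank) and the sample data are hypothetical, the tokenizer assumes ASCII text, and mean reciprocal rank is used only as one possible evaluation metric.

```python
import re


# 1. One tokenizer, used for both indexing and querying.
def tokenize(text):
    """Lowercase and keep alphanumeric runs (assumes ASCII text)."""
    return re.findall(r"[a-z0-9]+", text.lower())


# 2. Map BM25 corpus positions to the shared chunk_id before fusion.
chunks = [
    {"chunk_id": "pg-timeout", "text": "PostgreSQL timeout troubleshooting guide."},
    {"chunk_id": "db-conn", "text": "Database connection failure resolution"},
]
tokenized_corpus = [tokenize(c["text"]) for c in chunks]  # input for the BM25 index
position_to_id = [c["chunk_id"] for c in chunks]


def to_chunk_ids(positions):
    """Translate positional BM25 results into shared chunk IDs."""
    return [position_to_id[i] for i in positions]


# 3. Sweep k against labeled queries.
def rrf_order(rankings, k):
    scores = {}
    for ranked in rankings:
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def mean_reciprocal_rank(labeled, k):
    """labeled: list of (bm25_ids, vector_ids, relevant_id) triples."""
    total = 0.0
    for bm25_ids, vector_ids, relevant in labeled:
        fused = rrf_order([bm25_ids, vector_ids], k)
        if relevant in fused:
            total += 1.0 / (fused.index(relevant) + 1)
    return total / len(labeled)


# One hypothetical labeled query: "y" is relevant and ranks 5th in both lists,
# while "x" ranks 1st in BM25 but only 10th in vector search.
labeled = [(
    ["x", "b2", "b3", "b4", "y"],
    ["v1", "v2", "v3", "v4", "y", "v6", "v7", "v8", "v9", "x"],
    "y",
)]
for k in (20, 40, 60, 80):
    print(k, mean_reciprocal_rank(labeled, k))
```

On this toy data the relevant chunk sits in second place at k = 20 and moves to first place from k = 40 upward, which shows how a small k rewards a single top-ranked hit over consistent mid-list agreement. A real sweep needs many labeled queries before the differences between k values mean anything.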

Related guides

  • evaluate-reranking-quality-ndcg
  • setup-cross-encoder-reranking