Retrieval Strategies — RAG Systems: Part 1 (Chapter 12)

Retrieval is the most critical and most commonly broken part of any RAG system: a model can only generate correct answers if it receives relevant context. This chapter covers three retrieval strategies in depth: query rewriting, hybrid search, and reranking.

Query Rewriting

Before searching, rewrite the user query to match how documents are written. A query like "how do I fix error code 500" should become "HTTP 500 internal server error troubleshooting". Query rewriting improves recall when users ask vague questions.

from your_rag_library import QueryRewriter  # placeholder library name

rewriter = QueryRewriter(strategy="decompose")
rewritten_queries = rewriter.rewrite(
    "How do I install it and what are the system requirements?"
)
# Illustrative output, one sub-query per part of the question:
# ["how to install it",
#  "system requirements"]

The decompose strategy splits compound questions into sub-queries. It works well when users ask multi-part questions like "How do I install it and what are the system requirements?" Each sub-query is searched separately, so each one can retrieve different relevant chunks.
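To make the decomposition step concrete, here is a minimal rule-based sketch. It is an assumption for illustration only: the decompose helper and the QUESTION_WORDS pattern are hypothetical, and a production rewriter would more likely call a language model than rely on a regular expression.

```python
import re

# Hypothetical helper, not part of any library: a rule-based stand-in for
# the model call that a real query rewriter would typically make.
QUESTION_WORDS = r"(?:how|what|when|where|which|who|why|can|does|do|is|are)"


def decompose(query: str) -> list[str]:
    """Split a compound question at 'and' when a new question starts after it."""
    parts = re.split(
        rf"\s+and\s+(?={QUESTION_WORDS}\b)",
        query.strip(),
        flags=re.IGNORECASE,
    )
    # Trim each piece and give it a single trailing question mark.
    return [p.strip().rstrip("?.!") + "?" for p in parts if p.strip()]


sub_queries = decompose("How do I install it and what are the system requirements?")
for sub_query in sub_queries:
    print(sub_query)
```

The lookahead keeps phrases such as "salt and pepper" intact, because the split only happens when a question word follows "and". Each resulting sub-query would then be sent to the retriever on its own and the results merged.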

Hybrid Search

No single retrieval method works for all queries. Dense retrieval excels at semantic similarity ("neural network best practices") but struggles with exact identifiers ("model v2.3.1"). Sparse methods like BM25 excel at exact term matches but miss synonyms.

Hybrid search combines dense and sparse scores:

from your_rag_library import HybridRetriever

retriever = HybridRetriever()

results = retriever.search(
    query="python async await",
    top_k=10,
    alpha=0.5  # 1.0 = dense only, 0.0 = sparse only, 0.5 = equal weight
)

Setting alpha=0.7 favors semantic matching. Setting alpha=0.3 favors exact keyword matching. Tune this based on your document structure. Technical documentation with version numbers benefits from lower alpha values.
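The effect of alpha can be sketched with a small score-fusion function. This is one possible approach under stated assumptions, not the method of any particular library: the min_max and hybrid_scores helpers are hypothetical, the scores are made up, and min-max normalization followed by a weighted sum is only one of several fusion schemes (reciprocal rank fusion is a common alternative).

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(
    dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Fuse scores as alpha * dense + (1 - alpha) * sparse, best first."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores for three documents (cosine similarity and BM25-style values).
dense = {"guide": 0.82, "changelog": 0.40, "faq": 0.75}
sparse = {"changelog": 12.0, "guide": 3.0, "faq": 1.0}

semantic_first = hybrid_scores(dense, sparse, alpha=0.7)
keyword_first = hybrid_scores(dense, sparse, alpha=0.3)
```

With these invented numbers, the document with the strongest dense score ranks first at alpha=0.7, and the one with the strongest sparse score ranks first at alpha=0.3. Normalizing before mixing matters because raw BM25 scores and cosine similarities live on different scales.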

Reranking

First-stage retrieval optimizes for speed and recall. Reranking optimizes for precision. A cross-encoder reranker takes query-document pairs and outputs relevance scores:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

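# First stage: fast vector search. vector_store and query are assumed to be
# defined elsewhere; search() is assumed to return a list of chunk strings.
initial_results = vector_store.search(query, top_k=50)

# Second stage: precise reranking
ranked = reranker.rank(query, initial_results, top_k=10)
# rank() returns dicts with "corpus_id" and "score", sorted by score (descending)
reranked = [initial_results[hit["corpus_id"]] for hit in ranked]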

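Reranking can add roughly 100-300 ms of latency, depending on the model, the hardware, and the number of candidates scored. Use it as a second stage when your latency budget allows. For streaming responses, rerank asynchronously: display the initial results first, then replace them once the reranked results arrive.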
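The asynchronous pattern can be sketched with asyncio. Everything here is an assumption for illustration: slow_rerank is a hypothetical placeholder that counts matching query terms where a real cross-encoder would run, and on_update stands in for whatever callback pushes results to the user interface.

```python
import asyncio
from typing import Callable


def slow_rerank(query: str, chunks: list[str]) -> list[str]:
    """Placeholder scorer: orders chunks by how many query terms they contain.

    A real system would call a cross-encoder here instead.
    """
    terms = query.lower().split()
    return sorted(
        chunks,
        key=lambda chunk: sum(term in chunk.lower() for term in terms),
        reverse=True,
    )


async def retrieve_with_async_rerank(
    query: str,
    initial_results: list[str],
    on_update: Callable[[str, list[str]], None],
) -> list[str]:
    # Show first-stage results immediately so the user is not kept waiting.
    on_update("initial", initial_results)
    # Run the blocking reranker in a worker thread to keep the event loop free.
    reranked = await asyncio.to_thread(slow_rerank, query, initial_results)
    # Replace the displayed results once the better ordering is available.
    on_update("reranked", reranked)
    return reranked


def print_update(stage: str, results: list[str]) -> None:
    print(stage, results)


chunks = ["intro to threads", "async await tutorial", "await basics"]
final = asyncio.run(retrieve_with_async_rerank("async await", chunks, print_update))
```

asyncio.to_thread requires Python 3.9 or later. The design choice is that the caller sees two updates, so the interface has to tolerate results changing order after the first render.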

When to Use Each Strategy

A three-stage pipeline is a common setup for production systems:

1. Initial fast retrieval: Return the top 50-100 results using vector search
2. Rerank: Reduce to the top 10-20 using a cross-encoder
3. Return: Pass the top 5-10 to the generation model

This pipeline costs more compute and latency than single-stage retrieval, but it tends to produce noticeably better answers for complex queries.
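The three stages can be wired together as a small function that takes the retriever and reranker as arguments. This is a sketch under stated assumptions: three_stage_retrieve, toy_first_stage, toy_rerank, and the tiny CORPUS are all hypothetical stand-ins, with term overlap used where a real system would use vector search and a cross-encoder.

```python
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]
Reranker = Callable[[str, List[str]], List[str]]


def three_stage_retrieve(
    query: str,
    first_stage: Retriever,
    rerank: Reranker,
    candidate_k: int = 50,
    rerank_k: int = 10,
    context_k: int = 5,
) -> List[str]:
    candidates = first_stage(query, candidate_k)      # stage 1: fast, recall-oriented
    shortlist = rerank(query, candidates)[:rerank_k]  # stage 2: precise ordering
    return shortlist[:context_k]                      # stage 3: context for the model


# Toy stand-ins so the sketch runs without any external services.
CORPUS = [
    "reset the router",
    "router firmware update steps",
    "update billing details",
    "firmware update for the router model x",
]


def toy_first_stage(query: str, k: int) -> List[str]:
    """Keep any chunk that shares at least one term with the query."""
    terms = set(query.lower().split())
    return [chunk for chunk in CORPUS if terms & set(chunk.split())][:k]


def toy_rerank(query: str, chunks: List[str]) -> List[str]:
    """Order chunks by the number of query terms they share."""
    terms = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(terms & set(c.split())), reverse=True)


context = three_stage_retrieve(
    "router firmware update", toy_first_stage, toy_rerank, context_k=2
)
```

Passing the two stages in as callables keeps the pipeline independent of any specific vector store or reranking model, so either one can be swapped without touching the control flow.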