12. Retrieval Strategies
Retrieval is the most critical and most commonly broken part of any RAG system. A model can only generate correct answers if it receives relevant context. This chapter covers three retrieval strategies in depth.
Query Rewriting
Before searching, rewrite the user query to match how documents are written. A query like "how do I fix error code 500" should become "HTTP 500 internal server error troubleshooting". Query rewriting improves recall when users ask vague questions.
from your_rag_library import QueryRewriter
rewriter = QueryRewriter(strategy="decompose")
rewritten_queries = rewriter.rewrite(
    "What machines need regular maintenance?"
)
# Returns: ["machines needing regular maintenance",
#           "maintenance schedules for equipment",
#           "preventive maintenance requirements"]
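The decompose strategy splits compound questions into sub-queries. The question in the example above has only one part, so its output consists of paraphrased variants rather than true sub-queries. Decomposition pays off when users ask multi-part questions like "How do I install it and what are the system requirements?", because each sub-query retrieves a different set of relevant chunks.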
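To make the idea of decomposition concrete, here is a minimal sketch that splits a compound question on "and" and semicolons. The `decompose` function is a hypothetical helper written for this illustration, not part of any library; a real rewriter would more likely use an LLM or a parser, since a plain split also breaks phrases such as "search and replace".

```python
import re


def decompose(question: str) -> list[str]:
    """Split a compound question into sub-queries (hypothetical helper).

    Splits on the word "and" or on semicolons. This is deliberately naive:
    it only illustrates the shape of the output.
    """
    text = question.strip().rstrip("?")
    parts = re.split(r"\s+and\s+|;\s*", text)
    return [part.strip() for part in parts if part.strip()]


sub_queries = decompose("How do I install it and what are the system requirements?")
single = decompose("What machines need regular maintenance?")
print(sub_queries)
print(single)
```

Each element of `sub_queries` would then be sent to the retriever separately and the retrieved chunks merged before generation.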
Hybrid Search
No single retrieval method works for all queries. Dense retrieval excels at semantic similarity ("neural network best practices") but struggles with exact identifiers ("model v2.3.1"). Sparse methods like BM25 excel at exact term matches but miss synonyms.
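The strength and weakness of sparse retrieval can be seen in a small, self-contained BM25 sketch. It uses whitespace tokenization and the common defaults k1=1.5 and b=0.75, all of which are assumptions made for illustration; the documents are made up, and a production system would use a tuned library implementation.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no exact match, no contribution
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "release notes for model v2.3.1",
    "release notes for model v2.4.0",
    "tips for training neural networks",
]
exact = bm25_scores("model v2.3.1", docs)
synonym = bm25_scores("deep learning advice", docs)
print(exact)
print(synonym)
```

The exact identifier ranks the first document highest, while the paraphrased query scores zero everywhere because it shares no literal term with the document about neural networks.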
Hybrid search combines dense and sparse scores:
from your_rag_library import HybridRetriever
retriever = HybridRetriever(
    dense_weight=0.6,   # default: 60% semantic similarity
    sparse_weight=0.4,  # default: 40% keyword match
)
results = retriever.search(
    query="python async await",
    top_k=10,
    # 1.0 = dense only, 0.0 = sparse only; assumed here to override
    # the constructor weights for this one call
    alpha=0.5,
)
Setting alpha=0.7 weights the dense score at 70% and so favors semantic matching. Setting alpha=0.3 favors exact keyword matching. Tune alpha against a sample of real queries over your own documents. Technical documentation full of version numbers and other exact identifiers benefits from lower alpha values.
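The effect of alpha is easier to see with the arithmetic written out. The sketch below assumes one common fusion scheme: min-max normalize each score set, then take alpha times the dense score plus (1 - alpha) times the sparse score. The function names, document ids, and scores are invented for illustration, and your retrieval library may combine scores differently.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse values are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (value - low) / (high - low) for doc_id, value in scores.items()}


def fuse(dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5) -> list[tuple[str, float]]:
    """Weighted fusion: alpha=1.0 is dense only, alpha=0.0 is sparse only."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in dense_n.keys() | sparse_n.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: cosine similarities (dense) and BM25 values (sparse).
dense = {"guide": 0.82, "changelog": 0.40, "faq": 0.10}
sparse = {"changelog": 12.0, "guide": 3.0, "api": 1.0}

semantic_first = fuse(dense, sparse, alpha=0.7)
keyword_first = fuse(dense, sparse, alpha=0.3)
print(semantic_first[0][0], keyword_first[0][0])
```

With these toy numbers the top result flips from "guide" at alpha=0.7 to "changelog" at alpha=0.3, which is the trade-off the parameter controls.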
Reranking
First-stage retrieval optimizes for speed and recall. Reranking optimizes for precision. A cross-encoder reranker takes query-document pairs and outputs relevance scores:
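from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# First stage: fast vector search (vector_store and query come from your own setup)
initial_results = vector_store.search(query, top_k=50)
# Second stage: precise reranking. rank() takes plain strings, so pass the chunk
# texts (assumed here to live in a .text attribute on each result).
ranked = reranker.rank(query, [chunk.text for chunk in initial_results])
# rank() returns dicts with "corpus_id" and "score", best first; map them back to chunks
reranked = [initial_results[hit["corpus_id"]] for hit in ranked]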
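Reranking can add roughly 100-300ms of latency, and the exact cost depends on the model size, the hardware, and the number of candidates. Use it as a second stage when latency allows. For streaming responses, rerank asynchronously and display the initial results while awaiting the reranked ones.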
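The asynchronous pattern can be sketched with asyncio (Python 3.9 or later for `asyncio.to_thread`). Here `slow_rerank` is a stand-in that ranks by word overlap instead of calling a real cross-encoder, and `answer` is a hypothetical function name; the point is only the control flow of showing first-stage results while the reranker runs in a worker thread.

```python
import asyncio


def slow_rerank(query: str, chunks: list[str]) -> list[str]:
    """Stand-in for a cross-encoder: rank chunks by words shared with the query."""
    terms = set(query.lower().split())
    return sorted(chunks, key=lambda chunk: len(terms & set(chunk.lower().split())), reverse=True)


async def answer(query: str, initial: list[str]) -> tuple[list[str], list[str]]:
    # Run the blocking reranker in a worker thread so the event loop stays free.
    rerank_task = asyncio.create_task(asyncio.to_thread(slow_rerank, query, initial))
    shown_first = initial[:3]      # display these immediately
    reranked = await rerank_task   # swap in once reranking finishes
    return shown_first, reranked[:3]


initial = [
    "intro to python",
    "async io basics",
    "python async await guide",
    "await syntax",
]
shown_first, final = asyncio.run(answer("python async await", initial))
print(shown_first)
print(final)
```

In a real interface the first list would be rendered right away and replaced when the second arrives.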
When to Use Each Strategy
Three-stage retrieval works best for production systems:
- Initial fast retrieval: Return the top 50-100 results using vector search
- Rerank: Reduce to the top 10-20 using a cross-encoder
- Return: Pass the top 5-10 to the generation model
This pipeline costs more compute and latency than single-stage retrieval, but it usually produces better answers for complex queries.
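The three stages can be sketched as one function. Both scorers below are stand-ins invented for this example (word overlap for the fast stage, overlap plus an exact-phrase bonus for the reranker), and the small cutoffs are chosen only to fit the toy corpus; in practice the first would be a vector search and the second a cross-encoder, with the cutoffs listed above.

```python
def cheap_score(query: str, doc: str) -> float:
    """Stage-one stand-in: count of query words that appear in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def precise_score(query: str, doc: str) -> float:
    """Reranker stand-in: word overlap plus a bonus for the exact phrase."""
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return cheap_score(query, doc) + bonus


def retrieve(query: str, corpus: list[str], n_candidates: int, n_reranked: int, n_context: int) -> list[str]:
    # Stage 1: fast, recall-oriented retrieval over the whole corpus.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: slower, precision-oriented reranking over the candidates only.
    reranked = sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:n_reranked]
    # Stage 3: hand only the best few chunks to the generation model.
    return reranked[:n_context]


corpus = [
    "reset password steps",
    "password policy",
    "steps to reset your router",
    "how to reset password for admin account",
    "billing faq",
]
context = retrieve("reset password", corpus, n_candidates=4, n_reranked=3, n_context=2)
print(context)
```

The expensive scorer only ever sees `n_candidates` documents, which is what keeps the pipeline affordable.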
To try this out, implement hybrid search that combines the dense top 50 with the sparse BM25 top 50, merge the results by weighted score, then rerank the top 20 using a cross-encoder. Compare hit rates for exact-term queries versus semantic queries.
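For the comparison step, one simple metric is hit rate at k: the fraction of queries for which at least one relevant document appears in the top k results. The sketch below assumes you have hand-labeled relevant document ids per query; the function name, queries, ids, and rankings are toy data made up to show the calculation, not measured results.

```python
def hit_rate_at_k(results: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 10) -> float:
    """Fraction of queries with at least one relevant doc id in the top k."""
    if not results:
        return 0.0
    hits = sum(1 for query, ranked in results.items() if relevant[query] & set(ranked[:k]))
    return hits / len(results)


# Toy labels: which document ids count as relevant for each query.
relevant = {
    "model v2.3.1": {"d2"},
    "error E1042": {"d1"},
    "neural network best practices": {"d3"},
}

# Toy rankings, as a retriever might return them.
exact_term_results = {
    "model v2.3.1": ["d7", "d2"],
    "error E1042": ["d9", "d4"],
}
semantic_results = {
    "neural network best practices": ["d3", "d5"],
}

exact_rate = hit_rate_at_k(exact_term_results, relevant, k=2)
semantic_rate = hit_rate_at_k(semantic_results, relevant, k=2)
print(exact_rate, semantic_rate)
```

Running the same calculation for dense-only, sparse-only, and hybrid retrieval on each query group shows where each method falls short.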