RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RAG Systems: Part 1

12. Retrieval Strategies

Chapter 12 of 22 · 20 min
KEY INSIGHT

Hybrid search with reranking consistently outperforms any single retrieval method across diverse query types.

Retrieval is the most critical and most commonly broken part of any RAG system: a model can only generate correct answers if it receives relevant context. This chapter covers three retrieval strategies in depth: query rewriting, hybrid search, and reranking.

Query Rewriting

Before searching, rewrite the user query to match how documents are written. A query like "how do I fix error code 500" should become "HTTP 500 internal server error troubleshooting". Query rewriting improves recall when users ask vague questions.

# your_rag_library and QueryRewriter stand in for whatever rewriting
# component your stack provides; adapt the names to your own code.
from your_rag_library import QueryRewriter

rewriter = QueryRewriter(strategy="decompose")
rewritten_queries = rewriter.rewrite(
    "What machines need regular maintenance?"
)
# Illustrative output:
# ["machines needing regular maintenance",
#  "maintenance schedules for equipment",
#  "preventive maintenance requirements"]

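The decompose strategy splits compound questions into sub-queries. It suits multi-part questions such as "How do I install it and what are the system requirements?", where each sub-query retrieves a different set of relevant chunks. The single-part query in the example above has nothing to split, so the output shown is a set of paraphrased variants rather than true sub-queries.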
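To make decomposition concrete, here is a minimal rule-based sketch. The decompose function is a hypothetical helper written for this illustration, not part of any library; production rewriters usually delegate this step to an LLM, and splitting on "and" is only a rough approximation.

```python
import re


def decompose(question: str) -> list[str]:
    """Split a compound question into sub-queries at ' and ' boundaries.

    Hypothetical helper for illustration only.
    """
    # Drop trailing question marks, then split on the conjunction "and".
    body = question.strip().rstrip("?").strip()
    parts = re.split(r"\s+and\s+", body, flags=re.IGNORECASE)
    # Keep only non-empty fragments.
    return [part.strip() for part in parts if part.strip()]


sub_queries = decompose(
    "How do I install it and what are the system requirements?"
)
print(sub_queries)
# ['How do I install it', 'what are the system requirements']
```

This naive split does not resolve pronouns ("it" stays ambiguous in the first sub-query) and will wrongly split noun phrases such as "pros and cons", which is why a model-based rewriter is usually preferred.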

Hybrid Search

No single retrieval method works for all queries. Dense retrieval excels at semantic similarity ("neural network best practices") but struggles with exact identifiers ("model v2.3.1"). Sparse methods like BM25 excel at exact term matches but miss synonyms.
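The sparse side can be sketched in a few lines. The following is a simplified BM25 scorer using whitespace tokenization and common default parameters (k1=1.5, b=0.75); the function name and the sample documents are assumptions made for this illustration, and a real system would use a tested library or search engine instead.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a simplified BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact-match only: synonyms contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "release notes for model v2.3.1",
    "neural network training tips",
    "upgrading the model to a newer release",
]
scores = bm25_scores("model v2.3.1", docs)
```

The first document scores highest because it contains the exact identifier, while the second scores zero because it shares no terms with the query. That zero is the behavior to remember: BM25 has no notion of related words.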

Hybrid search combines dense and sparse scores:

# your_rag_library and HybridRetriever stand in for your own stack.
from your_rag_library import HybridRetriever

retriever = HybridRetriever()

results = retriever.search(
    query="python async await",
    top_k=10,
    # alpha: 1.0 = dense only, 0.0 = sparse only.
    # 0.6 weights semantic similarity at 60% and keyword match at 40%.
    alpha=0.6
)

Setting alpha=0.7 favors semantic matching; setting alpha=0.3 favors exact keyword matching. Tune the value against a sample of real queries from your own corpus rather than guessing. Technical documentation full of version numbers and identifiers usually benefits from lower alpha values.
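One common way to apply alpha is to normalize both score sets to the same range and take a weighted sum. The sketch below assumes min-max normalization and uses made-up document IDs and scores; the min_max and fuse helpers are hypothetical names, and other fusion schemes (for example rank-based ones) are equally valid.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5) -> list[tuple[str, float]]:
    """Combine scores: alpha=1.0 is dense only, alpha=0.0 is sparse only."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    combined = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: cosine similarities (dense) and BM25 values (sparse).
dense = {"a": 0.82, "b": 0.75, "c": 0.40}
sparse = {"b": 7.1, "c": 9.3, "d": 2.0}

semantic_first = [doc_id for doc_id, _ in fuse(dense, sparse, alpha=0.7)]
keyword_first = [doc_id for doc_id, _ in fuse(dense, sparse, alpha=0.3)]
```

With these numbers, document "b" ranks first either way because both retrievers like it, while "a" (dense only) and "c" (strong sparse match) swap places as alpha moves from 0.7 to 0.3. Normalization matters here: raw BM25 scores are unbounded and would otherwise swamp cosine similarities.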

Reranking

First-stage retrieval optimizes for speed and recall. Reranking optimizes for precision. A cross-encoder reranker takes query-document pairs and outputs relevance scores:

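from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# First stage: fast vector search.
# vector_store and query are placeholders for your own store and user
# query; search() is assumed to return a list of chunk strings.
initial_results = vector_store.search(query, top_k=50)

# Second stage: precise reranking.
# rank() returns dicts with "corpus_id" and "score", best match first.
ranked = reranker.rank(query, initial_results, top_k=10)

# Map the ranked indices back to the chunk texts.
reranked = [initial_results[hit["corpus_id"]] for hit in ranked]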

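Reranking adds latency, typically around 100-300 ms depending on the model, the hardware, and the number of candidates. Use it as a second stage when your latency budget allows. For streaming responses, rerank asynchronously: display the initial results right away and replace them once the reranked results arrive.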
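The asynchronous pattern can be sketched with asyncio. Everything here is a stand-in: slow_rerank simulates reranker latency with a sleep and scores chunks by simple term overlap, and the show callback represents whatever your UI uses to display results.

```python
import asyncio
from typing import Callable


async def slow_rerank(query: str, chunks: list[str]) -> list[str]:
    """Stand-in for a cross-encoder call; the sleep simulates latency."""
    await asyncio.sleep(0.05)
    terms = query.lower().split()
    # Toy relevance: number of query terms that appear in the chunk.
    return sorted(
        chunks,
        key=lambda chunk: sum(term in chunk.lower() for term in terms),
        reverse=True,
    )


async def answer(query: str, initial: list[str], show: Callable[[str, list[str]], None]) -> list[str]:
    # Start reranking in the background, then show first-stage results at once.
    task = asyncio.create_task(slow_rerank(query, initial))
    show("initial", initial)
    reranked = await task
    show("reranked", reranked)
    return reranked


events: list[tuple[str, str]] = []
initial = ["intro to threads", "python async await guide", "await in javascript"]
final = asyncio.run(
    answer(
        "python async await",
        initial,
        lambda stage, items: events.append((stage, items[0])),
    )
)
```

A real cross-encoder call is blocking rather than awaitable, so in practice you would likely run it in a worker thread (for example with asyncio.to_thread) so the event loop stays responsive.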

When to Use Each Strategy

Three-stage retrieval works best for production systems:

  1. Initial fast retrieval: Return top 50-100 results using vector or hybrid search
  2. Rerank: Reduce to top 10-20 using cross-encoder
  3. Return: Present top 5-10 to generation model

This pipeline costs more compute than single-stage retrieval, but it usually produces noticeably better answers for complex queries.
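The three stages above can be wired together as a small function. This is a sketch under assumptions: retrieve_pipeline, first_stage, and score_pair are hypothetical names, the first stage here just returns the head of a toy corpus, and word overlap stands in for a cross-encoder score.

```python
from typing import Callable


def retrieve_pipeline(
    query: str,
    first_stage: Callable[[str, int], list[str]],
    score_pair: Callable[[str, str], float],
    fetch_k: int = 50,
    rerank_k: int = 10,
    final_k: int = 5,
) -> list[str]:
    """Run fast retrieval, rerank the candidates, and return the top chunks."""
    # Stage 1: cheap, recall-oriented retrieval.
    candidates = first_stage(query, fetch_k)
    # Stage 2: score every (query, chunk) pair and keep the best rerank_k.
    shortlist = sorted(
        candidates, key=lambda chunk: score_pair(query, chunk), reverse=True
    )[:rerank_k]
    # Stage 3: hand only the top final_k chunks to the generation model.
    return shortlist[:final_k]


corpus = [
    "bm25 ranks by term frequency",
    "dense vectors capture meaning",
    "hybrid search mixes bm25 and dense vectors",
    "cats sleep a lot",
]


def toy_first_stage(query: str, k: int) -> list[str]:
    return corpus[:k]  # placeholder for a vector or hybrid search call


def overlap(query: str, chunk: str) -> float:
    return float(len(set(query.split()) & set(chunk.split())))


context = retrieve_pipeline(
    "bm25 dense vectors", toy_first_stage, overlap, rerank_k=3, final_k=2
)
```

Keeping the stages behind plain callables makes it easy to swap the toy functions for a real vector store and reranker, and to tune fetch_k, rerank_k, and final_k independently.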

EXERCISE

Implement hybrid search combining dense (top 50) with sparse BM25 (top 50), merge results by weighted score, then rerank top 20 using a cross-encoder. Compare hit rates for exact term queries versus semantic queries.
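For the comparison step, a hit-rate helper along these lines may be a useful starting point. It assumes you have labeled each test query with the IDs of its relevant chunks; the function name and the sample data are hypothetical.

```python
def hit_rate_at_k(
    results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    if not results:
        return 0.0
    hits = sum(
        1
        for query_id, ranked in results.items()
        if set(ranked[:k]) & relevant.get(query_id, set())
    )
    return hits / len(results)


# Hypothetical retrieval output and relevance labels for two queries.
results = {"q1": ["d3", "d1"], "q2": ["d9", "d8"]}
relevant = {"q1": {"d1"}, "q2": {"d2"}}

rate_at_2 = hit_rate_at_k(results, relevant, k=2)
rate_at_1 = hit_rate_at_k(results, relevant, k=1)
```

Running it separately on your exact-term queries and your semantic queries gives the two numbers the exercise asks you to compare.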
