RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RAG Systems: Part 2

02. Why Reranking Matters

Chapter 2 of 22 · 20 min
KEY INSIGHT

Reranking trades computation for accuracy: retrieve more than needed, then use a computationally expensive but more accurate scorer to select the best results.

Reranking is the practice of using a second-stage model to re-score and re-rank initial retrieval results. The first retrieval stage optimises for speed and recall; the reranking stage optimises for precision and relevance ordering.

The Two-Stage Retrieval Problem

Dense retrieval with k-nearest neighbors is a coarse operation. Each chunk is compressed into a single vector, so vector similarity captures broad semantic closeness rather than whether the chunk answers the specific query. When you retrieve the top-50 chunks by vector similarity, you get approximate nearest neighbors in embedding space, which are not necessarily the most relevant results.

Consider this scenario: a user asks "What are the approval criteria for expense reimbursement?" Your documents contain an expense policy with multiple relevant sections: general rules, specific category limits, exception procedures, and a table of approval thresholds. Basic retrieval might rank the table header chunk highly but miss the explanatory paragraph. It might also retrieve a different person's expense policy because that document happens to sit close to the query in embedding space.

Rerankers address this by scoring query-document relevance at a finer granularity. A cross-encoder takes the query and a candidate document as a single input pair and attends to both at once, rather than comparing a query vector to pre-computed document vectors.
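The difference between the two scoring interfaces can be sketched in a few lines. This is a toy illustration, not a real model: `embed`, `cosine`, and `pair_score` are hypothetical stand-ins (bag-of-words counts and term overlap) chosen only so the example runs with the standard library. What it shows is the shape of the computation: document vectors are built once, ahead of time, while the pair scorer must see the query and each document together at query time.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pair_score(query, doc):
    """Cross-encoder-style interface: reads query and document together.
    Toy logic (an assumption): fraction of query terms found in the document."""
    q_terms = re.findall(r"\w+", query.lower())
    d_terms = set(re.findall(r"\w+", doc.lower()))
    if not q_terms:
        return 0.0
    return sum(term in d_terms for term in q_terms) / len(q_terms)

docs = [
    "Approval thresholds table",
    "Expense reimbursement requires manager approval above the category limit",
]

# Bi-encoder style: document vectors are computed once, before any query arrives.
doc_vectors = [embed(d) for d in docs]

query = "approval criteria for expense reimbursement"
query_vector = embed(query)
first_stage_scores = [cosine(query_vector, v) for v in doc_vectors]

# Cross-encoder style: one scoring call per (query, document) pair at query time.
second_stage_scores = [pair_score(query, d) for d in docs]
```

In a real system the first function pair would be an embedding model plus a vector index, and `pair_score` would be a forward pass through a cross-encoder.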

Precision vs. Recall: The k Parameter Problem

In basic retrieval, k becomes a fixed guess. You choose k=10 or k=20 at pipeline design time, and it applies equally to every query.

Real queries have varying information density requirements. "Who approved the March meeting minutes?" requires one precise chunk. "What are the key themes in the Q4 financial report?" requires synthesis across many sections. "Compare the bonus structures across all departments" requires aggregating from multiple documents.

Reranking decouples recall from precision. You retrieve a large initial set (k=50, k=100, or more), then rerank to identify the truly relevant results. This gives high recall without forcing all 100 chunks into the LLM context.
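One way to act on this decoupling is to decide how many chunks to keep only after reranking, based on the reranker's scores. The sketch below assumes scores in the range 0 to 1; the function name `select_after_rerank`, the `min_score` threshold, the `max_results` cap, and the example scores are all hypothetical values for illustration.

```python
def select_after_rerank(scored_chunks, min_score=0.5, max_results=20):
    """Keep reranked chunks that clear a score threshold, best first, up to a cap.

    scored_chunks: list of (chunk_text, relevance_score) from the reranker.
    min_score and max_results are illustrative defaults, not recommended values.
    """
    ranked = sorted(scored_chunks, key=lambda item: item[1], reverse=True)
    kept = [(chunk, score) for chunk, score in ranked if score >= min_score]
    return kept[:max_results]

# Made-up scores: a precise query where a single chunk stands out...
precise_query_scores = [("minutes approval line", 0.93), ("agenda", 0.12), ("attendee list", 0.08)]
# ...and a broad query where several chunks are relevant.
broad_query_scores = [("revenue summary", 0.81), ("cost drivers", 0.77),
                      ("outlook", 0.64), ("legal boilerplate", 0.10)]

print(len(select_after_rerank(precise_query_scores)))  # 1
print(len(select_after_rerank(broad_query_scores)))    # 3
```

The same pipeline passes one chunk to the LLM for the precise query and three for the broad one, without changing the initial retrieval size.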

EXERCISE

Run the same query with k=10, k=50, and k=200 initial retrieval. For each, manually label the top 5 reranked results as relevant or not. Measure how initial recall affects final precision.
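For the measurement step, two small helpers are enough. This is a minimal sketch: the function names, the chunk IDs, and the label list are placeholders, and your own manual labels replace them.

```python
def precision_at_k(labels, k=5):
    """Fraction of the top-k reranked results labeled relevant.

    labels: list of booleans in rank order (True = relevant).
    Divides by k, so missing results count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for is_relevant in labels[:k] if is_relevant) / k

def candidate_recall(candidate_ids, relevant_ids):
    """Fraction of known-relevant chunks present in the initial candidate set."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(candidate_ids)) / len(relevant)

# Placeholder labels for one run; substitute your own judgments.
example_labels = [True, False, True, True, False]
print(precision_at_k(example_labels))  # 0.6

# Placeholder chunk IDs: two of the three relevant chunks were retrieved.
print(candidate_recall(["c1", "c2", "c7"], ["c2", "c7", "c9"]))
```

Computing both numbers for each initial k shows whether a low final precision comes from the reranker or from relevant chunks never reaching it.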

Approach              | Initial Recall           | Final Precision              | k Parameter
Basic Retrieval       | Fixed at retrieval time  | Unknown until evaluation     | Must guess
Retrieval + Reranking | High (large initial set) | High (intelligent filtering) | Post-retrieval decision

Learning-to-Rank vs. Cross-Encoders

Two common reranking approaches are learning-to-rank and cross-encoders. Learning-to-rank (LTR) models are trained on labeled query-document pairs to predict relevance scores. They require pre-labeled training data, which is expensive to produce. Cross-encoders are simpler to adopt: a pretrained cross-encoder takes any query-document pair and outputs a relevance score without further task-specific training (though fine-tuning helps).

Cross-encoders are more practical for most RAG systems. They work out-of-the-box on arbitrary query-document pairs, including queries they never saw during training. The tradeoff is inference cost: cross-encoders are slower because they process query and document together rather than comparing pre-computed vectors.
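The inference cost grows with the number of candidates, because each candidate needs its own forward pass with the query. A rough estimate, assuming candidates are scored in sequential batches with a fixed latency per batch (both simplifications), might look like this. The function name and the numbers in the example are hypothetical, not measurements.

```python
import math

def estimate_rerank_latency_ms(num_candidates, batch_size, ms_per_batch):
    """Rough reranking latency: number of batches times latency per batch.

    Assumes batches run one after another and each takes the same time.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = math.ceil(num_candidates / batch_size)
    return batches * ms_per_batch

# Illustrative inputs only: batch size 32, 40 ms per batch.
for k in (10, 100, 200):
    print(k, estimate_rerank_latency_ms(k, batch_size=32, ms_per_batch=40))
# 10 40
# 100 160
# 200 280
```

Measuring `ms_per_batch` on your own hardware and model turns this into a usable budget for choosing the initial retrieval size.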

Common Reranking Architecture

Query → Embedding Model → Vector DB (k=100 retrieval)
                              ↓
                        Candidate Chunks
                              ↓
                      Cross-Encoder Reranker
                              ↓
                        Top-20 Reranked Chunks
                              ↓
                        LLM Context Window
                              ↓
                         Generated Answer

The cross-encoder computes full cross-attention between query tokens and document tokens. This lets it judge whether specific document terms are relevant to specific query terms, which a bi-encoder cannot do because it encodes the query and document separately.
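The architecture above can be sketched as a function with two pluggable scorers. This is a structural sketch under stated assumptions: `retrieve_then_rerank`, `overlap_count`, and `overlap_fraction` are hypothetical names, both scorers are toy term-overlap functions standing in for a vector search and a cross-encoder, and a real first stage would query a vector index rather than score every chunk.

```python
import re

def tokens(text):
    """Lowercased word set; a toy tokenizer for the stand-in scorers."""
    return set(re.findall(r"\w+", text.lower()))

def overlap_count(query, chunk):
    """Toy first-stage scorer (stand-in for vector similarity)."""
    return len(tokens(query) & tokens(chunk))

def overlap_fraction(query, chunk):
    """Toy second-stage scorer (stand-in for a cross-encoder)."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

def retrieve_then_rerank(query, chunks, first_stage_score, rerank_score, k=100, top_n=20):
    """Two-stage retrieval: cheap scorer picks k candidates, expensive scorer orders them."""
    # Stage 1: optimise for recall with the cheap scorer.
    candidates = sorted(chunks, key=lambda c: first_stage_score(query, c), reverse=True)[:k]
    # Stage 2: the expensive scorer runs only on the k candidates.
    reranked = sorted(
        ((c, rerank_score(query, c)) for c in candidates),
        key=lambda item: item[1],
        reverse=True,
    )
    # Only the top_n reranked chunks go into the LLM context.
    return reranked[:top_n]

chunks = [
    "Approval thresholds table header",
    "Expenses above 500 USD require manager approval before reimbursement",
    "Holiday schedule for 2024",
]
query = "approval criteria for expense reimbursement"
result = retrieve_then_rerank(query, chunks, overlap_count, overlap_fraction, k=2, top_n=1)
print(result[0][0])
```

Swapping in a real embedding search and a real cross-encoder changes only the two scorer arguments; `k` and `top_n` stay independent knobs for recall and context size.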
