02. Why Reranking Matters
Reranking is the practice of using a second-stage model to re-score and re-rank initial retrieval results. The first retrieval stage optimises for speed and recall; the reranking stage optimises for precision and relevance ordering.
The Two-Stage Retrieval Problem
Dense retrieval with k-nearest-neighbor search is a coarse operation. Each chunk is compressed into a single vector, so similarity captures overall semantic closeness rather than whether the chunk answers the specific query. When you retrieve the top 50 chunks by vector similarity, you get the chunks nearest to the query in embedding space (often only approximately nearest), not necessarily the most relevant results.
Consider this scenario: a user asks "What are the approval criteria for expense reimbursement?" Your documents contain an expense policy with multiple relevant sections: general rules, specific category limits, exception procedures, and a table of approval thresholds. Basic retrieval might rank the table header chunk highly but miss the explanatory paragraph that actually states the criteria. It might also surface a different person's or team's expense policy simply because its wording is semantically close to the query.
Rerankers solve this by computing query-document relevance scores at higher granularity. A cross-encoder takes the query and a candidate document as a pair, attending to both simultaneously, rather than comparing a query vector to pre-computed document vectors.
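The difference between independent encoding and pair scoring can be illustrated with a toy sketch. Nothing here is a real model: `encode`, `cosine`, and `pair_score` are hypothetical stand-ins written for this example. `encode` sees one text at a time (as a bi-encoder does), while `pair_score` sees query and document together and uses word adjacency, a signal that independent bag-of-words vectors cannot express.

```python
import math
from collections import Counter


def encode(text: str) -> Counter:
    """Bi-encoder stand-in: map one text to a bag-of-words vector, with no access to the other text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_score(query: str, doc: str) -> float:
    """Cross-encoder stand-in: score the (query, doc) pair jointly.

    Returns the fraction of adjacent query word pairs that are also adjacent in the document.
    """
    q, d = query.lower().split(), doc.lower().split()
    query_bigrams = list(zip(q, q[1:]))
    if not query_bigrams:
        return 0.0
    doc_bigrams = set(zip(d, d[1:]))
    return sum(bigram in doc_bigrams for bigram in query_bigrams) / len(query_bigrams)


# Two made-up documents containing exactly the same words in a different order.
query = "approval criteria"
doc_a = "manager approval criteria for expense reports"
doc_b = "expense criteria for manager approval reports"

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b)]:
    print(name, round(cosine(encode(query), encode(doc)), 3), pair_score(query, doc))
```

Both documents get the same cosine score because their word counts are identical; only the pair scorer separates them. A real cross-encoder learns far richer interactions through attention, so treat the hand-written adjacency rule purely as an illustration of what joint scoring has access to.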
Precision vs. Recall: The k Parameter Problem
In basic retrieval, k becomes a fixed guess. You choose k=10 or k=20 at pipeline design time, and it applies equally to every query.
Real queries have varying information density requirements. "Who approved the March meeting minutes?" requires one precise chunk. "What are the key themes in the Q4 financial report?" requires synthesis across many sections. "Compare the bonus structures across all departments" requires aggregating from multiple documents.
Reranking decouples recall from precision. You retrieve a large initial set (k=50, k=100, or more), then rerank to identify the truly relevant results. This gives high recall without forcing all 100 chunks into the LLM context.
Try it: run the same query with initial retrieval at k=10, k=50, and k=200. For each run, manually label the top 5 reranked results as relevant or not, then compare precision across the three runs to see how initial recall affects final precision.
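One way to score that experiment is precision@5 over your manual labels. The sketch below assumes each chunk has a string ID; the IDs and labels shown are placeholders that only demonstrate the shape of the data and carry no empirical meaning. `precision_at_n` and `reranked_by_k` are names invented for this example.

```python
from typing import Dict, List, Set


def precision_at_n(ranked_ids: List[str], relevant_ids: Set[str], n: int = 5) -> float:
    """Fraction of the top-n ranked chunk IDs that were labelled relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for chunk_id in ranked_ids[:n] if chunk_id in relevant_ids)
    return hits / n


# Placeholder data: replace with your own manual labels and reranker output.
relevant: Set[str] = {"c3", "c7", "c21", "c35", "c112"}
reranked_by_k: Dict[int, List[str]] = {
    10: ["c3", "c7", "c1", "c9", "c4"],        # top 5 after reranking a k=10 candidate set
    50: ["c3", "c21", "c7", "c35", "c1"],      # top 5 after reranking a k=50 candidate set
    200: ["c3", "c21", "c112", "c7", "c35"],   # top 5 after reranking a k=200 candidate set
}

for k, top_ids in reranked_by_k.items():
    print(f"k={k}: precision@5 = {precision_at_n(top_ids, relevant):.2f}")
```

Dividing by `n` rather than by the number of results returned means a run that yields fewer than five chunks is penalised, which is usually the behaviour you want when comparing different k settings.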
| Approach | Initial Recall | Final Precision | k Parameter |
|---|---|---|---|
| Basic Retrieval | Fixed at retrieval time | Unknown until evaluation | Must guess |
| Retrieval + Reranking | High (large initial set) | High (intelligent filtering) | Post-retrieval decision |
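The "post-retrieval decision" in the last column could be implemented as a score cutoff applied after reranking. The sketch below is one possible approach, not a prescribed one: `select_context`, the `min_score` threshold of 0.5, and the example scores are all assumptions, and a threshold only makes sense if your reranker's scores fall in a comparable range across queries.

```python
from typing import List, Tuple


def select_context(
    scored_chunks: List[Tuple[str, float]],
    min_score: float = 0.5,   # assumed threshold; tune on your own data
    max_chunks: int = 20,     # hard cap to protect the LLM context window
) -> List[str]:
    """Keep the highest-scoring reranked chunks that clear the threshold, up to max_chunks."""
    ranked = sorted(scored_chunks, key=lambda item: item[1], reverse=True)
    return [chunk for chunk, score in ranked[:max_chunks] if score >= min_score]


# Made-up reranker scores for a narrow query and a broad query.
narrow = [("minutes approval", 0.97), ("agenda", 0.12), ("attendee list", 0.08)]
broad = [("revenue", 0.91), ("costs", 0.88), ("outlook", 0.74), ("risks", 0.69), ("footnotes", 0.21)]

print(len(select_context(narrow)), len(select_context(broad)))
```

With these illustrative scores the narrow query passes one chunk to the LLM and the broad query passes four, so the effective k varies per query instead of being fixed at design time.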
Learning-to-Rank vs. Cross-Encoders
Two reranking approaches are common. Learning-to-rank (LTR) models are trained on labeled query-document pairs to predict relevance scores. They require labeled training data for your own task, which is expensive to produce. Pretrained cross-encoders are simpler to adopt: given any query-document pair, they output a relevance score without additional task-specific training (though fine-tuning helps).
Cross-encoders are more practical for most RAG systems. They work out-of-the-box on arbitrary query-document pairs, including queries they never saw during training. The tradeoff is inference cost: cross-encoders are slower because they process query and document together rather than comparing pre-computed vectors.
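The cost difference comes down to how much model work happens at query time. The following back-of-the-envelope sketch counts that work under simplifying assumptions: it ignores vector search, tokenisation, and sequence length, and the default `batch_size` of 32 is an arbitrary illustrative value. `query_time_model_calls` is a hypothetical helper.

```python
import math
from typing import Dict


def query_time_model_calls(num_candidates: int, batch_size: int = 32) -> Dict[str, int]:
    """Count model work per query for each approach (model inference only)."""
    if num_candidates < 0 or batch_size <= 0:
        raise ValueError("num_candidates must be >= 0 and batch_size must be > 0")
    return {
        # Bi-encoder: only the query is encoded; document vectors were pre-computed at index time.
        "bi_encoder_texts_encoded": 1,
        # Cross-encoder: every (query, candidate) pair goes through the model.
        "cross_encoder_pairs_scored": num_candidates,
        # Pairs can be batched, but the number of batches still grows with the candidate count.
        "cross_encoder_batches": math.ceil(num_candidates / batch_size),
    }


for k in (10, 50, 100):
    print(k, query_time_model_calls(k))
```

The bi-encoder's query-time model cost stays constant as the candidate set grows, while the cross-encoder's grows linearly, which is why it is applied only to a shortlist.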
Common Reranking Architecture
Query → Embedding Model → Vector DB (k=100 retrieval)
↓
Candidate Chunks
↓
Cross-Encoder Reranker
↓
Top-20 Reranked Chunks
↓
LLM Context Window
↓
Generated Answer
The cross-encoder processes the query and the document as a single input, so attention operates across query tokens and document tokens together. This lets it judge whether specific document terms are relevant to specific query terms, which a bi-encoder cannot do because it encodes the query and the document separately.
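The flow in the diagram can be sketched as a small function with pluggable scorers. This is a minimal sketch under stated assumptions: the brute-force sort in stage 1 stands in for a vector database query, and `word_overlap` and `phrase_aware` are toy scorers invented for this example in place of a real embedding model and cross-encoder.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    chunks: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    retrieve_k: int = 100,
    final_n: int = 20,
) -> List[str]:
    """Two-stage retrieval: cheap scorer selects candidates, expensive scorer orders them."""
    # Stage 1: score every chunk cheaply and keep the top retrieve_k candidates.
    candidates = sorted(chunks, key=lambda c: first_stage(query, c), reverse=True)[:retrieve_k]
    # Stage 2: re-score only the candidates with the more expensive scorer.
    reranked = sorted(candidates, key=lambda c: reranker(query, c), reverse=True)
    # Only the best final_n chunks go on to the LLM context window.
    return reranked[:final_n]


def word_overlap(query: str, chunk: str) -> float:
    """Toy first-stage scorer: number of distinct words shared by query and chunk."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_aware(query: str, chunk: str) -> float:
    """Toy reranker: overlap fraction plus a bonus when the whole query phrase appears."""
    bonus = 1.0 if query.lower() in chunk.lower() else 0.0
    return word_overlap(query, chunk) / max(1, len(query.split())) + bonus


chunks = [
    "Approval thresholds table",
    "Expense reimbursement approval criteria: manager sign-off is required",
    "Travel booking guidelines",
    "Reimbursement is paid monthly",
]
top = retrieve_then_rerank(
    "expense reimbursement approval criteria",
    chunks, word_overlap, phrase_aware, retrieve_k=3, final_n=2,
)
print(top)
```

Keeping the two scorers as parameters means a real embedding search and a real cross-encoder can be swapped in later without changing the control flow.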
Key Insight: Reranking trades computation for accuracy. Retrieve more than you need, then use a slower but more accurate scorer to select the best results.