02. Why Reranking Matters

Chapter 2 of 22 · 20 min

Reranking is the practice of using a second-stage model to re-score and re-rank initial retrieval results. The first retrieval stage optimises for speed and recall; the reranking stage optimises for precision and relevance ordering.

The Two-Stage Retrieval Problem

Dense retrieval with k-nearest-neighbor search is a coarse operation. Each chunk is compressed into a single vector before any query is known, so vector similarity captures broad semantic overlap rather than fit to a specific query's intent. When you retrieve the top-50 chunks by vector similarity, you get the (often approximate) nearest neighbors in embedding space, which are not necessarily the most relevant results.

Consider this scenario: a user asks "What are the approval criteria for expense reimbursement?" Your documents contain an expense policy with multiple relevant sections: general rules, specific category limits, exception procedures, and a table of approval thresholds. Basic retrieval might rank the table header chunk highly but miss the explanatory paragraph. It might also retrieve an expense policy that applies to someone else entirely, simply because that chunk's wording sits close to the query in embedding space.

Rerankers address this by computing query-document relevance scores at finer granularity. A cross-encoder takes the query and a candidate document as a single input pair and attends to both at once, rather than comparing a query vector to pre-computed document vectors.
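
A minimal sketch of the interface difference, in plain Python. Both scorers are toy stand-ins, not real models: bag-of-words cosine plays the role of the bi-encoder, and query-term coverage plays the role of the cross-encoder. The names `embed`, `bi_encoder_scores`, and `pair_score` are hypothetical. The only point is what each scorer gets to see: the first compares vectors built without knowledge of the query, while the second receives the query and the document together.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "approval thresholds table",
    "expense reimbursement requires manager approval above the category limit",
]

# Bi-encoder style: documents are encoded once, offline, before any query exists.
doc_vectors = [embed(d) for d in docs]

def bi_encoder_scores(query: str) -> list[float]:
    q = embed(query)  # the query is encoded on its own
    return [cosine(q, v) for v in doc_vectors]

# Cross-encoder style: the scorer receives the (query, document) pair together,
# so nothing can be precomputed per document.
def pair_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: share of query tokens found in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0

query = "expense reimbursement approval"
print(bi_encoder_scores(query))
print([pair_score(query, d) for d in docs])
```

In a real system the pair scorer would be a trained model; the stand-in only mirrors its call shape, one call per (query, document) pair.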

Precision vs. Recall: The k Parameter Problem

In basic retrieval, k becomes a fixed guess. You choose k=10 or k=20 at pipeline design time, and it applies equally to every query.

Real queries have varying information density requirements. "Who approved the March meeting minutes?" requires one precise chunk. "What are the key themes in the Q4 financial report?" requires synthesis across many sections. "Compare the bonus structures across all departments" requires aggregating from multiple documents.

Reranking decouples recall from precision. You retrieve a large initial set (k=50, k=100, or more), then rerank to identify the truly relevant results. This gives high recall without forcing all 100 chunks into the LLM context.
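
One way to make the cutoff a post-retrieval decision is to keep whatever clears a score threshold, up to a cap. The sketch below assumes a hypothetical `score_fn(query, doc)` that returns higher values for more relevant chunks; the `min_score=0.5` threshold and the toy scores are illustrative placeholders, and real reranker scores may need calibration before a fixed threshold is meaningful.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    min_score: float = 0.5,
    max_results: int = 20,
) -> list[tuple[str, float]]:
    """Score every candidate against the query, then keep only the strong ones."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # The cutoff is applied after scoring, so a narrow question may keep
    # one chunk while a broad question keeps many.
    return [(doc, s) for doc, s in scored if s >= min_score][:max_results]

# Hypothetical scores standing in for a real reranker's output.
toy_scores = {
    "March minutes: approved by the finance lead": 0.93,
    "April meeting agenda": 0.21,
    "Office parking map": 0.05,
}

def toy_score_fn(query: str, doc: str) -> float:
    return toy_scores[doc]

kept = rerank("Who approved the March meeting minutes?", list(toy_scores), toy_score_fn)
print(kept)  # only the first chunk clears the 0.5 threshold
```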

EXERCISE

Run the same query with k=10, k=50, and k=200 initial retrieval. For each, manually label the top 5 reranked results as relevant or not. Measure how initial recall affects final precision.
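
A small helper for the bookkeeping in this exercise, as a sketch. The labels below are made-up placeholders that show the data shape, not measurements; replace them with your own manual judgments.

```python
def precision_at_k(labels: list[bool], k: int = 5) -> float:
    """Share of the top-k reranked results that were labeled relevant."""
    top = labels[:k]
    return sum(top) / len(top) if top else 0.0

# Placeholder labels (True = relevant) for the top 5 reranked results,
# keyed by the initial retrieval k. Substitute your own judgments here.
manual_labels = {
    10: [True, False, True, False, False],
    50: [True, True, True, False, True],
    200: [True, True, True, True, False],
}

for initial_k, labels in manual_labels.items():
    print(f"initial k={initial_k}: precision@5 = {precision_at_k(labels):.2f}")
```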

Approach	Initial Recall	Final Precision	k Parameter
Basic Retrieval	Fixed at retrieval time	Unknown until evaluation	Must guess
Retrieval + Reranking	High (large initial set)	High (intelligent filtering)	Post-retrieval decision

Learning-to-Rank vs. Cross-Encoders

Two reranking approaches are common. Learning-to-rank (LTR) models are trained on labeled query-document pairs to predict relevance scores. They require labeled training data from your own domain, which is expensive to produce. Cross-encoders are simpler to adopt: a pretrained cross-encoder takes any query-document pair and outputs a relevance score without further task-specific training (though fine-tuning on your domain helps).

Cross-encoders are more practical for most RAG systems. They work out-of-the-box on arbitrary query-document pairs, including queries they never saw during training. The tradeoff is inference cost: cross-encoders are slower because they process query and document together, so every candidate needs its own model forward pass at query time, whereas a bi-encoder only encodes the query and compares it to pre-computed vectors.
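
The cost difference can be made concrete by counting how many sequences each design must push through a model at query time. This is a rough proxy only, under the assumption that document vectors are already indexed; `query_time_cost` and the `batch_size` default are hypothetical, and actual latency depends on the model, hardware, and sequence lengths.

```python
import math

def query_time_cost(num_candidates: int, batch_size: int = 32) -> dict[str, int]:
    """Count model inputs needed at query time for each design."""
    return {
        # Bi-encoder: encode the query once; document vectors already exist.
        "bi_encoder_sequences": 1,
        # Cross-encoder: every (query, document) pair is a separate model input.
        "cross_encoder_sequences": num_candidates,
        # Pairs can be grouped into batches to use the hardware more efficiently.
        "cross_encoder_batches": math.ceil(num_candidates / batch_size),
    }

print(query_time_cost(100))
```

The count grows linearly with the size of the candidate set, which is why the reranker is applied to the retrieved candidates rather than to the whole corpus.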

Common Reranking Architecture

Query → Embedding Model → Vector DB (k=100 retrieval)
                              ↓
                        Candidate Chunks
                              ↓
                      Cross-Encoder Reranker
                              ↓
                        Top-20 Reranked Chunks
                              ↓
                        LLM Context Window
                              ↓
                         Generated Answer
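
The flow in the diagram can be sketched as a single function with pluggable stages. The `Retriever` and `Scorer` type aliases, the `answer_context` function, and the stub stages are hypothetical names for illustration; in practice the retriever would wrap a vector database query and the scorer would wrap a cross-encoder.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]  # (query, k) -> candidate chunks
Scorer = Callable[[str, str], float]         # (query, chunk) -> relevance score

def answer_context(
    query: str,
    retrieve: Retriever,
    score: Scorer,
    retrieve_k: int = 100,
    keep_n: int = 20,
) -> str:
    """Stage 1: wide retrieval. Stage 2: rerank. Output: text for the LLM prompt."""
    candidates = retrieve(query, retrieve_k)
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return "\n\n".join(ranked[:keep_n])

# Stub stages so the sketch runs without a vector database or a model.
corpus = [
    "chunk about travel limits",
    "chunk about approval thresholds",
    "chunk about parking",
]

def stub_retrieve(query: str, k: int) -> list[str]:
    return corpus[:k]

def stub_score(query: str, chunk: str) -> float:
    return float("approval" in chunk)

context = answer_context("approval criteria", stub_retrieve, stub_score, retrieve_k=3, keep_n=1)
print(context)
```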

The cross-encoder runs full attention over the concatenated query and document tokens, so every query token can attend to every document token. This lets it judge whether a document term is relevant to the query in context, which the separate encoding used in bi-encoder retrieval cannot do.

Key Insight: Reranking trades computation for accuracy. Retrieve more than needed, then use a computationally expensive but more accurate scorer to select the best results.