How to Implement Cascade Reranking for Efficiency
Prerequisite: at least two reranking models of differing computational cost are available.
What this does
Cascade reranking applies progressively more computationally expensive reranking models in stages, reserving the heaviest models for only the most promising candidates. The approach dramatically reduces latency and compute cost compared to running a single heavy reranker across the entire initial retrieval set. A lightweight model filters the top-k candidates from a larger pool, and only those survivors advance to the next stage where a stronger model re-evaluates them.
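The saving can be made concrete with a back-of-the-envelope cost model. The sketch below is illustrative only: the function names are hypothetical, costs are in arbitrary units per scored document, and the 20x cost ratio between the heavy and light models is an assumption, not a measurement.

```python
def cascade_cost(pool_size: int, shortlist_size: int,
                 light_cost: float, heavy_cost: float) -> float:
    """Total scoring cost of a two-stage cascade, in arbitrary cost units."""
    if not 0 <= shortlist_size <= pool_size:
        raise ValueError("shortlist_size must be between 0 and pool_size")
    # Stage one scores the whole pool; stage two scores only the survivors.
    return pool_size * light_cost + shortlist_size * heavy_cost


def single_pass_cost(pool_size: int, heavy_cost: float) -> float:
    """Cost of running the heavy reranker over the entire pool."""
    return pool_size * heavy_cost


# Assumed numbers: 80-document pool, 15 survivors, heavy model 20x the light one.
cascade = cascade_cost(pool_size=80, shortlist_size=15, light_cost=1.0, heavy_cost=20.0)
baseline = single_pass_cost(pool_size=80, heavy_cost=20.0)
print(cascade, baseline)  # 380.0 1600.0
```

Under these assumed numbers the cascade does roughly a quarter of the single-pass work. The advantage shrinks as the cost gap between the two models narrows or as the shortlist grows.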
Steps
Configure initial retrieval to return an oversized candidate pool of 50-100 documents from your vector store. Use a fast ANN index for this retrieval pass.
Apply the lightweight reranker first. Score every candidate in the pool using your smallest model. Sort by score and keep the top 10-20 candidates.
Apply the heavyweight reranker second. Run your strongest cross-encoder reranker only on the shortlist from step 2. Produce the final ranked ordering.
Tune the number of candidates that stage one passes to stage two based on recall testing. Too small a shortlist risks dropping relevant documents; too large a shortlist wastes compute on the heavy model.
Cache stage-one scores when possible. A reranker score depends on both the query and the document, so key the cache on the (query, document) pair; a cached score can be reused only when the same query meets the same document again.
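The steps above can be sketched as a small two-stage function. This is a minimal sketch, not a reference implementation: `rerank`, `cascade_rerank`, `toy_light`, and `toy_heavy` are hypothetical names, and the two toy scorers are term-overlap stand-ins for real models so that the example runs without any model downloads. In practice each scorer would wrap a call to your own lightweight or cross-encoder reranker.

```python
from functools import lru_cache
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, document) to a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def rerank(query: str, docs: Sequence[str], scorer: Scorer,
           keep: int) -> List[Tuple[str, float]]:
    """Score every document and return the top `keep`, best first."""
    scored = [(doc, scorer(query, doc)) for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def cascade_rerank(query: str, pool: Sequence[str], light: Scorer, heavy: Scorer,
                   shortlist_size: int = 15, final_k: int = 10) -> List[Tuple[str, float]]:
    """Stage one filters the pool with `light`; stage two reorders survivors with `heavy`."""
    shortlist = [doc for doc, _ in rerank(query, pool, light, shortlist_size)]
    return rerank(query, shortlist, heavy, final_k)


@lru_cache(maxsize=100_000)  # memoise stage-one scores per (query, document) pair
def toy_light(query: str, doc: str) -> float:
    """Cheap stand-in: number of distinct query terms present in the document."""
    query_terms = set(query.lower().split())
    return float(len(query_terms & set(doc.lower().split())))


def toy_heavy(query: str, doc: str) -> float:
    """Costlier stand-in: fraction of document words that are query terms."""
    query_terms = set(query.lower().split())
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(word in query_terms for word in words) / len(words)


pool = [
    "cascade reranking saves compute",
    "reranking with cross encoders",
    "cascade reranking",
    "vector search basics",
]
final = cascade_rerank("cascade reranking", pool, toy_light, toy_heavy,
                       shortlist_size=3, final_k=2)
for doc, score in final:
    print(f"{score:.2f}  {doc}")
```

Only the three stage-one survivors reach `toy_heavy`, and the shortlist is chosen purely by stage-one rank, so the two models' scores never need to share a scale.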
Verification
Run your pipeline on a representative query set and measure end-to-end latency:
python evaluate_cascade.py --queries test_queries.jsonl \
--stage1-model models/reranker-light \
--stage2-model models/reranker-heavy \
--pool-size 80 \
--final-k 10
Expected output: Final ranked list of 10 documents with MRR@10 score and per-stage latency. Heavy-reranker latency should cover only the shortlist that survives stage one (e.g., ~0.3s), not the full 80-document pool.
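The internals of `evaluate_cascade.py` are not shown here, so the following is an independent sketch of the two measurements the expected output mentions: MRR@10 and per-stage latency. The helper names `mrr_at_k` and `timed` and the document IDs are hypothetical.

```python
import time
from typing import Any, Callable, List, Sequence, Set, Tuple


def mrr_at_k(ranked_lists: Sequence[Sequence[str]],
             relevant_sets: Sequence[Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    if len(ranked_lists) != len(relevant_sets):
        raise ValueError("need one relevant set per ranked list")
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run `fn` and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Three toy queries: first relevant hit at rank 2, at rank 1, and not found.
ranked: List[List[str]] = [["d3", "d1", "d7"], ["d2", "d9"], ["d5", "d6"]]
relevant: List[Set[str]] = [{"d1"}, {"d2"}, {"d0"}]
print(mrr_at_k(ranked, relevant, k=10))  # 0.5

# Wrap each stage separately to get per-stage latency; `sorted` is a placeholder.
result, seconds = timed(sorted, [3, 1, 2])
```

Timing each stage with its own `timed` call, rather than timing the whole pipeline once, makes it possible to confirm that the heavy stage's share of latency tracks the shortlist size.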
Common failures
- Stage-one pool too small: Relevant documents are dropped before the heavy reranker sees them, causing recall loss. Retrieve a larger initial pool or keep more candidates after stage one.
- Model quality gap too wide: The lightweight reranker is so weak it misranks everything, so the heavy reranker cannot recover. Validate that stage one alone achieves reasonable recall.
- Latency regression: Adding a second stage backfires if the lightweight model is not actually fast. Profile each stage independently and ensure their combined latency is below a single-pass baseline.
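- Inconsistent scoring between stages: Scores from different models are not on a common scale, so comparing or blending them across stages, or reusing one stage's score threshold in another, leads to poor shortlist selection. Select the shortlist by stage-one rank alone.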
- Version mismatch: The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift: Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
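Two of the failures above, a shortlist that is too small and a stage-one model that is too weak, both show up as relevant documents missing from the shortlist. A minimal sketch for measuring this, assuming you have stage-one rankings and labelled relevant documents per query (the function name and the toy data are hypothetical):

```python
from typing import Dict, List, Sequence, Set


def shortlist_recall(stage_one_rankings: Sequence[Sequence[str]],
                     relevant_sets: Sequence[Set[str]],
                     shortlist_size: int) -> float:
    """Fraction of all relevant documents that survive the stage-one cut."""
    found = 0
    total = 0
    for ranking, relevant in zip(stage_one_rankings, relevant_sets):
        survivors = set(ranking[:shortlist_size])
        found += len(survivors & relevant)
        total += len(relevant)
    return found / total if total else 0.0


# Toy data: stage-one orderings for two queries and their labelled relevant docs.
stage_one_rankings: List[List[str]] = [
    ["d1", "d4", "d2", "d9"],
    ["d7", "d3", "d8", "d5"],
]
relevant_sets: List[Set[str]] = [{"d2", "d4"}, {"d5"}]

# Sweep candidate shortlist sizes to see where recall stops improving.
recall_by_size: Dict[int, float] = {
    size: shortlist_recall(stage_one_rankings, relevant_sets, size)
    for size in (2, 3, 4)
}
for size, recall in recall_by_size.items():
    print(f"shortlist={size}  recall={recall:.2f}")
```

If recall is low even at generous shortlist sizes, the stage-one model itself is the likely bottleneck. If recall climbs steadily with size, the shortlist is simply too small.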
Related guides
- fine-tune-reranking-models-domain
- optimize-vector-search-query-performance