HOW-TO · RAG

How to Set Up Cross-Encoder Reranking

Intermediate · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

sentence-transformers with cross-encoder models installed

What this does

A cross-encoder reranker takes a query and a candidate document together as a paired input and outputs a relevance score. Unlike bi-encoders, which embed the query and the document independently and then compare the two vectors, cross-encoders process both texts jointly. This produces more accurate relevance signals at the cost of slower inference, because every (query, document) pair needs its own forward pass. In a two-stage retrieval pipeline, you first retrieve a broad set of candidates using fast vector similarity, then rerank them with a cross-encoder to surface the most relevant results.

Steps

  1. Select a cross-encoder model from the Hugging Face model hub (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) and initialize a CrossEncoder instance.
  2. Create a candidate list by running your initial vector search and collecting the top-N results (N=50–200 is typical).
  3. Build a list of (query, document_text) pairs from the candidate list.
  4. Pass the pair list to the cross-encoder's predict() method to generate relevance scores for all candidates in a single call.
  5. Sort the candidate list by cross-encoder scores in descending order.
  6. Return the top-K reranked results (K=5–10) as your final retrieval output.
  7. Optionally fuse the cross-encoder ranking with the original vector-search ranking using Reciprocal Rank Fusion, which combines rank positions rather than raw scores, to produce a hybrid ranking.
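
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names load_cross_encoder_scorer, rerank, and reciprocal_rank_fusion are hypothetical, candidates are assumed to be (doc_id, text) tuples in vector-search order, and k=60 in the fusion helper is a commonly used default rather than a value this guide prescribes. CrossEncoder and predict() come from sentence-transformers; the import is deferred so the ranking helpers can also be exercised with a stand-in scorer.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Candidate = Tuple[str, str]  # (doc_id, document_text): an assumed shape
ScoreFn = Callable[[List[Tuple[str, str]]], List[float]]


def load_cross_encoder_scorer(
    model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
) -> ScoreFn:
    """Step 1: wrap a sentence-transformers CrossEncoder as a scoring function."""
    from sentence_transformers import CrossEncoder  # deferred import

    model = CrossEncoder(model_name)

    def score(pairs: List[Tuple[str, str]]) -> List[float]:
        # Step 4: one predict() call scores every (query, document) pair.
        return [float(s) for s in model.predict(pairs)]

    return score


def rerank(
    query: str,
    candidates: Sequence[Candidate],
    score_fn: ScoreFn,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Steps 3-6: build pairs, score them, sort descending, keep the top K."""
    if not candidates:
        return []
    pairs = [(query, text) for _, text in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(doc_id, score) for (doc_id, _), score in ranked[:top_k]]


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Step 7: fuse several ranked lists of doc ids using 1 / (k + rank)."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)
```

With sentence-transformers installed, rerank(query, candidates, load_cross_encoder_scorer()) returns the top-K (doc_id, score) pairs. For the hybrid ranking, pass the vector-search order and the reranked order of doc ids to reciprocal_rank_fusion.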

Verification

Query your pipeline with a test question and inspect the reranked output. The top reranked result should be at least as relevant to the query as the top result from the initial vector search. Log the cross-encoder scores alongside the original rankings to confirm that any reordering came from the joint scoring.

Expected output: Reranked results: doc_42 (score: 0.94), doc_17 (score: 0.87), doc_3 (score: 0.81). Original vector rankings: doc_17, doc_42, doc_3. Reranking reordered doc_42 to first position.
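
One way to log that comparison is sketched below. The helper name rank_changes and its input shapes are assumptions made for illustration; the sample values are the ones from the expected output above.

```python
from typing import List, Optional, Sequence, Tuple


def rank_changes(
    original_ids: Sequence[str],
    reranked: Sequence[Tuple[str, float]],
) -> List[Tuple[str, Optional[int], int, float]]:
    """Return (doc_id, original_rank, new_rank, score) rows, ranks 1-based."""
    original_rank = {doc_id: i for i, doc_id in enumerate(original_ids, start=1)}
    rows = []
    for new_rank, (doc_id, score) in enumerate(reranked, start=1):
        # original_rank is None if the doc was not in the original list
        rows.append((doc_id, original_rank.get(doc_id), new_rank, score))
    return rows


original = ["doc_17", "doc_42", "doc_3"]
reranked = [("doc_42", 0.94), ("doc_17", 0.87), ("doc_3", 0.81)]

for doc_id, old, new, score in rank_changes(original, reranked):
    print(f"{doc_id}: vector rank {old} -> reranked {new} (score: {score:.2f})")
```

If every row shows the same rank before and after across many test queries, the reranker is probably not being applied, or its scores are not the ones used for sorting.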

Common failures

  1. Cross-encoder model too slow for the latency budget: Large models like cross-encoder/ms-marco-electra-base may introduce 200–500 ms per candidate. Use MiniLM variants for latency-sensitive pipelines, and batch all candidates into a single predict() call rather than scoring pairs one at a time.
  2. Input truncation loses relevant content: Cross-encoders have a max token limit (often 512 tokens). Long candidate documents are cut off at the end, so a relevant section that appears late in the document never reaches the model. Either rely on initial retrieval to surface documents where relevant content appears early, or split long candidates into chunks before reranking.
  3. Score calibration inconsistency across models: Cross-encoder scores are not comparable across different model architectures. If you switch models, do not compare new scores to old scores directly; instead, evaluate ranking quality using relative ordering or NDCG metrics.
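
For the truncation failure above, splitting long candidates before reranking might look like the sketch below. Several things here are assumptions: the helper names are hypothetical, whitespace-separated words stand in for model tokens (a real tokenizer will count differently), the 200-word window and 50-word overlap are illustrative values, and taking the maximum chunk score per document is just one possible aggregation. score_fn is any callable that maps a list of (query, text) pairs to a list of scores, such as a thin wrapper around predict().

```python
from typing import Callable, Dict, List, Sequence, Tuple

ScoreFn = Callable[[List[Tuple[str, str]]], List[float]]


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping word windows; short texts are returned whole."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if len(words) <= max_words:
        return [text]
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def score_long_documents(
    query: str,
    candidates: Sequence[Tuple[str, str]],  # (doc_id, text): an assumed shape
    score_fn: ScoreFn,
    max_words: int = 200,
    overlap: int = 50,
) -> Dict[str, float]:
    """Score every chunk in one call and keep the best chunk score per document."""
    pairs: List[Tuple[str, str]] = []
    owners: List[str] = []
    for doc_id, text in candidates:
        for chunk in split_into_chunks(text, max_words, overlap):
            pairs.append((query, chunk))
            owners.append(doc_id)
    best: Dict[str, float] = {}
    for doc_id, score in zip(owners, score_fn(pairs)):
        best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    return best
```

Chunking multiplies the number of pairs to score, so it trades extra latency for coverage of content that truncation would otherwise drop.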

Related guides

  • evaluate-reranking-quality-ndcg
  • implement-cohere-reranking-api