03. Cross-Encoder Setup
Cross-encoders take a query-document pair and output a relevance score. Setting one up requires choosing a model, configuring the inference pipeline, and integrating it into your retrieval flow.
Model Options
The most common cross-encoder models are:
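ms-marco models are trained on MS MARCO, a passage-ranking dataset that Microsoft built from Bing search queries. The cross-encoder checkpoints themselves are published under the cross-encoder namespace used by sentence-transformers. ms-marco-MiniLM-L-6-v2 is a good balance of speed and quality. ms-marco-MiniLM-L-12-v2 offers higher quality at a moderate additional cost.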
BGE models (Beijing Academy of Artificial Intelligence) provide strong open-source alternatives. bge-reranker-base and bge-reranker-large offer good performance under a permissive open-source license (MIT at the time of writing; confirm the exact terms on each model card).
Cohere Rerank is a hosted API option with excellent quality. It removes the need to run your own inference infrastructure, but every request adds network latency and a dependency on an external service.
Using Sentence-Transformers
The sentence-transformers library provides cross-encoder implementations through its CrossEncoder class.
from sentence_transformers import CrossEncoder

# Initialize cross-encoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score a single query-document pair
score = reranker.predict(["What is the reimbursement limit?",
                          "Employees may submit expenses up to $500 without pre-approval."])

# Score multiple documents at once
scores = reranker.predict([
    ["What is the reimbursement limit?", chunk.content]
    for chunk in retrieved_chunks
])
The predict method accepts a list of [query, document] pairs and returns one relevance score per pair (higher = more relevant). Scores are not normalized: they are raw model outputs, so their absolute values carry no fixed meaning, but comparisons between documents scored against the same query are meaningful.
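If a downstream component expects scores in a fixed range, one option is to pass the raw outputs through a logistic sigmoid. The sketch below assumes the raw outputs are unbounded logits (true for many single-label rerankers, but worth checking for your model); the function name and the sample values are illustrative, not part of any library.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function mapping a real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Hypothetical raw scores for three documents (illustrative values only)
raw_scores = [4.2, -1.3, 0.0]
normalized = [sigmoid(s) for s in raw_scores]

def order(scores):
    """Return document indices sorted from most to least relevant."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# The sigmoid is monotonic, so it rescales scores without changing the ranking
same_ranking = order(raw_scores) == order(normalized)
```

Because the ordering is unchanged, this only matters when something consumes the score value itself, such as a fixed relevance threshold.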
Batch Scoring for Production
When reranking large candidate sets, batch processing improves throughput:
def rerank_documents(query, candidates, reranker, batch_size=32, top_k=20):
    """
    Rerank candidate documents and return top-k results.

    Args:
        query: User query string
        candidates: List of document chunks (each exposing a .content string)
        reranker: Initialized CrossEncoder
        batch_size: Documents processed per batch
        top_k: Number of results to return

    Returns:
        List of (chunk, score) tuples sorted by score descending
    """
    # Build one [query, document] pair per candidate
    pairs = [[query, chunk.content] for chunk in candidates]

    # Score the pairs one batch at a time
    all_scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        # Pass batch_size through so predict does not re-split the batch
        scores = reranker.predict(batch, batch_size=batch_size)
        all_scores.extend(scores)

    # Combine each chunk with its score
    scored_chunks = list(zip(candidates, all_scores))

    # Sort by score descending
    ranked = sorted(scored_chunks, key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
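Scoring 100 candidates one pair at a time with a slow cross-encoder could take 10+ seconds. Scored in batches, the same candidates get the same scores and can finish in under 2 seconds on CPU. Exact timings depend on the model, document length, and hardware, so measure on your own setup.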
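CrossEncoder.predict also accepts its own batch_size argument and splits the pairs internally, so the explicit loop is optional. The following is a minimal sketch of that variant. It assumes the candidates are plain strings rather than chunk objects, and the function name is hypothetical.

```python
def rerank_with_builtin_batching(query, texts, reranker, batch_size=32, top_k=20):
    """Score every (query, text) pair in one predict() call and return the top_k.

    `reranker` is any object with a CrossEncoder-style predict(pairs, batch_size=...)
    method; `texts` is assumed to be a list of plain strings.
    """
    if not texts:
        return []
    pairs = [[query, text] for text in texts]
    # predict() handles the batching itself when given batch_size
    scores = reranker.predict(pairs, batch_size=batch_size)
    ranked = sorted(zip(texts, scores), key=lambda item: item[1], reverse=True)
    # Convert scores to plain floats so the result is easy to log or serialize
    return [(text, float(score)) for text, score in ranked[:top_k]]

# Usage, assuming `reranker = CrossEncoder(...)` as in the earlier example:
# top = rerank_with_builtin_batching("What is the reimbursement limit?", texts, reranker)
```

Larger batch sizes tend to help most on GPU; on CPU the gain usually flattens out quickly, so it is worth timing a few values.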
Failure Modes
Model mismatch: Cross-encoders trained on web search data may not transfer well to specialized domains. A reranker trained on MS MARCO (web queries) may misjudge relevance in technical documentation. Evaluate on a labeled sample of your own queries and documents before relying on it.
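One simple way to run that evaluation is to compare mean reciprocal rank (MRR) before and after reranking on a small labeled set. The sketch below is one possible metric, not the only sensible one; the document IDs and relevance labels are invented for illustration.

```python
def mean_reciprocal_rank(ranked_ids_per_query, relevant_ids_per_query):
    """Average of 1/position of the first relevant document for each query.

    A query with no relevant document in its ranking contributes 0.
    """
    if not ranked_ids_per_query:
        return 0.0
    total = 0.0
    for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query):
        for position, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / position
                break
    return total / len(ranked_ids_per_query)

# Hypothetical labels for two queries
relevant = [{"d1"}, {"d5"}]

# Document order from the first-stage retriever vs. after reranking (made-up data)
retriever_order = [["d2", "d3", "d1"], ["d4", "d5"]]
reranked_order = [["d3", "d1", "d2"], ["d5", "d4"]]

mrr_before = mean_reciprocal_rank(retriever_order, relevant)  # (1/3 + 1/2) / 2
mrr_after = mean_reciprocal_rank(reranked_order, relevant)    # (1/2 + 1) / 2 = 0.75
```

If the reranker does not improve the metric on your own data, that is a sign the model does not transfer to your domain.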
Score instability: Score distributions differ between models. Some cross-encoders produce confident scores near 0 or 1, while others cluster in the middle of the range, so raw scores are not comparable across models. Use ranking position, not raw score, when evaluating.
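Context length limits: Most cross-encoders have a maximum sequence length (typically 512 tokens), and the limit covers the query and the document together because they are encoded as one input. Longer inputs are truncated before scoring, so any relevant text past the cutoff is never seen. Chunk your documents so that each query-document pair stays within the limit during reranking.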
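One way to handle documents that exceed the limit is to split them into overlapping windows, score each window, and keep the best score. The sketch below uses word counts as a crude stand-in for token counts, which is an assumption: real tokenizers usually produce more tokens than words, so the default of 200 words is a deliberately conservative guess. The function names and the max-pooling choice are illustrative.

```python
def split_into_windows(text, max_words=200, overlap=20):
    """Split text into overlapping word windows (a rough proxy for token limits)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    if len(words) <= max_words:
        return [" ".join(words)]
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        # Stop once a window reaches the end of the document
        if start + max_words >= len(words):
            break
    return windows

def score_long_document(query, text, reranker, max_words=200, overlap=20):
    """Score each window against the query and return the highest score."""
    windows = split_into_windows(text, max_words, overlap)
    if not windows:
        return None
    scores = reranker.predict([[query, window] for window in windows])
    return max(float(score) for score in scores)
```

Taking the maximum treats a document as relevant if any part of it is relevant. For a tighter bound, count tokens with the model's own tokenizer instead of splitting on whitespace.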
Initialize a cross-encoder model, create 10 query-document pairs, and score them. Compare the top-3 ranked results against cosine similarity from your embedding model. Identify two cases where rankings differ and explain why.
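A possible scaffold for the comparison step is sketched below. It assumes you already have a query embedding, one embedding per document, and one cross-encoder score per document; the helper names are hypothetical, and the explanation of why the rankings differ is still left to you.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_indices(scores, k=3):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def ranking_disagreements(cosine_scores, cross_encoder_scores, k=3):
    """Report which documents appear in only one of the two top-k lists."""
    top_cos = top_k_indices(cosine_scores, k)
    top_ce = top_k_indices(cross_encoder_scores, k)
    return {
        "cosine_top_k": top_cos,
        "cross_encoder_top_k": top_ce,
        "only_in_cosine": [i for i in top_cos if i not in top_ce],
        "only_in_cross_encoder": [i for i in top_ce if i not in top_cos],
    }
```

The documents listed under only_in_cosine and only_in_cross_encoder are the natural starting points for the two cases to explain.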