Cross-Encoder Setup — RAG Systems: Part 2 (Chapter 3)

Cross-encoders take a query-document pair and output a relevance score. Setting one up requires choosing a model, configuring the inference pipeline, and integrating it into your retrieval flow.

Model Options

The most common cross-encoder models are:

ms-marco models are trained on MS MARCO, Microsoft's passage-ranking dataset built from real Bing search queries. ms-marco-MiniLM-L-6-v2 is a good balance of speed and quality. ms-marco-MiniLM-L-12-v2 offers higher quality at a moderate increase in inference cost.

BGE models (Beijing Academy of Artificial Intelligence) provide strong open-source alternatives. bge-reranker-base and bge-reranker-large offer good performance under permissive open-source licensing; check each model card for the exact license before deploying.

Cohere Rerank is a hosted API option with excellent quality. It handles the infrastructure for you but adds network latency and a dependency on an external service.

Using Sentence-Transformers

The sentence-transformers library provides cross-encoder implementations through its CrossEncoder class.

from sentence_transformers import CrossEncoder

# Initialize cross-encoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score a single query-document pair
score = reranker.predict(["What is the reimbursement limit?", 
                          "Employees may submit expenses up to $500 without pre-approval."])

# Score multiple documents at once
scores = reranker.predict([
    ["What is the reimbursement limit?", chunk.content] 
    for chunk in retrieved_chunks
])

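The predict method accepts a list of [query, document] pairs and returns one relevance score per pair (higher = more relevant). For this model the scores are raw outputs rather than normalized probabilities, so their absolute values carry no fixed meaning, but ordinal comparisons between documents for the same query are meaningful.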
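Because only the ordering matters, any monotonic transform of the scores leaves the ranking unchanged. The sketch below uses made-up score values (not real model output) to show that squashing raw scores through a sigmoid changes the numbers but not the order. The helper names `sigmoid` and `rank_order` are illustrative and are not part of sentence-transformers.

```python
import math

def sigmoid(x):
    """Map a raw score to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

def rank_order(scores):
    """Return document indices sorted from most to least relevant."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical raw scores for four documents (assumed values for illustration)
raw_scores = [-4.2, 7.9, 0.3, 2.1]
squashed = [sigmoid(s) for s in raw_scores]

# The transform changes the values but not the ordering
raw_ranking = rank_order(raw_scores)
squashed_ranking = rank_order(squashed)
print(raw_ranking == squashed_ranking)
```

A transform like this can make scores easier to read in logs, but it does not make scores from different models comparable.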

Batch Scoring for Production

When reranking large candidate sets, batch processing improves throughput:

def rerank_documents(query, candidates, reranker, batch_size=32, top_k=20):
    """
    Rerank candidate documents and return top-k results.
    
    Args:
        query: User query string
        candidates: List of document chunks
        reranker: Initialized CrossEncoder
        batch_size: Documents processed per batch
        top_k: Number of results to return
    
    Returns:
        List of (chunk, score) tuples sorted by score descending
    """
    pairs = [[query, chunk.content] for chunk in candidates]
    
    # predict() batches internally; batch_size sets the pairs per forward pass.
    # Guard the empty case so an empty candidate list returns an empty result.
    all_scores = reranker.predict(pairs, batch_size=batch_size) if pairs else []
    
    # Pair each chunk with its score
    scored_chunks = list(zip(candidates, all_scores))
    
    # Sort by score descending
    ranked = sorted(scored_chunks, key=lambda x: x[1], reverse=True)
    
    return ranked[:top_k]

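Batching changes throughput, not the scores. Scoring 100 candidates one pair at a time with a slow cross-encoder can take 10+ seconds; scoring the same candidates in batches can finish in under 2 seconds on CPU. Exact timings depend on the model, document length, and hardware, so measure on your own setup.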
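To check what batching buys on your own hardware, a small timing harness is usually enough. The sketch below is a minimal version under stated assumptions: `time_scoring` is a hypothetical helper, and `dummy_predict` is a stand-in scorer based on word overlap so the example runs without downloading a model.

```python
import time

def time_scoring(predict_fn, pairs, batch_size):
    """Score all pairs in chunks of batch_size; return (scores, elapsed seconds)."""
    start = time.perf_counter()
    scores = []
    for i in range(0, len(pairs), batch_size):
        scores.extend(predict_fn(pairs[i:i + batch_size]))
    return scores, time.perf_counter() - start

def dummy_predict(batch):
    """Stand-in scorer (an assumption): count of words shared by query and doc."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in batch
    ]

# Synthetic query-document pairs for the timing run
pairs = [["expense limit", f"Policy {n}: expense limit is ${n * 100}"]
         for n in range(100)]

for size in (1, 8, 32):
    scores, elapsed = time_scoring(dummy_predict, pairs, size)
    print(f"batch_size={size:>2}: {len(scores)} pairs in {elapsed * 1000:.2f} ms")
```

With a real model, pass `reranker.predict` in place of `dummy_predict`. The dummy scorer is too cheap to show a meaningful speedup; the harness only becomes informative once a real cross-encoder is plugged in.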

Failure Modes

Model mismatch: Cross-encoders trained on web search data may not transfer well to specialized domains. A reranker trained on MS MARCO (web queries) might not score technical documentation relevance well. Evaluate with your actual data.
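One way to evaluate with your own data is to label a small set of queries with their relevant documents and compute a ranking metric over the reranker's output. The sketch below uses mean reciprocal rank (MRR) as one reasonable choice among several; the document ids and relevance labels are invented for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none appears."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)

# Hypothetical labeled queries: reranked doc ids plus the ids judged relevant
labeled_runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant document at rank 2
    (["d2", "d9", "d4"], {"d2"}),  # first relevant document at rank 1
    (["d5", "d6", "d8"], {"d0"}),  # no relevant document returned
]

mrr = mean_reciprocal_rank(labeled_runs)
print(f"MRR: {mrr:.3f}")
```

Running the same labeled set through two candidate rerankers, or with and without reranking, gives a direct comparison on your domain.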

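Score instability: Score distributions differ from model to model. Some cross-encoders produce confident scores near 0 or 1, while others cluster in the middle of the range, so the same raw value can mean different things under different models. Don't compare raw scores across models. Use ranking position, not raw score, when evaluating.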
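A rank-based comparison can be sketched as follows. The example measures how much two models agree on the top-k documents while ignoring the score scale entirely. The scores and the helper names `top_k_ids` and `top_k_overlap` are hypothetical.

```python
def top_k_ids(scored, k):
    """Ids of the k highest-scoring documents from a {doc_id: score} mapping."""
    ordered = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ordered[:k]]

def top_k_overlap(scored_a, scored_b, k):
    """Fraction of the top-k documents on which two models agree."""
    ids_a = set(top_k_ids(scored_a, k))
    ids_b = set(top_k_ids(scored_b, k))
    return len(ids_a & ids_b) / k

# Hypothetical outputs: model A emits wide-ranging scores, model B stays mid-range
model_a = {"d1": 8.4, "d2": -3.1, "d3": 2.7, "d4": -6.0}
model_b = {"d1": 0.62, "d2": 0.41, "d3": 0.58, "d4": 0.55}

overlap = top_k_overlap(model_a, model_b, k=2)
print(f"top-2 overlap: {overlap:.2f}")
```

The raw values here are on completely different scales, yet both models put the same two documents first, which is the property that matters for reranking.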

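Context length limits: Most cross-encoders have a maximum sequence length (typically 512 tokens) that covers the query and the document together. Longer inputs get truncated before scoring, so relevant text near the end of a long document may never be seen by the model. Chunk your documents so that each query-chunk pair stays within the limit during reranking.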
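When a document cannot be chunked ahead of time, one workaround is to split it into overlapping windows at reranking time, score each window, and keep the best score. The sketch below is a simplification under stated assumptions: it counts whitespace-separated words rather than model tokens, `toy_score` is a stand-in for a real cross-encoder call, and taking the maximum over windows is one aggregation choice among several.

```python
def split_into_windows(text, max_tokens, overlap):
    """Split text into overlapping windows of at most max_tokens words."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    if not words:
        return []
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reaches the end of the text
    return windows

def score_long_document(query, text, score_fn, max_tokens=200, overlap=50):
    """Score each window separately and keep the best window's score."""
    windows = split_into_windows(text, max_tokens, overlap)
    if not windows:
        return float("-inf")
    return max(score_fn(query, window) for window in windows)

def toy_score(query, window):
    """Stand-in scorer (an assumption): number of query words in the window."""
    window_words = set(window.split())
    return sum(word in window_words for word in query.split())

long_text = " ".join(f"w{i}" for i in range(10))
windows = split_into_windows(long_text, max_tokens=4, overlap=1)
best = score_long_document("w8 w9", long_text, toy_score, max_tokens=4, overlap=1)
print(windows, best)
```

In a real pipeline the window size would be measured with the model's own tokenizer and would leave room for the query and special tokens.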