RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RAG Systems: Part 2

03. Cross-Encoder Setup

Chapter 3 of 22 · 20 min
KEY INSIGHT

Cross-encoders compute joint query-document attention, which is slower than bi-encoder vector comparison but captures relevance that bi-encoders miss.

Cross-encoders take a query-document pair and output a relevance score. Setting one up requires choosing a model, configuring the inference pipeline, and integrating it into your retrieval flow.
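As a rough illustration of where the cross-encoder sits in the retrieval flow, here is a minimal two-stage sketch: a cheap first stage fetches candidates, then a pair scorer reorders them. The names retrieve_then_rerank, toy_retrieve, and toy_score_pairs are hypothetical and exist only for this example; the toy scorer counts shared words and merely stands in for a real cross-encoder so the sketch runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases for illustration; not part of any library.
Retriever = Callable[[str, int], List[str]]
PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def retrieve_then_rerank(
    query: str,
    retrieve: Retriever,
    score_pairs: PairScorer,
    fetch_k: int = 50,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Fetch fetch_k candidates cheaply, then rescore each one jointly with the query."""
    candidates = retrieve(query, fetch_k)
    if not candidates:
        return []
    scores = score_pairs([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]


# Toy stand-ins (assumptions) so the sketch runs without a model.
def toy_retrieve(query: str, k: int) -> List[str]:
    corpus = [
        "Expenses up to $500 need no pre-approval.",
        "The cafeteria opens at 8am.",
        "Reimbursement requests are paid within 30 days.",
    ]
    return corpus[:k]


def toy_score_pairs(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    # Shared-word count: a placeholder for a real cross-encoder's scores.
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


results = retrieve_then_rerank(
    "reimbursement requests limit", toy_retrieve, toy_score_pairs, top_k=2
)
```

With a real model, the predict method of an initialized CrossEncoder should be usable as score_pairs, since it takes a list of query-document pairs and returns one score per pair.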

Model Options

The most common cross-encoder models are:

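MS MARCO models are trained on the MS MARCO passage-ranking dataset, which Microsoft built from real Bing search queries. The checkpoints themselves are published under the cross-encoder organization on Hugging Face. ms-marco-MiniLM-L-6-v2 is a good balance of speed and quality. ms-marco-MiniLM-L-12-v2 has twice as many transformer layers and can offer somewhat higher quality at a moderately higher inference cost.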

BGE models (Beijing Academy of Artificial Intelligence, BAAI) provide strong open-source alternatives. bge-reranker-base and bge-reranker-large offer good performance under a permissive open-source license (MIT, per their Hugging Face model cards).

Cohere Rerank is a hosted API option with excellent quality. It takes infrastructure off your hands, but every request adds network latency and a dependency on an external service.

Using Sentence-Transformers

The sentence-transformers library provides cross-encoder implementations through its CrossEncoder class.

from sentence_transformers import CrossEncoder

# Initialize cross-encoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score a single query-document pair
score = reranker.predict(["What is the reimbursement limit?", 
                          "Employees may submit expenses up to $500 without pre-approval."])

# Score multiple documents at once.
# retrieved_chunks: candidates from first-stage retrieval, each with a .content string
scores = reranker.predict([
    ["What is the reimbursement limit?", chunk.content] 
    for chunk in retrieved_chunks
])

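The predict method accepts a list of [query, document] pairs and returns one relevance score per pair (higher = more relevant). Scores are not normalized. They are raw model outputs, so absolute values mean little on their own, but ordinal comparisons between documents scored against the same query are meaningful.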
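To make the point about ordinal comparisons concrete, the sketch below ranks documents by raw score and shows that squashing the scores through a sigmoid leaves the order unchanged, because the sigmoid is monotonic. The helper names sigmoid and rank_order and the score values are made up for illustration. Whether a squashed score is a usable probability depends on how the model was trained, so treat that as an assumption.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Map a raw score to (0, 1); written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rank_order(scores: Sequence[float]) -> List[int]:
    """Return document indices ordered from most to least relevant."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


raw_scores = [8.2, -4.1, 0.0]  # made-up values, for illustration only
order_raw = rank_order(raw_scores)
order_squashed = rank_order([sigmoid(s) for s in raw_scores])
```

Both orderings come out identical, which is why sorting on the raw outputs is enough for reranking.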

Batch Scoring for Production

When reranking large candidate sets, batch processing improves throughput:

from tqdm import tqdm

def rerank_documents(query, candidates, reranker, batch_size=32, top_k=20):
    """
    Rerank candidate documents and return top-k results.
    
    Args:
        query: User query string
        candidates: List of document chunks
        reranker: Initialized CrossEncoder
        batch_size: Documents processed per batch
        top_k: Number of results to return
    
    Returns:
        List of (chunk, score) tuples sorted by score descending
    """
    pairs = [[query, chunk.content] for chunk in candidates]
    
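    # Score in fixed-size batches; tqdm shows progress on large candidate sets.
    # predict() also batches internally, so batch_size is passed through.
    all_scores = []
    for i in tqdm(range(0, len(pairs), batch_size), desc="Reranking"):
        batch = pairs[i:i + batch_size]
        scores = reranker.predict(batch, batch_size=batch_size)
        all_scores.extend(scores)
    
    # Pair each chunk with its score
    scored_chunks = list(zip(candidates, all_scores))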
    
    # Sort by score descending
    ranked = sorted(scored_chunks, key=lambda x: x[1], reverse=True)
    
    return ranked[:top_k]

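Scored one pair at a time, 100 candidates with a slow cross-encoder could take 10+ seconds. Batched, the same scores come back in under 2 seconds on CPU. Batching changes throughput, not ranking quality, and exact timings depend on the model and hardware, so measure on your own setup.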
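One way to check the effect of batch size on your own hardware is a small timing harness like the sketch below. The function time_scoring and the stand-in fake_predict are hypothetical. The fake scorer sleeps briefly on every call to imitate per-call overhead, so the numbers it produces say nothing about a real model.

```python
import time
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def time_scoring(
    predict: Callable[[Sequence[Pair]], Sequence[float]],
    pairs: Sequence[Pair],
    batch_size: int,
) -> Tuple[List[float], float]:
    """Score pairs in chunks of batch_size; return (scores, elapsed seconds)."""
    scores: List[float] = []
    start = time.perf_counter()
    for i in range(0, len(pairs), batch_size):
        scores.extend(float(s) for s in predict(pairs[i:i + batch_size]))
    return scores, time.perf_counter() - start


# Placeholder scorer (assumption): fixed overhead per call, score = document length.
def fake_predict(batch: Sequence[Pair]) -> List[float]:
    time.sleep(0.001)  # simulated per-call cost
    return [float(len(doc)) for _, doc in batch]


pairs = [("query", "x" * n) for n in range(100)]
one_by_one, t_single = time_scoring(fake_predict, pairs, batch_size=1)
batched, t_batched = time_scoring(fake_predict, pairs, batch_size=32)
```

To measure a real model, pass the predict method of an initialized CrossEncoder in place of fake_predict and compare the two elapsed times.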

Failure Modes

Model mismatch: Cross-encoders trained on web search data may not transfer well to specialized domains. A reranker trained on MS MARCO (web queries) may rank technical documentation poorly. Evaluate on your own data before relying on it.
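Evaluating on your own data can be as simple as labeling a few queries with their relevant documents and comparing a ranking metric before and after reranking. The sketch below uses mean reciprocal rank (MRR), one common choice among several. The document IDs, labels, and rankings are invented for illustration.

```python
from typing import Dict, List, Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1/rank of the first relevant document, or 0.0 if none was returned."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]], labels: Dict[str, Set[str]]
) -> float:
    """Average reciprocal rank over all labeled queries."""
    if not labels:
        return 0.0
    total = sum(
        reciprocal_rank(rankings.get(query, []), relevant)
        for query, relevant in labels.items()
    )
    return total / len(labels)


# Hypothetical labels and rankings, for illustration only.
labels = {"q1": {"doc_a"}, "q2": {"doc_d"}}
before_rerank = {"q1": ["doc_b", "doc_a"], "q2": ["doc_c", "doc_e", "doc_d"]}
after_rerank = {"q1": ["doc_a", "doc_b"], "q2": ["doc_c", "doc_d", "doc_e"]}

mrr_before = mean_reciprocal_rank(before_rerank, labels)
mrr_after = mean_reciprocal_rank(after_rerank, labels)
```

If the metric does not improve on your labeled queries, that is a sign the reranker's training domain does not match yours.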

Score instability: Score distributions differ by model. Some cross-encoders produce confident scores near 0 or 1, while others cluster in the middle of the range. Don't compare raw scores across models. When evaluating, use ranking position instead.
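A position-based comparison might look like the following sketch, which measures how much of the top-k two models agree on while ignoring the score values themselves. The helpers top_k_indices and top_k_overlap and the two score lists are hypothetical, chosen so that one model is confident and the other clusters mid-range.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring documents, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_overlap(
    scores_a: Sequence[float], scores_b: Sequence[float], k: int
) -> float:
    """Fraction of the top-k documents that both models agree on."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(top_k_indices(scores_a, k))
    top_b = set(top_k_indices(scores_b, k))
    return len(top_a & top_b) / k


# Made-up scores for the same four documents from two different models.
model_a = [0.98, 0.02, 0.91, 0.05]  # confident, near 0 or 1
model_b = [0.61, 0.48, 0.44, 0.52]  # clustered mid-range

agreement = top_k_overlap(model_a, model_b, k=2)
```

The raw values from the two models are on very different scales, yet the overlap is still well defined because it depends only on order.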

Context length limits: Most cross-encoders have a maximum sequence length (typically 512 tokens), and the query and document share it. Longer inputs are truncated before scoring, so text past the limit never influences the score. Chunk your documents so that each query-chunk pair fits within the limit.
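One possible way to handle a document that exceeds the limit is to split it into overlapping windows, score each window against the query, and keep the best window's score. The sketch below makes two assumptions worth stating loudly: it counts whitespace-separated words as a rough proxy for tokens (real counts require the model's tokenizer), and it takes the maximum over windows, which is one aggregation choice among several. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def split_into_windows(text: str, max_words: int, overlap: int = 0) -> List[str]:
    """Split text into word windows of at most max_words, optionally overlapping."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    windows: List[str] = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def score_long_document(
    query: str,
    document: str,
    score_pairs: Callable[[Sequence[Pair]], Sequence[float]],
    max_words: int = 200,  # word budget as a token proxy (assumption)
    overlap: int = 20,
) -> float:
    """Score each window against the query and return the best window's score."""
    windows = split_into_windows(document, max_words, overlap)
    if not windows:
        return float("-inf")
    return max(float(s) for s in score_pairs([(query, w) for w in windows]))


# Toy scorer (assumption): 1.0 if the window mentions "limit", else 0.0.
def toy_score_pairs(pairs: Sequence[Pair]) -> List[float]:
    return [1.0 if "limit" in doc.split() else 0.0 for _, doc in pairs]


windows = split_into_windows(" ".join(f"w{i}" for i in range(10)), 4, overlap=1)
long_doc = " ".join(["filler"] * 300 + ["limit"])
best = score_long_document("reimbursement limit", long_doc, toy_score_pairs)
```

In the toy document the relevant word sits past the 200-word budget, so plain truncation would drop it, while the windowed version still finds it.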

EXERCISE

Initialize a cross-encoder model, create 10 query-document pairs, and score them. Compare the cross-encoder's top-3 results against the top-3 by cosine similarity from your embedding model. Identify two cases where the rankings differ and explain why.
