RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Advanced RAG — Chunking, Retrieval, Re-ranking

09. Cross-Encoder Setup

Chapter 9 of 24 · 15 min
KEY INSIGHT

Cross-encoders provide more accurate relevance signals than bi-encoder similarity, but the computational cost limits their use to reranking post-retrieval candidates.

Cross-encoders jointly encode query-document pairs, enabling precise relevance scoring at the cost of computation time. They serve as rerankers that refine initial retrieval results.

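Architecture difference: Bi-encoders (used in dense retrieval) encode queries and documents independently, producing embeddings that are compared with a similarity function such as cosine similarity. Because a document embedding does not depend on the query, it can be computed once at index time. Cross-encoders take the concatenated query and document as a single input, so the model can attend across both texts, and output one relevance score per pair.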
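
The difference in call pattern can be sketched with toy stand-ins. The functions below are hypothetical and use word counts rather than neural networks; they only illustrate that a bi-encoder maps each text to a vector on its own (so document vectors can be cached), while a cross-encoder needs the query and the document together before it can produce a score.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: encodes ONE text into a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_pair_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: sees the query AND the document together.
    Returns the fraction of query words that also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0

docs = ["cross encoders score pairs", "bi encoders embed texts independently"]

# Bi-encoder pattern: document vectors are computed once, before any query arrives.
doc_vectors = [toy_embed(d) for d in docs]

query = "how do cross encoders score"
query_vector = toy_embed(query)
bi_scores = [cosine(query_vector, v) for v in doc_vectors]

# Cross-encoder pattern: nothing is precomputed; every (query, doc) pair is scored at query time.
cross_scores = [toy_pair_score(query, d) for d in docs]

print("bi-encoder style:   ", bi_scores)
print("cross-encoder style:", cross_scores)
```

With real models the scoring functions are neural networks, but the shape of the work is the same: one encoding per text on the bi-encoder side, one model call per pair on the cross-encoder side.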

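When to use cross-encoders: Use them after initial retrieval has narrowed the candidates to a manageable set (typically 50-100). Scoring millions of documents with a cross-encoder is computationally prohibitive, because every query needs one forward pass per document and none of that work can be precomputed.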
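
A back-of-the-envelope calculation shows why the candidate set has to be small. The throughput figure below is an assumed placeholder chosen for round numbers, not a measurement of any model or GPU; substitute a figure measured on your own hardware.

```python
def rerank_seconds(num_pairs: int, pairs_per_second: float) -> float:
    """Estimated wall-clock time to score num_pairs query-document pairs."""
    return num_pairs / pairs_per_second

# ASSUMPTION: illustrative throughput only, not a benchmark result.
ASSUMED_PAIRS_PER_SECOND = 1_000.0

# Scoring the whole corpus means one forward pass per document, for every query.
full_corpus = rerank_seconds(1_000_000, ASSUMED_PAIRS_PER_SECOND)

# Scoring a retrieved shortlist means one forward pass per candidate.
shortlist = rerank_seconds(100, ASSUMED_PAIRS_PER_SECOND)

print(f"1,000,000 documents: {full_corpus:.0f} s per query")
print(f"100 candidates:      {shortlist:.1f} s per query")
```

Under this assumption the full-corpus pass takes minutes per query while the shortlist takes a fraction of a second, and the gap scales linearly with corpus size.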

from typing import List

from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    def __init__(self, model_name: str, max_length: int = 512):
        """
        Initialize cross-encoder reranker.
        
        Args:
            model_name: Hugging Face model identifier (e.g., 'cross-encoder/ms-marco-MiniLM-L-6-v2')
            max_length: Maximum sequence length
        """
        self.model = CrossEncoder(model_name, max_length=max_length)
    
    def rerank(self, query: str, candidates: List[dict], top_k: int = 10) -> List[dict]:
        """
        Rerank candidate documents by cross-encoder relevance scores.
        
        Args:
            query: User query string
            candidates: List of dicts with 'text' or 'content' field
            top_k: Number of results to return
        
        Returns:
            Reranked list with relevance scores
        """
        # Nothing to score: return early instead of calling the model
        if not candidates:
            return []

        # Build (query, document) pairs, accepting either a 'text' or a 'content' field
        pairs = [
            (query, candidate.get('text', candidate.get('content', '')))
            for candidate in candidates
        ]

        # One relevance score per pair, in the same order as candidates
        scores = self.model.predict(pairs)
        
        # Combine with original metadata and sort
        scored_candidates = []
        for candidate, score in zip(candidates, scores):
            scored = candidate.copy()
            scored['cross_encoder_score'] = float(score)
            scored_candidates.append(scored)
        
        scored_candidates.sort(key=lambda x: x['cross_encoder_score'], reverse=True)
        
        return scored_candidates[:top_k]
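
To check the sorting and metadata handling without downloading a model, the same logic can be exercised against a stub. `StubModel` and `rerank_with` below are hypothetical helpers, not part of sentence-transformers. The only assumption about the real model is the one the class above already makes: `predict` accepts a list of (query, document) pairs and returns one score per pair.

```python
from typing import Dict, List, Sequence, Tuple

class StubModel:
    """Hypothetical stand-in for CrossEncoder: scores a pair by counting shared words."""

    def predict(self, pairs: Sequence[Tuple[str, str]]) -> List[float]:
        return [
            float(len(set(q.lower().split()) & set(d.lower().split())))
            for q, d in pairs
        ]

def rerank_with(model, query: str, candidates: List[Dict], top_k: int = 10) -> List[Dict]:
    """Mirrors the rerank method above, with the model passed in and an empty-input guard."""
    if not candidates:
        return []
    pairs = [(query, c.get("text", c.get("content", ""))) for c in candidates]
    scores = model.predict(pairs)
    # Copy each candidate so the caller's dicts are not mutated
    scored = [{**c, "cross_encoder_score": float(s)} for c, s in zip(candidates, scores)]
    scored.sort(key=lambda c: c["cross_encoder_score"], reverse=True)
    return scored[:top_k]

candidates = [
    {"id": "a", "text": "GPU memory requirements for local models"},
    {"id": "b", "content": "Reranking with cross encoders after retrieval"},
    {"id": "c", "text": "Cross encoders score query document pairs"},
]

results = rerank_with(StubModel(), "cross encoders for reranking", candidates, top_k=2)
for r in results:
    print(r["id"], r["cross_encoder_score"])
```

The stub's scores say nothing about real relevance; the point is that candidates with either a 'text' or a 'content' field are handled, the original dicts are left untouched, and the output is sorted and truncated to `top_k`.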

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Measure latency (P50, P95, P99) for cross-encoder reranking of 100 candidates. Compare against pure dense retrieval latency.
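
One possible starting point for the measurement is a small timing harness built on the standard library. The helper names and the placeholder workload are hypothetical; swap the placeholder for a call into your own reranker or retriever.

```python
import statistics
import time
from typing import Callable, Dict, List

def measure_latency_ms(fn: Callable[[], object], runs: int = 50, warmup: int = 5) -> List[float]:
    """Call fn repeatedly and return per-call latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (model load, caches, lazy initialisation)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Return P50, P95, and P99 from a list of latency samples."""
    # quantiles(n=100) returns 99 cut points; cut point i-1 is the i-th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def placeholder_workload() -> None:
    """Stand-in for the code under test; replace with a real reranking call."""
    sum(i * i for i in range(10_000))

stats = summarize(measure_latency_ms(placeholder_workload))
print({name: round(value, 3) for name, value in stats.items()})
```

To time the chapter's reranker, pass something like `lambda: reranker.rerank(query, candidates)`, where `reranker`, `query`, and a 100-item `candidates` list are objects you have already built. With only 50 runs the P99 value sits close to the slowest sample, so increase `runs` if the tail matters.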
