RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Advanced RAG — Chunking, Retrieval, Re-ranking

10. Local Cross-Encoder Models

Chapter 10 of 24 · 15 min
KEY INSIGHT

Local cross-encoders enable offline reranking but require hardware-aware model selection to balance latency and quality.

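Running a cross-encoder locally removes the dependency on a hosted reranking API and lets the pipeline work offline. The trade-off is that the model has to fit your hardware: available memory caps the model size, and the choice of CPU or GPU determines how many candidates you can rerank within a latency budget.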

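Model size vs. quality: Larger models generally rank better but need more memory and more compute per query-document pair. cross-encoder/ms-marco-MiniLM-L-6-v2 (about 22M parameters) is small enough to run on CPU; cross-encoder/ms-marco-MiniLM-L-12-v2 (about 33M parameters, with twice the transformer layers) roughly doubles the per-pair compute, so a GPU helps keep latency acceptable when reranking many candidates.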

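Quantization reduces memory footprint at a small accuracy cost. INT8 stores each weight in one byte, which halves weight memory relative to FP16 and quarters it relative to FP32. It typically preserves most of the ranking quality (95%+ is a common rule of thumb), but the loss varies by model and dataset, so measure it on your own reranking task.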
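The memory side of this trade-off is simple arithmetic: parameter count times bytes per weight. The sketch below illustrates it for the 22M-parameter model mentioned above. The helper name `weight_memory_mb` is hypothetical, and the figures cover weights only; activations and framework overhead are extra, and real INT8 loaders may keep some layers in higher precision.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mb(num_params: int, precision: str) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes).

    Weights only: activations, tokenizer state, and framework overhead
    come on top and depend on batch size and sequence length.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e6


if __name__ == "__main__":
    params = 22_000_000  # the 22M-parameter MiniLM-L-6 model from the text
    for precision in ("fp32", "fp16", "int8"):
        print(f"{precision}: {weight_memory_mb(params, precision):.0f} MB")
```

At this scale the weights are small in every precision, so quantization matters more for larger rerankers or for machines that share memory with an LLM.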

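from typing import List, Optional

import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)


class LocalCrossEncoder:
    def __init__(self, model_name: str, device: Optional[str] = None, quantize: bool = False):
        self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
        on_gpu = self.device.startswith('cuda')
        if quantize and not on_gpu:
            # 8-bit loading goes through bitsandbytes, which is primarily a CUDA path.
            raise ValueError('quantize=True requires a CUDA device')

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # float16 only pays off on GPU; on CPU many half-precision ops are slow or missing.
        load_kwargs = {'torch_dtype': torch.float16 if on_gpu else torch.float32}
        if quantize:
            # Requires the bitsandbytes and accelerate packages.
            load_kwargs['quantization_config'] = BitsAndBytesConfig(load_in_8bit=True)
            load_kwargs['device_map'] = 'auto'

        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            **load_kwargs
        )

        # 8-bit models are placed by device_map and must not be moved with .to().
        if not quantize:
            self.model.to(self.device)
        self.model.eval()

    def score(self, query: str, documents: List[str]) -> List[float]:
        """Score query-document pairs; a higher score means more relevant."""
        if not documents:
            return []

        pairs = [(query, doc) for doc in documents]
        inputs = self.tokenizer(
            pairs,
            padding=True,
            truncation=True,
            return_tensors='pt',
            max_length=512
        ).to(self.model.device)

        with torch.no_grad():
            # MS MARCO cross-encoders emit one logit per pair: shape (batch, 1).
            scores = self.model(**inputs).logits.squeeze(-1)

        return scores.float().cpu().tolist()

    def rerank(self, query: str, candidates: List[dict], top_k: int = 10) -> List[dict]:
        """Return the top_k candidates by score.

        Adds a 'rerank_score' key to each candidate dict in place; the
        caller's list order is left unchanged.
        """
        texts = [c.get('text', c.get('content', '')) for c in candidates]
        scores = self.score(query, texts)

        for candidate, score in zip(candidates, scores):
            candidate['rerank_score'] = float(score)

        ranked = sorted(candidates, key=lambda c: c['rerank_score'], reverse=True)
        return ranked[:top_k]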
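The `score` method above tokenizes every document in a single batch, so peak memory grows with the number of candidates. One way to bound it is to score in fixed-size chunks. The following is a minimal sketch rather than part of the class: `score_in_batches` and `overlap_scorer` are hypothetical names, and the word-overlap scorer only stands in for a real model so the example runs without downloading weights.

```python
from typing import Callable, List, Sequence

# Any callable with the same shape as LocalCrossEncoder.score.
ScoreFn = Callable[[str, List[str]], List[float]]


def score_in_batches(
    score_fn: ScoreFn,
    query: str,
    documents: Sequence[str],
    batch_size: int = 32,
) -> List[float]:
    """Call score_fn on chunks of at most batch_size documents.

    Scores come back in the same order as the input documents.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    scores: List[float] = []
    for start in range(0, len(documents), batch_size):
        batch = list(documents[start:start + batch_size])
        scores.extend(score_fn(query, batch))
    return scores


def overlap_scorer(query: str, documents: List[str]) -> List[float]:
    """Stand-in scorer: number of lowercase words shared with the query."""
    query_words = set(query.lower().split())
    return [float(len(query_words & set(doc.lower().split()))) for doc in documents]


if __name__ == "__main__":
    docs = ["local reranking on cpu", "cloud api pricing", "cpu latency tips"]
    print(score_in_batches(overlap_scorer, "cpu reranking", docs, batch_size=2))
```

With a real model, `score_fn` would be the `score` method of a `LocalCrossEncoder` instance, and the batch size would be tuned to the memory available on the target device.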

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
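One way to capture the details this checkpoint asks for is a small record written next to your results. This sketch uses only the standard library; the function names and the package list are illustrative assumptions, so adjust them to whatever your environment actually uses.

```python
import json
import platform
from importlib import metadata
from typing import Dict, Iterable


def package_version(name: str) -> str:
    """Installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_record(packages: Iterable[str], notes: str = "") -> Dict[str, object]:
    """Collect runtime details worth keeping beside an exercise result."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
        "notes": notes,  # e.g. model name, CPU/GPU backend, available memory
    }


if __name__ == "__main__":
    record = environment_record(
        ["torch", "transformers", "bitsandbytes"],
        notes="model, device, and data path go here",
    )
    print(json.dumps(record, indent=2))
```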

EXERCISE

Benchmark a quantized vs. full-precision model on your reranking task. Measure accuracy drop vs. latency improvement.
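A possible starting point for this exercise is the harness below. It times any scoring function and measures how much of the full-precision top-k the quantized model keeps. All names here are hypothetical, and top-k overlap is only a proxy for accuracy: if you have relevance labels, a task metric such as NDCG or MRR is the better measure.

```python
import statistics
import time
from typing import Callable, List

# Any callable with the same shape as LocalCrossEncoder.score.
ScoreFn = Callable[[str, List[str]], List[float]]


def median_latency_ms(score_fn: ScoreFn, query: str, docs: List[str], runs: int = 5) -> float:
    """Median wall-clock time of score_fn over several runs, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        score_fn(query, docs)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


def top_k_ids(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_overlap(baseline: List[float], candidate: List[float], k: int) -> float:
    """Fraction of the baseline top-k that also appears in the candidate top-k."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must have the same length")
    if not 1 <= k <= len(baseline):
        raise ValueError("k must be between 1 and the number of documents")
    shared = set(top_k_ids(baseline, k)) & set(top_k_ids(candidate, k))
    return len(shared) / k


if __name__ == "__main__":
    # Made-up scores standing in for full-precision and quantized model outputs.
    full_precision = [0.9, 0.1, 0.5, 0.3]
    quantized = [0.8, 0.2, 0.1, 0.6]
    print("top-2 overlap:", top_k_overlap(full_precision, quantized, k=2))
```

When timing GPU models, run one untimed warm-up call first, since the first call often includes one-off setup cost.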
