Local Cross-Encoder Models — Advanced RAG — Chunking, Retrieval, Re-ranking (Chapter 10)

Running cross-encoders locally eliminates API dependencies and enables offline operation. Local models require careful selection based on hardware constraints.

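Model size vs. quality: Larger models (more parameters) generally rank more accurately but need more memory and more compute per query-document pair. cross-encoder/ms-marco-MiniLM-L-6-v2 (about 22M parameters) is practical on CPU for small candidate sets; cross-encoder/ms-marco-MiniLM-L-12-v2 (about 33M parameters) has twice as many transformer layers, so each pair costs roughly twice as much to score and a GPU becomes more useful as the candidate count grows.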
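Whether a model is fast enough on given hardware is easiest to settle by measuring it. The following is a minimal sketch of a timing harness, not part of any library: `measure_latency` and `dummy_score` are hypothetical names, and the stand-in scorer only exists so the example runs without downloading a model. Any callable with the signature `(query, documents) -> scores` can be passed in its place.

```python
import statistics
import time
from typing import Callable, List, Sequence


def measure_latency(
    score_fn: Callable[[str, List[str]], Sequence[float]],
    query: str,
    documents: List[str],
    repeats: int = 5,
    warmup: int = 1,
) -> dict:
    """Time repeated calls to score_fn and summarize latency in milliseconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    # Warm-up calls absorb one-time costs (lazy initialization, caches).
    for _ in range(warmup):
        score_fn(query, documents)

    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        score_fn(query, documents)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "n_documents": len(documents),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


def dummy_score(query: str, documents: List[str]) -> List[float]:
    """Hypothetical stand-in scorer: counts words shared with the query."""
    query_words = set(query.split())
    return [float(len(query_words & set(doc.split()))) for doc in documents]


report = measure_latency(
    dummy_score, "local reranking", ["local models", "cloud apis"] * 50
)
print(report)
```

Running the same harness with each candidate model, at the number of documents you expect to rerank per query, gives a direct comparison on your own hardware.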

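Quantization reduces memory footprint, usually at a small accuracy cost. INT8 stores each weight in one byte, so it roughly halves weight memory relative to FP16 and quarters it relative to FP32. It typically preserves 95%+ of model quality, though the actual loss depends on the model and task, so confirm ranking metrics on your own data before relying on it.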
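The memory effect of precision can be estimated with simple arithmetic. The sketch below assumes weight memory is just parameter count times bytes per parameter; it ignores activations, tokenizer state, and the overhead quantization adds (scale factors, layers kept in higher precision), so treat the numbers as lower bounds. `estimate_weight_memory_mb` is a hypothetical helper name.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0}


def estimate_weight_memory_mb(num_params: int, precision: str) -> float:
    """Approximate weight memory in decimal megabytes (1 MB = 1,000,000 bytes)."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    return num_params * BYTES_PER_PARAM[precision] / 1_000_000


# A 22M-parameter model, as in the smaller MiniLM cross-encoder above.
for precision in ("fp32", "fp16", "int8"):
    mb = estimate_weight_memory_mb(22_000_000, precision)
    print(f"{precision}: {mb:.1f} MB")
# fp32: 88.0 MB
# fp16: 44.0 MB
# int8: 22.0 MB
```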

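from typing import List, Optional

import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)

class LocalCrossEncoder:
    def __init__(self, model_name: str, device: Optional[str] = None, quantize: bool = False):
        self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        on_gpu = self.device.startswith('cuda')

        if quantize and not on_gpu:
            # INT8 loading through bitsandbytes needs a CUDA device.
            raise ValueError('quantize=True requires a CUDA device')

        # Half precision only pays off on GPU; keep float32 on CPU.
        load_kwargs = {'torch_dtype': torch.float16 if on_gpu else torch.float32}
        if quantize:
            # Requires the bitsandbytes and accelerate packages.
            load_kwargs['quantization_config'] = BitsAndBytesConfig(load_in_8bit=True)
            load_kwargs['device_map'] = 'auto'

        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            **load_kwargs
        )

        if not quantize:
            # Quantized models are placed by device_map; others are moved explicitly.
            self.model = self.model.to(self.device)
        self.model.eval()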
    
    def score(self, query: str, documents: List[str]) -> List[float]:
        """Score query-document pairs; higher scores mean more relevant."""
        if not documents:
            return []
        pairs = [(query, doc) for doc in documents]
        inputs = self.tokenizer(
            pairs,
            padding=True,
            truncation=True,
            return_tensors='pt',
            max_length=512
        ).to(self.device)
        
        with torch.no_grad():
            # One raw relevance logit per pair: shape (n, 1) -> (n,)
            scores = self.model(**inputs).logits.squeeze(-1)
        
        return scores.float().cpu().tolist()
    
    def rerank(self, query: str, candidates: List[dict], top_k: int = 10) -> List[dict]:
        """Return the top_k candidates by cross-encoder score, highest first."""
        texts = [c.get('text', c.get('content', '')) for c in candidates]
        scores = self.score(query, texts)
        
        # Copy each candidate so the caller's list and dicts are left unmodified.
        scored = [
            {**candidate, 'rerank_score': float(score)}
            for candidate, score in zip(candidates, scores)
        ]
        
        scored.sort(key=lambda x: x['rerank_score'], reverse=True)
        return scored[:top_k]
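
The score method above sends every candidate through the model in a single forward pass, so peak memory grows with the number of documents. On constrained hardware it can help to score in fixed-size batches. The sketch below is one way to do that around any scoring callable; `batched`, `score_in_batches`, and `recording_score` are hypothetical names, and the batch size of 32 is an arbitrary starting point rather than a recommendation.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_in_batches(
    score_fn: Callable[[str, List[str]], Sequence[float]],
    query: str,
    documents: List[str],
    batch_size: int = 32,
) -> List[float]:
    """Call score_fn once per batch and concatenate scores in document order."""
    scores: List[float] = []
    for batch in batched(documents, batch_size):
        scores.extend(float(s) for s in score_fn(query, batch))
    return scores


# Stand-in scorer (hypothetical) that records the batch sizes it receives.
seen_batch_sizes: List[int] = []


def recording_score(query: str, documents: List[str]) -> List[float]:
    seen_batch_sizes.append(len(documents))
    return [float(len(doc)) for doc in documents]


docs = [f"document {i}" for i in range(70)]
all_scores = score_in_batches(recording_score, "example query", docs, batch_size=32)
print(len(all_scores), seen_batch_sizes)  # 70 [32, 32, 6]
```

With the class above, the same wrapper could be applied as `score_in_batches(encoder.score, query, texts)`, where `encoder` is a `LocalCrossEncoder` instance. Smaller batches lower peak memory at the cost of more forward passes.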

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
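
One way to capture those details consistently is to write them to a small structured record next to the exercise. The sketch below uses only the standard library; `package_version` and `build_run_record` are hypothetical helper names, and the data path, observed output, and notes are placeholders to be replaced with your own values.

```python
import json
import platform
import sys
from importlib import metadata
from typing import Iterable, Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(
    packages: Iterable[str],
    data_path: str,
    observed_output: str,
    notes: str = "",
) -> dict:
    """Collect runtime details that affect reproducibility into one dict."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
        "observed_output": observed_output,
        "notes": notes,
    }


record = build_run_record(
    packages=["torch", "transformers"],
    data_path="data/sample_queries.jsonl",         # placeholder path
    observed_output="<paste observed output here>",  # placeholder
    notes="<backend, model size, memory limits>",    # placeholder
)
print(json.dumps(record, indent=2))
```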