10. Local Cross-Encoder Models
Running cross-encoders locally eliminates API dependencies and enables offline operation. Local models require careful selection based on hardware constraints.
Model size vs. quality: Larger models (more parameters) generally perform better but require more memory. cross-encoder/ms-marco-MiniLM-L-6-v2 (22M parameters) runs on CPU; cross-encoder/ms-marco-MiniLM-L-12-v2.5 (118M parameters) requires GPU for acceptable latency.
Quantization reduces memory footprint at minimal accuracy cost. INT8 quantization typically preserves 95%+ of model quality while cutting memory requirements in half.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
class LocalCrossEncoder:
def __init__(self, model_name: str, device: str = None, quantize: bool = False):
self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
load_kwargs = {'torch_dtype': torch.float16}
if quantize:
load_kwargs['load_in_8bit'] = True
self.model = AutoModelForSequenceClassification.from_pretrained(
model_name,
device_map='auto' if self.device == 'cuda' else None,
**load_kwargs
)
if self.device == 'cpu' and not quantize:
self.model = self.model.to(self.device)
def score(self, query: str, documents: List[str]) -> List[float]:
"""Score query-document pairs."""
pairs = [(query, doc) for doc in documents]
inputs = self.tokenizer(
pairs,
padding=True,
truncation=True,
return_tensors='pt',
max_length=512
).to(self.device)
with torch.no_grad():
scores = self.model(**inputs).logits.squeeze(-1)
return scores.cpu().numpy().tolist()
def rerank(self, query: str, candidates: List[dict], top_k: int = 10) -> List[dict]:
texts = [c.get('text', c.get('content', '')) for c in candidates]
scores = self.score(query, texts)
for candidate, score in zip(candidates, scores):
candidate['rerank_score'] = float(score)
candidates.sort(key=lambda x: x['rerank_score'], reverse=True)
return candidates[:top_k]
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Benchmark a quantized vs. full-precision model on your reranking task. Measure accuracy drop vs. latency improvement.