04. Local Cross-Encoder Models
Running cross-encoders locally gives you full control, avoids API costs, and enables customization. This chapter covers setup, optimization, and practical deployment of local cross-encoders.
Why Local?
Hosting cross-encoders locally removes the dependency on an external API. Latency can drop because there is no network round-trip to an external service, and per-query fees disappear. For high-volume production systems, local inference is often cheaper than API pricing once query volume is high enough to amortize the hardware.
The tradeoff is infrastructure management. You need to provision compute, manage model loading, and handle scaling yourself.
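The cost argument can be made concrete with a rough break-even estimate. The sketch below is only illustrative: the function name and both prices are hypothetical, and it assumes the marginal cost of a local query is negligible next to the fixed monthly infrastructure cost.

```python
def breakeven_queries_per_month(api_cost_per_1k_queries, local_fixed_cost_per_month):
    """Monthly query volume above which local hosting is cheaper than an API.

    Both inputs are hypothetical figures in the same currency. The marginal
    cost of a local query is treated as zero for simplicity.
    """
    if api_cost_per_1k_queries <= 0:
        raise ValueError("API cost must be positive")
    return local_fixed_cost_per_month / api_cost_per_1k_queries * 1000


# Illustrative numbers only: 1.00 per 1,000 reranking requests vs. a 150/month server.
threshold = breakeven_queries_per_month(1.00, 150.0)
print(f"Local hosting breaks even at about {threshold:,.0f} queries per month")
```

Below that volume the API is likely cheaper. Above it, the infrastructure work described here starts to pay for itself.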
Model Selection for Local Deployment
Size matters for local deployment. Smaller models fit in memory easily and respond faster:
| Model | Parameters | Seq Length | Speed (CPU) | Quality |
|---|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 22M | 512 | Fast | Good |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | 33M | 512 | Medium | Better |
| BAAI/bge-reranker-base | 278M | 512 | Medium | Good |
| BAAI/bge-reranker-large | 560M | 512 | Slow | Best |
Start with ms-marco-MiniLM-L-6-v2 for development and prototyping. It's small enough to run in resource-constrained environments and fast enough for real-time reranking.
Setting Up with ONNX Runtime
ONNX Runtime often speeds up inference noticeably compared with plain PyTorch for CPU-bound workloads:
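from onnxruntime import InferenceSession
from transformers import AutoTokenizer
import numpy as np
class LocalReranker:
    def __init__(self, model_path, tokenizer_name, max_seq_length=512):
        self.sess = InferenceSession(model_path, providers=["CPUExecutionProvider"])
        # Assumption: tokenizer_name is the Hugging Face model the ONNX file was exported from
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_seq_length = max_seq_length
        # Feed only the inputs the exported graph declares (some exports omit token_type_ids)
        self.input_names = {inp.name for inp in self.sess.get_inputs()}
    def predict(self, query, documents, batch_size=16):
        scores = []
        for i in range(0, len(documents), batch_size):
            batch_docs = documents[i:i + batch_size]
            # Tokenize (query, document) pairs into NumPy arrays
            encoded = self.tokenizer(
                [query] * len(batch_docs), batch_docs,
                padding=True, truncation=True,
                max_length=self.max_seq_length, return_tensors="np",
            )
            # Typical exports expect int64 inputs
            feed = {k: v.astype(np.int64) for k, v in encoded.items() if k in self.input_names}
            # Assumption: the first output holds one relevance logit per pair
            logits = self.sess.run(None, feed)[0]
            scores.extend(logits.reshape(-1).tolist())
        return scores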
For most use cases, sentence-transformers with PyTorch is simpler. Switch to ONNX when profiling shows that inference is the bottleneck.
GPU Acceleration
If you have GPU capacity available, cross-encoder reranking benefits substantially from GPU acceleration:
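from sentence_transformers import CrossEncoder
# Explicit GPU placement (fails if no CUDA device is available)
reranker = CrossEncoder(
    'cross-encoder/ms-marco-MiniLM-L-12-v2',
    device='cuda'  # explicit device
)
# Each pair is [query, document]; placeholder data for illustration
pairs = [["query example", "document example"]]
# Larger batch sizes with GPU
scores = reranker.predict(pairs, batch_size=64)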
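GPU memory caps the usable batch size, because activation memory grows with both batch size and sequence length. A 12-layer MiniLM model fits comfortably in 2 GB of VRAM with batch_size=32. Larger models and larger batches require more memory.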
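When the right batch size for a given GPU is not known in advance, one option is to start large and back off on out-of-memory errors. The helper below is a hypothetical sketch, not part of sentence-transformers: it assumes `predict_fn` accepts `(pairs, batch_size=...)` as `CrossEncoder.predict` does, and that an out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory", which is how PyTorch typically reports CUDA OOM.

```python
def predict_with_backoff(predict_fn, pairs, batch_size=64, min_batch_size=1):
    """Call predict_fn(pairs, batch_size=...) and halve the batch size on OOM.

    Returns (scores, batch_size_that_worked). predict_fn is any callable with
    that signature, for example the bound method CrossEncoder.predict.
    """
    while True:
        try:
            return predict_fn(pairs, batch_size=batch_size), batch_size
        except RuntimeError as err:
            # Assumption: OOM is reported as a RuntimeError mentioning "out of memory".
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or batch_size <= min_batch_size:
                raise  # unrelated error, or nothing left to shrink
            batch_size = max(min_batch_size, batch_size // 2)
```

For example, `scores, used = predict_with_backoff(reranker.predict, pairs)` would return the scores along with the batch size that finally fit, which can then be fixed in configuration.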
Quantization for Smaller Models
INT8 quantization reduces memory footprint with acceptable quality loss:
from optimum.quanto import quantize, freeze, qint8
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    'cross-encoder/ms-marco-MiniLM-L-12-v2'
)
# Mark weights for INT8 quantization
quantize(model, weights=qint8)
# Convert the float weights to their quantized form
freeze(model)
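Quantized models use less memory and, once saved in quantized form, load faster. Ranking quality typically drops 2-5% on standard benchmarks, which is acceptable for most RAG applications, but the exact loss depends on the model and data, so measure it on your own queries.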
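The memory saving can be estimated from the parameter count alone. This is a rough, weights-only sketch: the helper name is hypothetical, activations and runtime overhead are ignored, and it assumes every parameter is stored at the stated precision. In practice some layers, such as embeddings, may stay in higher precision, so treat the INT8 figure as a best case.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mib(num_params, dtype):
    """Approximate memory for the weights alone, in MiB (activations excluded)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


# 22M parameters, the size listed above for ms-marco-MiniLM-L-6-v2
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_mib(22_000_000, dtype):.0f} MiB")
```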
Loading Models with Sentence-Transformers
The simplest local setup uses sentence-transformers with automatic device selection:
from sentence_transformers import CrossEncoder
import torch
# Automatic GPU/CPU detection
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
reranker = CrossEncoder(
    'BAAI/bge-reranker-base',
    device=device,
    max_length=512
)
# Verify it loads correctly
test_score = reranker.predict([
    ["query example", "document example"]
])
print(f"Test score: {test_score}")
The first time the model is constructed, the weights are downloaded if they are not already cached. Subsequent runs load the cached version.
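Once a model is loaded, reranking comes down to scoring each (query, document) pair and sorting by score. The `rerank` helper below is a hypothetical sketch, not a sentence-transformers function. It only assumes the `reranker` object exposes `predict(pairs, batch_size=...)` returning one score per pair, as `CrossEncoder` does.

```python
def rerank(reranker, query, documents, top_k=5, batch_size=32):
    """Return the top_k (document, score) tuples, highest score first."""
    if not documents:
        return []
    # One [query, document] pair per candidate
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs, batch_size=batch_size)
    ranked = sorted(
        zip(documents, scores),
        key=lambda item: float(item[1]),
        reverse=True,
    )
    # Convert scores to plain floats so results are easy to log or serialize
    return [(doc, float(score)) for doc, score in ranked[:top_k]]
```

With the `reranker` loaded above, `rerank(reranker, "query example", candidate_docs, top_k=3)` would return the three best-scoring candidates, where `candidate_docs` is whatever list your first-stage retriever produced.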
Install sentence-transformers, download a cross-encoder model, and implement reranking for a set of 50 candidate documents. Measure wall-clock time for batch sizes of 8, 16, 32, and 64. Identify the optimal batch size for your hardware.
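A possible starting point for the timing part of this exercise is sketched below. The function names are hypothetical, and `predict_fn` is assumed to accept `(pairs, batch_size=...)` the way `CrossEncoder.predict` does. Taking the best of several runs reduces noise from other processes.

```python
import time


def time_batch_sizes(predict_fn, pairs, batch_sizes=(8, 16, 32, 64), repeats=3):
    """Return {batch_size: best wall-clock seconds over `repeats` runs}."""
    # Warm-up call so one-time setup costs are not attributed to the first batch size
    predict_fn(pairs[:1], batch_size=1)
    timings = {}
    for bs in batch_sizes:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            predict_fn(pairs, batch_size=bs)
            runs.append(time.perf_counter() - start)
        timings[bs] = min(runs)
    return timings


def fastest_batch_size(timings):
    """Batch size with the lowest measured time."""
    return min(timings, key=timings.get)
```

Calling `time_batch_sizes(reranker.predict, pairs)` with 50 query-document pairs gives the four measurements the exercise asks for. Results are specific to the hardware and model they were measured on.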