Local Cross-Encoder Models — RAG Systems: Part 2 (Chapter 4)

Running cross-encoders locally gives you full control, avoids API costs, and enables customization. This chapter covers setup, optimization, and practical deployment of local cross-encoders.

Why Local?

Hosting cross-encoders locally removes the dependency on an external API. Latency can drop because there is no network round-trip to an external service, and per-query fees vanish. At sufficient query volume, local inference is often cheaper than API pricing.

The tradeoff is infrastructure management. You need to provision compute, manage model loading, and handle scaling yourself.
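The cost argument can be made concrete with a rough break-even estimate. The sketch below is illustrative only: the function name and the dollar figures are hypothetical, and it ignores engineering time and other operational overhead.

```python
def breakeven_queries_per_month(monthly_infra_cost, api_cost_per_1k_queries):
    """Monthly query volume above which local hosting is cheaper than the API."""
    if api_cost_per_1k_queries <= 0:
        raise ValueError("API cost must be positive")
    return monthly_infra_cost / api_cost_per_1k_queries * 1000


# Hypothetical figures, for illustration only
volume = breakeven_queries_per_month(
    monthly_infra_cost=300.0,
    api_cost_per_1k_queries=2.0,
)
print(f"Break-even at about {volume:,.0f} queries per month")
```

Below that volume the API is likely the cheaper option; above it, the fixed infrastructure cost starts to pay for itself.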

Model Selection for Local Deployment

Size matters for local deployment. Smaller models fit in memory easily and respond faster:

Model	Parameters	Seq Length	Speed (CPU)	Quality
cross-encoder/ms-marco-MiniLM-L-6-v2	22M	512	Fast	Good
cross-encoder/ms-marco-MiniLM-L-12-v2	33M	512	Medium	Better
BAAI/bge-reranker-base	278M	512	Medium	Good
BAAI/bge-reranker-large	560M	512	Slow	Best

Start with ms-marco-MiniLM-L-6-v2 for development and prototyping. It's small enough to run in resource-constrained environments and fast enough for real-time reranking.
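The speed labels in the table are relative, so it helps to measure latency on your own hardware before committing to a model. Here is a minimal timing sketch; `measure_latency` is a hypothetical helper, and `predict_fn` can be any callable that takes a list of pairs, for example the `predict` method of a loaded CrossEncoder.

```python
import statistics
import time


def measure_latency(predict_fn, pairs, repeats=5, warmup=1):
    """Return median and max wall-clock seconds for predict_fn(pairs)."""
    for _ in range(warmup):
        predict_fn(pairs)  # warm-up runs are excluded from the timings
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(pairs)
        timings.append(time.perf_counter() - start)
    return {"median_s": statistics.median(timings), "max_s": max(timings)}
```

Run it with a realistic number of candidate documents per query (for example, the top 50 from your retriever), since latency scales with the number of pairs scored.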

Setting Up with ONNX Runtime

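ONNX Runtime can speed up CPU-bound inference considerably compared with plain PyTorch. The example assumes the model has already been exported to ONNX and produces a single relevance logit per pair:

import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

class LocalReranker:
    def __init__(self, model_path, tokenizer_name, max_seq_length=512):
        self.sess = InferenceSession(model_path, providers=["CPUExecutionProvider"])
        # The tokenizer must match the model that was exported to ONNX
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_seq_length = max_seq_length
        # Feed only the inputs the exported graph declares
        self.input_names = {inp.name for inp in self.sess.get_inputs()}

    def predict(self, query, documents, batch_size=16):
        scores = []
        for i in range(0, len(documents), batch_size):
            batch_docs = documents[i:i + batch_size]
            # Encode each (query, document) pair as one sequence
            enc = self.tokenizer(
                [query] * len(batch_docs), batch_docs,
                padding=True, truncation=True,
                max_length=self.max_seq_length, return_tensors="np",
            )
            feed = {k: v.astype(np.int64) for k, v in enc.items() if k in self.input_names}
            logits = self.sess.run(None, feed)[0]  # assumed shape: (batch, 1)
            scores.extend(logits[:, 0].tolist())
        return scores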

For most use cases, sentence-transformers with PyTorch is simpler. Switch to ONNX only when profiling shows that inference is the bottleneck.

GPU Acceleration

If you have GPU capacity available, cross-encoder reranking benefits substantially from GPU acceleration:

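from sentence_transformers import CrossEncoder

# Explicit GPU placement (requires a CUDA-capable GPU)
reranker = CrossEncoder(
    'cross-encoder/ms-marco-MiniLM-L-12-v2',
    device='cuda'
)

# Each pair is [query, document]; example data for illustration
query = "how do cross-encoders work"
documents = [
    "Cross-encoders score a query and a document jointly.",
    "An unrelated passage about something else.",
]
pairs = [[query, doc] for doc in documents]

# Larger batch sizes are practical on a GPU
scores = reranker.predict(pairs, batch_size=64)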

GPU memory limits batch size. A 12-layer MiniLM model fits comfortably in 2GB of VRAM with batch_size=32. Larger models, larger batches, and longer sequences need more memory, so reduce batch_size if you hit out-of-memory errors.
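One way to cope with the memory limit is to retry with a smaller batch when the GPU runs out of memory. The sketch below is a hypothetical helper, not part of sentence-transformers. It assumes `predict_fn` accepts a `batch_size` keyword (as `CrossEncoder.predict` does) and that out-of-memory failures surface as a `RuntimeError` whose message contains "out of memory", which is how PyTorch typically reports CUDA OOM.

```python
def predict_with_backoff(predict_fn, pairs, batch_size=64, min_batch_size=1):
    """Call predict_fn(pairs, batch_size=...), halving batch_size on OOM errors.

    Returns (scores, batch_size_that_worked).
    """
    while True:
        try:
            return predict_fn(pairs, batch_size=batch_size), batch_size
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            # Re-raise anything that is not OOM, or OOM at the smallest batch size
            if not is_oom or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

The batch size that succeeds is returned alongside the scores so it can be reused for later calls instead of rediscovering it each time.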

Quantization for Smaller Models

INT8 quantization reduces memory footprint with acceptable quality loss:

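from optimum.quanto import freeze, qint8, quantize
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'cross-encoder/ms-marco-MiniLM-L-12-v2'
)
# Mark the weights for INT8 quantization...
quantize(model, weights=qint8)
# ...then freeze to replace the float weights with their quantized versions
freeze(model)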

Quantized models use less memory and, once saved in quantized form, load faster. Ranking quality typically drops 2-5% on standard benchmarks, which is acceptable for most RAG applications, but verify the impact on your own data.
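A quick way to check the quality impact on your own data is to score the same candidates with the full-precision and quantized models and compare the top results. The sketch below uses top-k overlap as a simple proxy; `top_k_overlap` is a hypothetical helper and not a replacement for a proper benchmark metric.

```python
def top_k_overlap(baseline_scores, candidate_scores, k=10):
    """Fraction of the baseline top-k documents that also appear in the candidate top-k."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("score lists must cover the same documents")
    k = min(k, len(baseline_scores))
    if k == 0:
        return 1.0

    def top_k(scores):
        # Indices of the k highest-scoring documents
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])

    return len(top_k(baseline_scores) & top_k(candidate_scores)) / k
```

An overlap close to 1.0 across a sample of real queries suggests the quantized model selects nearly the same documents as the original.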

Loading Models with Sentence-Transformers

The simplest local setup uses sentence-transformers with automatic device selection:

from sentence_transformers import CrossEncoder
import torch

# Automatic GPU/CPU detection
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

reranker = CrossEncoder(
    'BAAI/bge-reranker-base',
    device=device,
    max_length=512
)

# Verify it loads correctly
test_score = reranker.predict([
    ["query example", "document example"]
])
print(f"Test score: {test_score}")

The model is downloaded the first time it is loaded and then cached locally. Later loads reuse the cached copy.
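Once the model loads, the scores still need to be turned into a ranking. Here is a minimal sketch of that step; `rerank` is a hypothetical helper, and `predict_fn` is assumed to take a list of [query, document] pairs and return one score per pair, as `reranker.predict` does above.

```python
def rerank(query, documents, predict_fn, top_k=None):
    """Score each (query, document) pair and return (document, score) tuples, best first."""
    if not documents:
        return []
    pairs = [[query, doc] for doc in documents]
    # float() also converts NumPy scalars returned by the model
    scores = [float(s) for s in predict_fn(pairs)]
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k] if top_k is not None else ranked
```

With the reranker loaded as above, a call would look like `rerank("query example", candidate_docs, reranker.predict, top_k=5)`, where `candidate_docs` is the list returned by your first-stage retriever.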