RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
RAG Systems: Part 2

04. Local Cross-Encoder Models

Chapter 4 of 22 · 25 min
KEY INSIGHT

Local cross-encoders give full control and eliminate API costs, but require careful model selection and optimization for production throughput.

Running cross-encoders locally gives you full control, avoids API costs, and enables customization. This chapter covers setup, optimization, and practical deployment of local cross-encoders.

Why Local?

Hosting cross-encoders locally removes the dependency on an external API. Latency often drops for simple queries because there is no network round trip to an external service, and per-query fees disappear. For high-volume production systems, local inference is often cheaper than API pricing once query volume is high enough to cover the fixed infrastructure cost.

The tradeoff is infrastructure management. You need to provision compute, manage model loading, and handle scaling yourself.
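To make the "sufficient scale" point concrete, here is a minimal break-even sketch. The function name and both cost figures are hypothetical placeholders, not real prices, and the local marginal cost per query is assumed to be zero for simplicity.

```python
def break_even_queries_per_month(monthly_infra_cost: float,
                                 api_cost_per_query: float) -> float:
    """Return the monthly query volume above which local hosting is cheaper.

    Both inputs are figures you supply. Local per-query marginal cost is
    treated as zero, which is a simplifying assumption.
    """
    if api_cost_per_query <= 0:
        raise ValueError("api_cost_per_query must be positive")
    return monthly_infra_cost / api_cost_per_query


# Illustrative numbers only, not real prices.
volume = break_even_queries_per_month(monthly_infra_cost=300.0,
                                      api_cost_per_query=0.002)
print(f"Local hosting breaks even at about {volume:,.0f} queries/month")
```

Below that volume the API is cheaper under these assumptions; above it, the fixed infrastructure cost is amortized over enough queries to win.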

Model Selection for Local Deployment

Size matters for local deployment. Smaller models fit in memory easily and respond faster:

Model                                    Parameters  Max Seq Length  Speed (CPU)  Quality
cross-encoder/ms-marco-MiniLM-L-6-v2     22M         512             Fast         Good
cross-encoder/ms-marco-MiniLM-L-12-v2    33M         512             Medium       Better
BAAI/bge-reranker-base                   278M        512             Medium       Good
BAAI/bge-reranker-large                  560M        512             Slow         Best

Start with ms-marco-MiniLM-L-6-v2 for development and prototyping. It's small enough to run in resource-constrained environments and fast enough for real-time reranking.

Setting Up with ONNX Runtime

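ONNX Runtime often speeds up inference noticeably compared with plain PyTorch for CPU-bound workloads. The class below assumes you have already exported the model to an ONNX file and that it emits one logit per query-document pair: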

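import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

class LocalReranker:
    def __init__(self, model_path, tokenizer_name, max_seq_length=512):
        # model_path: exported ONNX file; tokenizer_name: the matching Hugging Face tokenizer
        self.sess = InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_seq_length = max_seq_length
        # Only feed the inputs the exported graph actually declares
        self.input_names = {inp.name for inp in self.sess.get_inputs()}

    def predict(self, query, documents, batch_size=16):
        scores = []
        for i in range(0, len(documents), batch_size):
            batch_docs = documents[i:i + batch_size]
            # Tokenize (query, document) pairs into NumPy arrays
            enc = self.tokenizer(
                [query] * len(batch_docs), batch_docs,
                padding=True, truncation=True,
                max_length=self.max_seq_length, return_tensors="np",
            )
            feed = {k: v.astype(np.int64) for k, v in enc.items() if k in self.input_names}
            # Assumes the first output holds one logit per pair
            logits = self.sess.run(None, feed)[0]
            scores.extend(logits.reshape(-1).tolist())
        return scores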

For most use cases, sentence-transformers with PyTorch is simpler. Switch to ONNX when profiling shows that reranker inference is the bottleneck.
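One way to check whether reranking really is the bottleneck is to time each pipeline stage and compare their shares. The sketch below uses only the standard library; `StageTimer` and the stage names are illustrative, and the `time.sleep` calls stand in for your actual retrieval and reranking steps.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises
            self.totals[name] += time.perf_counter() - start

    def share(self, name):
        """Fraction of all measured time spent in the given stage."""
        total = sum(self.totals.values())
        return self.totals[name] / total if total else 0.0


timer = StageTimer()
with timer.stage("retrieve"):
    time.sleep(0.01)  # stand-in for your retrieval step
with timer.stage("rerank"):
    time.sleep(0.03)  # stand-in for reranker.predict(...)
print(f"rerank share of latency: {timer.share('rerank'):.0%}")
```

If the rerank share is small, an ONNX export is unlikely to change end-to-end latency much.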

GPU Acceleration

If you have GPU capacity available, cross-encoder reranking benefits substantially from GPU acceleration:

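from sentence_transformers import CrossEncoder

# Explicit GPU placement
reranker = CrossEncoder(
    'cross-encoder/ms-marco-MiniLM-L-12-v2',
    device='cuda'  # explicit device; requires a CUDA-capable GPU
)

# Each pair is (query, document); the values here are examples only
pairs = [
    ("what is a cross-encoder?", "A cross-encoder scores a query and a document jointly."),
    ("what is a cross-encoder?", "ONNX Runtime is an inference engine."),
]

# Larger batch sizes with GPU
scores = reranker.predict(pairs, batch_size=64)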

GPU memory limits batch sizes. A 12-layer MiniLM model fits comfortably in 2 GB of VRAM with batch_size=32. Larger models and longer inputs require more memory, so reduce the batch size if you hit out-of-memory errors.
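Because the largest batch that fits depends on the model and the GPU, one defensive pattern is to halve the batch size whenever an out-of-memory error occurs. The sketch below is a generic version: `predict_with_backoff`, `fake_predict`, and the `oom_error` parameter are hypothetical names. With PyTorch on CUDA you would likely pass `reranker.predict` and `torch.cuda.OutOfMemoryError` (available in recent PyTorch releases) instead of the stand-ins used here.

```python
def predict_with_backoff(predict_fn, pairs, batch_size=64, min_batch_size=1,
                         oom_error=MemoryError):
    """Call predict_fn(pairs, batch_size=...) and halve batch_size on OOM.

    Returns (scores, batch_size_that_worked).
    """
    while True:
        try:
            return predict_fn(pairs, batch_size=batch_size), batch_size
        except oom_error:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)


# Stand-in scorer that "runs out of memory" above batch size 16.
def fake_predict(pairs, batch_size):
    if batch_size > 16:
        raise MemoryError("simulated out-of-memory")
    return [0.0] * len(pairs)


scores, used = predict_with_backoff(fake_predict, [("q", "d")] * 50, batch_size=64)
print(f"batch size that fit: {used}")
```

Logging the batch size that finally worked lets you start from that value on the next run instead of rediscovering it.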

Quantization for Smaller Models

INT8 quantization reduces memory footprint with acceptable quality loss:

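from optimum.quanto import quantize, freeze, qint8
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'cross-encoder/ms-marco-MiniLM-L-12-v2'
)
# Mark the weights for INT8 quantization
quantize(model, weights=qint8)
# Replace the float weights with their quantized versions
freeze(model)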

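Quantized models use less memory and, once saved in quantized form, can load faster. Response quality typically drops 2-5% on standard benchmarks, which is acceptable for most RAG applications, but confirm the loss on your own evaluation set before deploying.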
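The memory saving follows from bytes per weight: 4 for FP32, 2 for FP16, 1 for INT8. The sketch below estimates weight memory only; it ignores activations, quantization scales, and framework overhead. The 100M parameter count is an illustrative round number, not a figure for any specific model, and `weight_memory_mb` is a hypothetical helper.

```python
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_parameters: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in megabytes (10**6 bytes)."""
    return num_parameters * BYTES_PER_WEIGHT[dtype] / 1e6


params = 100_000_000  # illustrative; substitute your model's parameter count
for dtype in ("float32", "float16", "int8"):
    print(f"{dtype:>8}: {weight_memory_mb(params, dtype):.0f} MB")
```

INT8 weights take about a quarter of the FP32 footprint, which is where most of the saving from weight-only quantization comes from.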

Loading Models with Sentence-Transformers

The simplest local setup uses sentence-transformers with automatic device selection:

from sentence_transformers import CrossEncoder
import torch

# Automatic GPU/CPU detection
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

reranker = CrossEncoder(
    'BAAI/bge-reranker-base',
    device=device,
    max_length=512
)

# Verify it loads correctly
test_score = reranker.predict([
    ["query example", "document example"]
])
print(f"Test score: {test_score}")

Constructing the CrossEncoder for the first time downloads the model from the Hugging Face Hub if it is not already cached. Later runs load it from the local cache.
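Once the model returns scores, reranking is a sort over (document, score) pairs. A minimal sketch, assuming `scores` holds one value per document in the same order as the documents; the `rerank` helper and the score values are placeholders rather than real model output.

```python
def rerank(documents, scores, top_k=5):
    """Return the top_k (document, score) pairs, highest score first."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


docs = ["doc a", "doc b", "doc c"]
scores = [0.12, 0.87, 0.45]  # placeholder values standing in for reranker.predict output
for doc, score in rerank(docs, scores, top_k=2):
    print(f"{score:.2f}  {doc}")
```

In practice you would build the input as `[(query, doc) for doc in docs]`, pass it to `reranker.predict`, and feed the resulting scores to this function.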

EXERCISE

Install sentence-transformers, download a cross-encoder model, and implement reranking for a set of 50 candidate documents. Measure wall-clock time for batch sizes of 8, 16, 32, and 64. Identify the optimal batch size for your hardware.
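A possible starting point for the timing part of the exercise is sketched below. `sweep_batch_sizes` and `dummy_predict` are hypothetical names; replace `dummy_predict` with your `reranker.predict`, and consider one untimed warm-up call first so model loading does not skew the first measurement.

```python
import time


def sweep_batch_sizes(predict_fn, pairs, batch_sizes=(8, 16, 32, 64), repeats=3):
    """Return {batch_size: best wall-clock seconds over `repeats` runs}."""
    results = {}
    for bs in batch_sizes:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            predict_fn(pairs, batch_size=bs)
            timings.append(time.perf_counter() - start)
        results[bs] = min(timings)  # best-of-N reduces noise from other processes
    return results


def dummy_predict(pairs, batch_size):
    # Stand-in for reranker.predict; does no real work
    return [0.0 for _ in pairs]


pairs = [("example query", f"candidate document {i}") for i in range(50)]
timings = sweep_batch_sizes(dummy_predict, pairs)
best = min(timings, key=timings.get)
print(f"fastest batch size: {best}")
```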
