RUNLOCALAI

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP·Fredoline Eruo
Document Reranking

Cross-encoder reranking of retrieved documents for relevance. BGE Reranker V2 M3 + Cohere Rerank are the leaders.

Capability notes

Reranking is the second stage of two-stage retrieval: a fast first-stage retriever (embedding + vector search) returns 50–200 candidates with high recall but moderate precision, then a reranker cross-encodes each (query, document) pair and assigns a relevance score, reordering the candidates for precision. [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) (BAAI, 568M params, 8192-token context, MIT license) is the canonical open-weight reranker.

The accuracy gain is substantial. [BGE-M3](/models/bge-m3) first-stage dense search scores NDCG@10 = 65.8. Adding BGE Reranker V2 M3 on the top-100 raises NDCG@10 to 71.4 — an 8.5% relative gain, moving retrieval from "good enough for casual search" to "production-grade for legal/medical/financial retrieval." The reranker catches false positives that the embedder's cosine similarity misranks — documents topically adjacent but irrelevant to the query.

When reranking matters: (1) precision-sensitive applications (legal — missing a case is malpractice risk; medical — missing a contraindication study is liability), (2) high-document-count retrieval (1M+ documents, where cosine similarity clusters thousands of documents around common topics), (3) complex queries where embedders struggle with multi-constraint semantics.

When reranking doesn't help: simple keyword queries where first-stage already returns the right document at rank 1, corpora under 1,000 documents, or latency-critical applications where the reranker's 10–30ms per candidate is too slow.

The reranker-embedder relationship: a reranker trained on different data than the embedder can disagree in ways that degrade retrieval. BGE Reranker V2 M3 is trained to complement [BGE-M3](/models/bge-m3) specifically — using them together is the designed path. Mixing OpenAI embeddings with BGE Reranker works, but creates edge cases where the reranker disagrees with first-stage results inconsistently.
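The two-stage shape above can be sketched in a few lines. This is a toy, not the real models: the embeddings are hand-written 2-D vectors, and a stand-in `relevance` field plays the role of the cross-encoder score (in practice that score comes from a model like BGE Reranker V2 M3). It shows how a document that wins on cosine similarity can lose on joint relevance.

```python
import math

def cosine(a, b):
    # first-stage score: cosine similarity between embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def two_stage_search(query_vec, corpus, cross_encoder_score, first_k=3, final_k=2):
    # Stage 1: cheap vector recall — keep top first_k by cosine similarity.
    stage1 = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:first_k]
    # Stage 2: expensive precision — rescore each surviving candidate jointly.
    return sorted(stage1, key=cross_encoder_score, reverse=True)[:final_k]

# Toy corpus: doc B is topically adjacent (high cosine) but irrelevant;
# the stand-in cross-encoder catches what cosine misranks.
corpus = [
    {"id": "A", "vec": [1.0, 0.1], "relevance": 0.9},
    {"id": "B", "vec": [1.0, 0.0], "relevance": 0.1},  # stage-1 false positive
    {"id": "C", "vec": [0.8, 0.4], "relevance": 0.7},
]
hits = two_stage_search([1.0, 0.05], corpus, lambda d: d["relevance"])
print([d["id"] for d in hits])  # ['A', 'C'] — reranker demotes B
```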

If you just want to try this

Lowest-friction path to a working setup.

Deploy [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) with one Docker command via [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference):

```bash
docker run -p 8080:80 --gpus all ghcr.io/huggingface/text-embeddings-inference:latest --model-id BAAI/bge-reranker-v2-m3
```

Send query + candidates to `/rerank`:

```bash
curl http://localhost:8080/rerank -X POST -H "Content-Type: application/json" \
  -d '{"query": "What is the warranty period?", "texts": ["2-year warranty covers...", "Shipping takes 3-5 days...", "Returns within 30 days..."]}'
```

Returns scored, sorted results:

```json
[{"index": 0, "score": 0.92}, {"index": 2, "score": 0.45}, {"index": 1, "score": 0.12}]
```

Hardware: 568M params (~1.1 GB VRAM FP16). Any GPU with 4 GB+ VRAM ([RTX 3060 12GB](/hardware/rtx-3060-12gb), [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb)). On CPU: 5–15 rerankings/sec — viable for low-volume use with <50 candidates per query. For a complete Python RAG pipeline with reranking:

```python
import httpx

async def rerank(query, docs):
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8080/rerank",
                                 json={"query": query, "texts": docs})
        return resp.json()

candidates = vector_db.search(query_embed, top_k=100)  # after first-stage retrieval
reranked = await rerank(query, [c.text for c in candidates])
top5 = sorted(reranked, key=lambda x: x["score"], reverse=True)[:5]
```

The reranker adds ~10–30ms per document. Top-100 candidates: 1–3 seconds on CPU or 100–300ms on GPU (batch inference). Acceptable when the quality improvement justifies the latency.

For production deployment

Operator-grade recommendation.

Production two-stage retrieval: fast first-stage retriever + reranker for precision, with a latency budget constraining reranking depth.

**Pipeline architecture.** Stage 1: embed query via [BGE-M3](/models/bge-m3) on TEI (5–20ms) → HNSW vector search for top-100 (1–10ms). Stage 2: rerank top-100 via [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) on TEI (100–300ms GPU, batch=100). Total: 150–400ms. NDCG@10 improves from 65.8 to 71.4 — an 8.5% gain.

**Latency budget.** Interactive (<200ms total): rerank top-20 (20 × 10ms = 200ms). Batch (<2s): rerank top-200. Index-time enrichment (no constraint): rerank top-1,000 per document, store the reranked order. Cap reranking at 50 for interactive, 200 for batch unless benchmarks show quality improvement beyond those depths.

**Batch reranking.** TEI supports batch reranking — multiple (query, docs) pairs in one request for maximum GPU utilization. Queue individual requests for 10–50ms to batch into a single inference pass. 10 queries × 100 candidates = 1,000 pairs processed in ~200–400ms on an [RTX 4090](/hardware/rtx-4090) — effective throughput of 2,500–5,000 pairs/sec.

**Score calibration.** Reranker scores are not calibrated probabilities — 0.8 doesn't mean "80% chance of relevance." Scores are relative within a batch. For applications needing consistent thresholds, calibrate against a labeled dataset: run the reranker on a labeled query-document corpus, map raw scores to precision-at-k curves, and set per-application thresholds based on observed precision.

**Two-tier reranking for scale (100M+ documents).** Stage 1: dense embedding → top-1,000. Stage 2: lightweight lexical (BM25 via [BGE-M3](/models/bge-m3) sparse embeddings) → top-100. Stage 3: cross-encoder reranker → top-10. NDCG@10 of ~73 (1.6 points above two-stage) for high-scale retrieval.

**When NOT to rerank.** Skip when: corpus under 10,000 docs (first-stage precise enough), queries are keyword/exact-match, latency budget under 50ms, or first-stage quality meets needs (general FAQ, internal wiki). Measure first-stage NDCG before engineering a reranking stage.
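The latency-budget arithmetic above reduces to a one-line helper. `rerank_depth` is a hypothetical name, and it assumes sequential per-pair cost — a conservative floor, since batched GPU inference does much better:

```python
def rerank_depth(budget_ms, per_pair_ms, hard_cap):
    """How many candidates fit inside a latency budget (hypothetical helper).

    Assumes sequential per-pair reranking cost; treat the result as a
    conservative floor — TEI batch inference processes pairs in parallel."""
    if per_pair_ms <= 0:
        raise ValueError("per_pair_ms must be positive")
    return max(0, min(hard_cap, int(budget_ms // per_pair_ms)))

# Interactive budget from the text: 200 ms at ~10 ms/pair -> top-20,
# with the interactive hard cap of 50.
print(rerank_depth(200, 10, 50))    # 20
# Batch budget: 2,000 ms at 10 ms/pair hits the batch cap of 200.
print(rerank_depth(2000, 10, 200))  # 200
```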

What breaks

Failure modes operators see in the wild.

**Reranker latency bottleneck.** Symptom: adding the reranker increases retrieval from 20ms to 500ms+. Cause: reranking 200 candidates synchronously, one at a time (200 × 10ms = 2,000ms). The reranker's cross-encoder processes each pair through the full transformer — no benefit from pre-computed embeddings. Mitigation: use TEI batch inference — send all candidates in one request, processed in parallel on GPU. That caps 200-document reranking at 100–300ms. Cap candidate count at 50 for interactive use. For extreme latency-sensitivity, a lightweight bi-encoder reranker (DistilBERT-based) scores at 1–2ms per document — 80% of the quality for 20% of the latency.

**Score calibration drift.** Symptom: the same query-document pair scores 0.85 today, 0.72 tomorrow after a model update or candidate pool change. Cause: reranker scores are relative to the batch — adding highly-relevant documents pushes down moderate scores. Mitigation: never rely on absolute scores for thresholds. Use rank ordering (top-k). If thresholds are required, calibrate against a fixed reference set scored periodically.

**Cross-encoder context window truncation.** Symptom: the reranker assigns low scores to clearly relevant documents because the key passage was truncated at the 8192-token limit. Cause: query + document combined exceed 8192 tokens — the document is truncated from the end. Mitigation: truncate documents before sending, not after. For long documents, chunk into 3,000-token segments, rerank each, and use the max segment score as the document score. "Max-pooling over chunks" preserves the ability to find relevant passages anywhere.

**Reranker-embedder model mismatch.** Symptom: reranker and embedder disagree — the embedder ranks A above B, the reranker flips them, and final quality degrades. Cause: embedders optimize semantic similarity; rerankers optimize passage relevance — related but distinct objectives. Mitigation: use matched pairs — BGE-M3 + BGE Reranker V2 M3 is the designed pair. If using a different embedder, evaluate the agreement rate (reranker top-10 in embedder top-50). Below 60% indicates mismatch.

**Reranking irrelevant candidates.** Symptom: the reranker assigns moderate scores (0.6–0.8) to completely irrelevant documents — the cross-encoder is optimized for relative ordering within a batch, not absolute relevance detection. If the top-100 are all irrelevant, the reranker still assigns "best" to the least-worst — it cannot detect that all candidates are wrong. Mitigation: set a minimum first-stage score threshold — if BGE-M3 cosine similarity is below 0.4 (1024-dim), the document is too distant regardless of reranker output.
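The "max-pooling over chunks" mitigation can be sketched directly. `chunk` and `doc_score` are hypothetical helpers; the whitespace tokenizer and the stub scorer stand in for the model's real tokenizer and the cross-encoder:

```python
def chunk(text, max_tokens=3000):
    # Crude whitespace-token chunking. A real pipeline would use the
    # model's tokenizer so query + chunk stay under the 8192-token limit.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)] or [""]

def doc_score(query, doc_text, score_pair, max_tokens=3000):
    """Max-pooling over chunks: a document scores as its best chunk,
    so a relevant passage anywhere in a long document survives truncation."""
    return max(score_pair(query, c) for c in chunk(doc_text, max_tokens))

# Stub scorer standing in for the cross-encoder: rewards chunks
# mentioning "warranty". The relevant passage sits past token 8000 —
# a single truncated pass would never see it.
stub = lambda q, c: 0.9 if "warranty" in c else 0.1
long_doc = "shipping details " * 4000 + " the warranty lasts two years"
print(doc_score("warranty period", long_doc, stub))  # 0.9
```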

Hardware guidance

Reranker hardware requirements are the lowest of any local-AI workload after embeddings. BGE Reranker V2 M3 (568M params, ~1.1 GB FP16) runs on CPU at production latency for low volume.

**CPU-only ($0).** Throughput on a modern desktop: 5–15 pairs/sec sequential, 20–40 pairs/sec with TEI batching. For 10 queries/hour with top-50 reranking each, CPU is fine. 100 queries/hour with top-50: ~20% CPU utilization. CPU is viable for batch/preprocessing work with flexible latency. [Apple M4 Pro](/hardware/apple-m4-pro) via CoreML: 8–12 pairs/sec.

**Entry GPU ($300–600).** [RTX 3060 12GB](/hardware/rtx-3060-12gb): 50–100 pairs/sec at batch=32. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 80–150 pairs/sec. Top-50 reranking: 0.3–1 second per query on a $300 GPU — acceptable for interactive search.

**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090): 200–400 pairs/sec — top-100 in 250–500ms. [RTX 5080](/hardware/rtx-5080): 150–300 pairs/sec. At this tier, reranking is imperceptible — latency is dominated by network, not inference.

**Enterprise ($8,000+).** Overkill — a 1.1 GB model on an enterprise GPU leaves 95%+ of VRAM idle. [L40S](/hardware/nvidia-l40s) at 48 GB: ~500 pairs/sec at 2.3% VRAM utilization. Co-deploy the reranker on the same GPU as the embedding or generation model — the reranker's tiny footprint cohabitates without contention.

**Co-deployment.** Typical RAG server: BGE-M3 (1.1 GB) + BGE Reranker (1.1 GB) + 7B generation (4–8 GB) on a single [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) — 6–10 GB used, 6–10 GB headroom. For 70B RAG, generation dominates VRAM, but embedder + reranker add a negligible 2.2 GB. Reranking infrastructure cost is near-zero when co-deployed.

**Latency scaling.** BGE Reranker V2 M3 per pair at batch=1: CPU 50–200ms, GPU 5–15ms. At batch=50: CPU sequential 2.5–10 seconds; GPU parallel 50–150ms. The GPU advantage is nonlinear — a single GPU handles 50–100× more throughput than a single CPU core.

Runtime guidance

**Text Embeddings Inference (TEI) with reranker support — the only production reranker serving path.** [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) serves both embeddings and reranking from one Docker container by switching MODEL_ID. For reranking: `MODEL_ID=BAAI/bge-reranker-v2-m3`. The `/rerank` endpoint accepts `{"query": "...", "texts": [...]}` and returns scored results. Batch inference processes all pairs in parallel — 100 candidates in one request vs 100 sequential requests = 2–3× throughput. TEI is the only production-grade open-weight reranker serving option. Ollama does not support reranking. llama.cpp does not serve rerankers. sentence-transformers can load the model (`CrossEncoder('BAAI/bge-reranker-v2-m3')`) for programmatic use but provides no serving layer — you build the HTTP API yourself.

**Vector database integration.** [Qdrant](/tools/qdrant) supports native reranking — pass the reranker endpoint URL in config, and Qdrant automatically reranks vector search results. Simplest path: deploy Qdrant for search, point it at the TEI reranker, and Qdrant handles two-stage retrieval internally. [pgvector](/tools/pgvector): no native reranker integration. Implement it in app code: Postgres `SELECT ... ORDER BY embedding <=> $1 LIMIT 100` → TEI `/rerank` → reorder. Straightforward, but requires app-layer orchestration. [Weaviate](/tools/weaviate): reranker modules via the module system — configure `reranker-transformers` pointing at TEI in the config YAML. Exposes reranked search via a GraphQL parameter.

**Decision tree.** Simplest production: TEI (reranker) + Qdrant (native integration) — one API call for search + rerank. Infrastructure-minimal: TEI + pgvector — app code orchestrates the two-stage pipeline. Multi-tenant hybrid: TEI + Weaviate — native multi-tenancy + hybrid search + reranking in one GraphQL query.

**When to add reranking.** Measure first-stage NDCG@10. If quality meets requirements, reranking adds unnecessary complexity. If it is 5%+ below target, add reranking and measure the improvement — the typical gain is 5–15% on NDCG@10. Reranking reorders candidates rather than removing them, so a matched reranker-embedder pair rarely degrades quality (see the mismatch failure mode under What breaks for the exception); the cost is latency and infrastructure. Deploy when the quality gain justifies the latency budget — co-deployment on existing GPU infrastructure makes the marginal hardware cost near zero.
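The pgvector path above leaves the reorder step to app code. A minimal sketch, assuming the TEI `/rerank` response shape shown earlier on this page (`[{"index": ..., "score": ...}]`); `apply_rerank` is a hypothetical helper, and the SQL/HTTP calls appear as comments only:

```python
def apply_rerank(rows, rerank_response, top_k=5):
    """Reorder first-stage rows using a TEI /rerank response.

    `rerank_response` is the JSON list TEI returns — each entry's `index`
    points back into the candidate list that was sent in the request."""
    ranked = sorted(rerank_response, key=lambda r: r["score"], reverse=True)
    return [rows[r["index"]] for r in ranked[:top_k]]

# Full pipeline (services assumed running; names illustrative):
#   rows = cur.execute("SELECT id, text FROM docs ORDER BY embedding <=> %s LIMIT 100",
#                      (query_embedding,)).fetchall()
#   resp = requests.post("http://localhost:8080/rerank",
#                        json={"query": query, "texts": [r[1] for r in rows]}).json()
#   top = apply_rerank(rows, resp)

rows = [("doc-a", "shipping times"), ("doc-b", "warranty terms"), ("doc-c", "returns")]
resp = [{"index": 1, "score": 0.92}, {"index": 2, "score": 0.45}, {"index": 0, "score": 0.12}]
print(apply_rerank(rows, resp, top_k=2))  # [('doc-b', 'warranty terms'), ('doc-c', 'returns')]
```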

Setup walkthrough

  1. Install Ollama (for Nomic Embed Text embeddings), Docker (for TEI), and `pip install chromadb requests`.
  2. Serve the reranker with TEI via Docker: docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:latest --model-id BAAI/bge-reranker-v2-m3. (Ollama cannot pull or serve rerankers — TEI is the serving path.)
  3. Python pipeline: embed documents with Nomic Embed Text → retrieve top-20 via cosine similarity → rerank top-20 with BGE Reranker V2 M3 → return top-5:
import requests
docs = ["doc1 text...", "doc2 text..."]  # top-20 from vector search
query = "How do I set up local STT?"
resp = requests.post("http://localhost:8080/rerank",
    json={"query": query, "texts": docs})
top5 = resp.json()[:5]  # TEI returns [{"index": ..., "score": ...}], sorted by score
  4. First reranked result in <500 ms for 20 documents. The reranker reads query+document pairs jointly (cross-encoder) — far more accurate than embedding similarity alone.

The cheap setup

BGE Reranker V2 M3 (568M params) runs on CPU at ~10–30 documents/second — enough for RAG pipelines where you rerank top-20 or top-50 results. Any $300 laptop handles this. For higher throughput: a used GTX 1060 6 GB ($60) reaches roughly 200–500 short documents/second with batching (throughput falls with document length). Reranking is lightweight (the cross-encoder is one forward pass per query-doc pair) — the bottleneck is usually the upstream embedding retrieval, not the reranker.

The serious setup

A used [RTX 3060 12GB](/hardware/rtx-3060-12gb) ($200–250) handles production reranking at 1,000–2,000 short documents/second with large batches — enough to rerank the entire top-1,000 for every query in real time. For enterprise search serving hundreds of concurrent users, an [RTX 3090](/hardware/rtx-3090) 24 GB ($700–900) with HuggingFace TEI in Docker provides 3,000–5,000 docs/second with batching. Total build: ~$900–1,100. Reranking is light — the same GPU that runs embeddings also runs the reranker.

Common beginner mistake

The mistake: Skipping reranking entirely — retrieving top-5 directly from embedding similarity and accepting the quality hit.

Why it fails: Embedding models compress all semantic meaning into a single vector — the query "bank" could mean "river bank" or "financial bank," and the embedding can't distinguish at retrieval time. Embedding-only retrieval typically gets the right document into the top-20 but not the top-5.

The fix: Always use a two-stage pipeline: (1) retrieve 20–50 candidates via embedding similarity (fast, cheap), (2) rerank with a cross-encoder (slower, but it reads query+doc jointly). This gives ~20–40% higher recall@5. The 500ms reranking step on 20 docs is worth the quality jump every time.

Recommended setup for document reranking

Recommended hardware
Best GPU for local AI →
All workloads ranked across VRAM tiers.
Recommended runtimes

Browse all tools for runtimes that fit this workload.

Budget build
AI PC under $1,000 →
Best GPU for this task
Best GPU for local AI →

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

What breaks first

The errors most operators hit when running document reranking locally. Each links to a diagnose+fix walkthrough.

  • CUDA out of memory →
  • Model keeps crashing →
  • Ollama running slow →
  • llama.cpp too slow →

Before you buy

Verify your specific hardware can handle document reranking before committing money.

  • Will it run on my hardware? →
  • Custom compatibility check →
  • GPU recommender (4 questions) →

Featured models

BGE Reranker V2 M3

Related tasks

Text Embeddings