BGE Reranker v2 M3
BGE M3 reranker. Cross-encoder for re-ranking RAG candidates; multilingual.
Positioning
BAAI's BGE Reranker V2 M3 is the canonical companion reranker to BGE-M3 and the default open-weight cross-encoder reranker for production RAG pipelines in 2026. ~568M parameters (XLM-RoBERTa base, same architecture as BGE-M3 but trained as a cross-encoder), 8K context, multilingual coverage matching BGE-M3 (100+ languages). Released under MIT license — fully permissive commercial use. The model takes (query, document) pairs and outputs a relevance score — used as the second stage in retrieve-then-rerank pipelines after fast first-stage retrieval via BGE-M3 or other dense embedders.
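The (query, document) → score interface can be sketched as follows. `score_pairs` is a hypothetical stand-in for whatever backend scores the pairs (e.g. Sentence Transformers' `CrossEncoder.predict` on `BAAI/bge-reranker-v2-m3`); this is a sketch of the pattern, not a production implementation:

```python
from typing import Callable, List, Sequence, Tuple

def rerank(
    query: str,
    docs: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], List[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair with a cross-encoder backend and
    return the top_k documents sorted by descending relevance score."""
    pairs = [(query, d) for d in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# With a real backend (assumption — requires downloading the model):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("BAAI/bge-reranker-v2-m3")
#   results = rerank(q, docs, lambda p: model.predict(p).tolist())
```

The key property of the cross-encoder interface: every candidate costs one forward pass, which is why it only runs on a short candidate list.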
Strengths
- Best-in-class open-weight reranker for multilingual RAG pipelines.
- Tight integration with BGE-M3: same architecture base, same multilingual coverage, designed to chain.
- 8K context handling matches BGE-M3 — long-document chunks rerank without truncation issues.
- MIT license = unconstrained commercial use.
- Small and fast. At 568M parameters, it reranks hundreds of (query, doc) pairs per second on a single GPU.
- Real quality lift over no-reranker baseline. Adding BGE Reranker V2 M3 to a BGE-M3 retrieval pipeline typically improves NDCG@10 by 8-15% vs dense-only retrieval.
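For reference, NDCG@10 — the metric behind the 8-15% figure — can be computed from graded relevance labels like this (a minimal sketch using the linear-gain formulation; some tools use 2^rel − 1 instead):

```python
import math
from typing import Sequence

def dcg(rels: Sequence[float], k: int) -> float:
    # Discounted cumulative gain: rel_i / log2(i + 1) for 1-indexed rank i.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels: Sequence[float], k: int = 10) -> float:
    """ranked_rels: the relevance label of each result, in ranked order.
    Normalizes DCG by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(ranked_rels, reverse=True), k)
    return dcg(ranked_rels, k) / ideal if ideal > 0 else 0.0
```

A reranker improves NDCG@10 by moving highly relevant documents toward the top of the list, where the logarithmic discount is smallest.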
Limitations
- Cross-encoder inference is more expensive than dense retrieval. Each (query, doc) pair requires a forward pass — only practical for re-ranking the top-N (typically 50-200) candidates from first-stage dense retrieval.
- Not as strong as the best proprietary rerankers on specific English-domain tasks. Cohere Rerank 3 and voyage-rerank-2 may win on English-only benchmarks.
- Code reranking is not its strength. For code retrieval reranking, specialized code rerankers win.
- Architecture is conservative. Newer cross-encoders may surpass on specific MTEB reranking benchmarks but BGE Reranker V2 M3 remains the default for "good enough" plus open-weight.
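A quick back-of-the-envelope on why cross-encoding is limited to the top-N: latency scales linearly with the number of pairs scored. Using ~200 pairs/sec (mid-range of the consumer-GPU throughput cited in this card — an assumption, measure on your own hardware):

```python
def rerank_latency_s(num_candidates: int, pairs_per_sec: float) -> float:
    """Cross-encoder cost is one forward pass per (query, doc) pair,
    so per-query latency is simply N / throughput."""
    return num_candidates / pairs_per_sec

# top-100 rerank at 200 pairs/sec: ~0.5 s per query — practical.
# cross-encoding a 1M-doc corpus: ~5000 s per query — never do this;
# that is the job of the dense first stage.
```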
Real-world performance
- vs Cohere Rerank 3 (API): Cohere wins on best-in-class English. BGE Reranker V2 M3 wins on cost (self-hosted), multilingual, and unconstrained commercial use.
- vs voyage-rerank-2 (API): voyage-rerank-2 wins on best English domain quality; BGE Reranker V2 M3 wins on cost + multilingual.
- vs no-reranker dense retrieval: 8-15% NDCG@10 improvement on most retrieval tasks. Worth the inference cost for accuracy-sensitive pipelines.
- vs older bge-reranker-large: Strict upgrade with multilingual + 8K context.
Should you run this locally?
Yes if you have any RAG pipeline where retrieval quality matters. The retrieve-then-rerank pattern (BGE-M3 dense retrieval → BGE Reranker V2 M3 cross-encoder reranking → top-K to LLM context) is the canonical open-weight RAG retrieval architecture in 2026.
Pair with: BGE-M3 for first-stage dense retrieval. The combination is the default open-weight RAG retrieval stack.
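The retrieve-then-rerank pattern in schematic form: dense first-stage retrieval via dot product over precomputed normalized embeddings, then a cross-encoder pass over the survivors only. `embed` and `score_pairs` are placeholders for BGE-M3 and BGE Reranker V2 M3 calls (e.g. via the FlagEmbedding package); this is a sketch of the architecture, not a tuned implementation:

```python
import numpy as np

def retrieve_then_rerank(query, corpus, corpus_emb, embed, score_pairs,
                         first_stage_n=100, top_k=10):
    """corpus: list of N documents; corpus_emb: (N, d) L2-normalized
    embeddings; embed(text) -> (d,) normalized vector;
    score_pairs(pairs) -> list of cross-encoder relevance scores."""
    # Stage 1: dense retrieval — cosine similarity via dot product.
    sims = corpus_emb @ embed(query)
    cand_idx = np.argsort(-sims)[:first_stage_n]
    candidates = [corpus[i] for i in cand_idx]
    # Stage 2: cross-encoder rerank of the shortlisted candidates only.
    scores = score_pairs([(query, d) for d in candidates])
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [(candidates[i], scores[i]) for i in order[:top_k]]
```

The top-K documents returned here are what you pack into the LLM's context window.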
How it compares
- vs BGE-M3: Different roles. BGE-M3 is the dense embedder (encoder); Reranker V2 M3 is the cross-encoder reranker. Use both in a retrieve-then-rerank pipeline.
- vs older bge-reranker-large: V2 M3 is the strict upgrade — multilingual, 8K context.
- vs Cohere Rerank 3 (API): API wins on English; BGE wins on cost + multilingual + unconstrained license.
- vs cross-encoder/ms-marco-MiniLM-L-12-v2: Older smaller cross-encoder. BGE Reranker V2 M3 strict upgrade.
Run this yourself
- CPU-only: Functional via SentenceTransformers CrossEncoder API. 10-30 pairs/sec on modern CPU.
- Single GPU: Any modern GPU with 4+ GB VRAM. 100-500 pairs/sec on consumer GPU.
- Production: Text Embeddings Inference (TEI) supports rerankers — same serving infrastructure as embeddings.
- Pipeline pattern: BGE-M3 retrieves 100 candidates → BGE Reranker V2 M3 reranks → top-10 to LLM.
- Vendor: BAAI / Hugging Face: BAAI/bge-reranker-v2-m3.
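When serving with TEI, reranking is a POST to the server's `/rerank` endpoint with a single query and a list of candidate texts. The sketch below assumes a TEI instance running `BAAI/bge-reranker-v2-m3` at `localhost:8080` (the URL and port are assumptions; field names follow TEI's rerank API):

```python
import json
from urllib import request

def tei_rerank_payload(query: str, texts: list) -> dict:
    # TEI's /rerank takes one query plus the candidate texts to score.
    return {"query": query, "texts": texts}

def tei_rerank(query, texts, url="http://localhost:8080/rerank"):
    """POST to a running TEI server; the response is a list of
    {'index': i, 'score': s} entries sorted by relevance."""
    body = json.dumps(tei_rerank_payload(query, texts)).encode()
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires a live TEI server
        return json.loads(resp.read())
```

Because TEI serves both embedders and rerankers, the same deployment tooling covers the full retrieve-then-rerank stack.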
Quantization variants
Each quantization trades model quality for file size and VRAM; for this reranker, the FP16 release is the standard deployment format.
| Quantization | File size | VRAM required |
|---|---|---|
| FP16 | 1.1 GB | 2 GB |
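The FP16 file size in the table follows directly from the parameter count: 568M parameters × 2 bytes per FP16 weight ≈ 1.1 GB, with activation and workspace overhead accounting for the gap up to the 2 GB VRAM figure. A quick check:

```python
def fp16_weight_gb(num_params: float) -> float:
    # FP16 stores each parameter in 2 bytes.
    return num_params * 2 / 1e9

print(round(fp16_weight_gb(568e6), 2))  # 1.14 — matches the ~1.1 GB in the table
```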
Get the model
Original weights on Hugging Face: BAAI/bge-reranker-v2-m3. No prebuilt quantizations — quantize from the source repository if you need them.
Frequently asked
What's the minimum VRAM to run BGE Reranker v2 M3? About 2 GB for the FP16 weights; any modern GPU with 4+ GB VRAM runs it comfortably.
Can I use BGE Reranker v2 M3 commercially? Yes — it is released under the MIT license, which permits unrestricted commercial use.
What's the context length of BGE Reranker v2 M3? 8K tokens, matching BGE-M3.
Source: huggingface.co/BAAI/bge-reranker-v2-m3
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.