BGE M3
BAAI's multilingual embedding flagship. Dense + sparse + ColBERT-style multi-vector. The de-facto open multilingual embedding pick.
Positioning
BAAI's BGE-M3 (Multi-Functionality, Multi-Linguality, Multi-Granularity) is the canonical open-weight embedding model in 2026: the model that essentially replaced OpenAI text-embedding-ada-002 as the default for self-hosted RAG pipelines. It has ~568M parameters (built on an XLM-RoBERTa backbone), an 8192-token context window, and support for 100+ languages. It is released under the MIT license, which permits commercial use. The model can produce three output formats from the same encoder: dense embeddings (1024-dim), multi-vector embeddings (ColBERT-style late interaction), and sparse lexical weights, which makes it unusually flexible for hybrid retrieval pipelines.
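A minimal sketch of requesting all three outputs in one call. It assumes the `FlagEmbedding` Python package and its `BGEM3FlagModel` class, neither of which is named in the text above; the helper names `encode_all_modes` and `describe_outputs` are illustrative. The model is imported lazily so the helpers can be read and exercised without downloading weights.

```python
from typing import Any, Dict, List


def encode_all_modes(texts: List[str], model_name: str = "BAAI/bge-m3") -> Dict[str, Any]:
    """Encode texts once and return dense, sparse, and multi-vector outputs.

    Assumption: `pip install FlagEmbedding`; weights are downloaded on first use.
    """
    from FlagEmbedding import BGEM3FlagModel  # lazy import: heavy dependency

    model = BGEM3FlagModel(model_name, use_fp16=True)
    return model.encode(
        texts,
        return_dense=True,         # one 1024-dim vector per text
        return_sparse=True,        # per-token lexical weights
        return_colbert_vecs=True,  # one vector per token (late interaction)
    )


def describe_outputs(output: Dict[str, Any]) -> Dict[str, int]:
    """Report how many items each output mode produced."""
    return {key: len(value) for key, value in output.items()}
```

At the time of writing the returned dictionary is keyed by `dense_vecs`, `lexical_weights`, and `colbert_vecs`; treat those names as an assumption to verify against the installed version.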
Strengths
- Best-in-class multilingual retrieval. Genuinely strong on 100+ languages — Arabic, Chinese, Japanese, Korean, Russian, Spanish, French, German, Hindi all well-supported.
- 8K context is uncommon for embeddings. Most open-weight embedders cap at 512 tokens; BGE-M3's 8K window enables long-document chunk retrieval without aggressive splitting.
- Three retrieval modes simultaneously. Dense + multi-vector + sparse from one forward pass — your pipeline can hybrid-rank without running multiple models.
- MIT license = unconstrained commercial use.
- Small and fast. At ~568M parameters it runs at 1000+ docs/second on a single consumer GPU, so no expensive serving infrastructure is needed.
- Strong on the MTEB benchmark for retrieval, similarity, and classification — competitive with much larger embedding models.
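To make the "three retrieval modes" point concrete, here is a rough sketch of how the three signals could be scored and fused for one query-document pair. It is pure Python for readability and is not BAAI's implementation; the function names and the fusion weights `(0.4, 0.2, 0.4)` are illustrative assumptions that would need tuning on real data.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def dense_score(q: Vector, d: Vector) -> float:
    """Cosine similarity between one query vector and one document vector."""
    return dot(normalize(q), normalize(d))


def sparse_score(q: Dict[str, float], d: Dict[str, float]) -> float:
    """Lexical match: sum of weight products over tokens both sides share."""
    return sum(w * d[tok] for tok, w in q.items() if tok in d)


def multi_vector_score(q_vecs: Sequence[Vector], d_vecs: Sequence[Vector]) -> float:
    """ColBERT-style MaxSim: best document token per query token, averaged."""
    best = [max(dense_score(q, d) for d in d_vecs) for q in q_vecs]
    return sum(best) / len(best)


def hybrid_score(dense: float, sparse: float, multi: float,
                 weights: Sequence[float] = (0.4, 0.2, 0.4)) -> float:
    """Weighted average of the three scores (weights are hypothetical)."""
    w_dense, w_sparse, w_multi = weights
    total = w_dense + w_sparse + w_multi
    return (w_dense * dense + w_sparse * sparse + w_multi * multi) / total
```

Note that the sparse score is not bounded the way cosine similarity is, so in practice the three signals usually need rescaling before they are combined.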
Limitations
- Not as strong as the largest embedding models on English-only, domain-specific tasks. OpenAI text-embedding-3-large and Cohere embed-english-v3.0 still win on the MTEB English subset.
- Code embeddings are not its strength. For code retrieval, voyage-code-3 or specialized code embedders win.
- Reranker is a separate model. BGE Reranker V2 M3 is the canonical companion reranker — pipelines need both for best results.
- The older XLM-RoBERTa backbone makes the architecture conservative; newer LLM-based embedders may surpass it on specific benchmarks.
Real-world performance
- vs OpenAI text-embedding-3-small (API): BGE-M3 is competitive on multilingual tasks and comparable on English, at near-zero marginal cost when self-hosted versus $0.02 per 1M tokens through the API. Self-hosted economics dominate at any scale.
- vs Cohere embed-multilingual-v3.0 (API): Comparable multilingual quality, BGE-M3 wins on cost (self-hosted) and 8K context.
- vs e5-large-v2: Older open-weight embedder. BGE-M3 is a strict upgrade on multilingual quality and context length.
- vs voyage-3-lite (API): Voyage AI wins on English domain-specific quality but BGE-M3 wins on cost + multilingual + flexibility.
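The cost comparison can be put into rough numbers. The sketch below uses the $0.02 per 1M tokens API price and the 1000 docs/second GPU throughput quoted in this article; the corpus size, tokens per document, and GPU hourly price are hypothetical values chosen only for illustration.

```python
API_PRICE_PER_MILLION_TOKENS = 0.02  # USD, the API price quoted in this article


def api_cost_usd(total_tokens: int,
                 price_per_million: float = API_PRICE_PER_MILLION_TOKENS) -> float:
    """Cost of embedding `total_tokens` through a per-token-priced API."""
    return total_tokens / 1_000_000 * price_per_million


def self_hosted_cost_usd(total_docs: int, docs_per_second: float,
                         gpu_usd_per_hour: float) -> float:
    """Compute-only cost of embedding `total_docs` on a rented GPU."""
    hours = total_docs / docs_per_second / 3600
    return hours * gpu_usd_per_hour


# Hypothetical corpus: 10M documents averaging 500 tokens each.
docs = 10_000_000
tokens = docs * 500

api_usd = api_cost_usd(tokens)
# Hypothetical GPU rental price of $0.50/hour at the quoted 1000 docs/second.
gpu_usd = self_hosted_cost_usd(docs, docs_per_second=1000, gpu_usd_per_hour=0.50)

print(f"API: ${api_usd:.2f}  self-hosted compute: ${gpu_usd:.2f}")
```

This counts compute only. Engineering time, idle GPU hours, and operations are left out, and they can matter more than the raw per-token price at small volumes.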
Should you run this locally?
Yes if you have any RAG / search / similarity / classification pipeline. BGE-M3 is the canonical answer for "what embedding model should I self-host" in 2026 — there is essentially no scenario where you should pay OpenAI / Cohere embedding API fees instead of running BGE-M3 unless you specifically need the very-best English-only performance and money is no object.
Pair with: BGE Reranker V2 M3 for retrieve-then-rerank pipelines. The combination is the canonical open-weight RAG retrieval stack.
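The retrieve-then-rerank pattern can be sketched independently of any model. In the outline below, `embed_score` stands in for embedding similarity (for example from BGE-M3) and `rerank_score` for a cross-encoder such as BGE Reranker V2 M3; both are passed in as plain callables, and all names and default cutoffs are illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    embed_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    retrieve_k: int = 100,
    final_k: int = 10,
) -> List[Tuple[str, float]]:
    """Shortlist with a cheap scorer, then reorder with an expensive one."""
    # Stage 1: cheap embedding similarity over the whole corpus.
    candidates = sorted(
        corpus, key=lambda doc: embed_score(query, doc), reverse=True
    )[:retrieve_k]
    # Stage 2: expensive cross-encoder scoring over the shortlist only.
    reranked = [(doc, rerank_score(query, doc)) for doc in candidates]
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked[:final_k]
```

In a real deployment stage 1 would normally be an approximate nearest-neighbour query against a vector database rather than a brute-force sort; the sort here only keeps the example self-contained.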
How it compares
- vs BGE Reranker V2 M3: Different roles. BGE-M3 is the encoder/embedder; Reranker V2 is the cross-encoder reranker. Use both in a retrieve-then-rerank pipeline.
- vs older bge-large-en: BGE-M3 is the strict upgrade — multilingual, longer context, three modes simultaneously.
- vs e5-mistral-7b-instruct: e5-mistral-7b is a 7B-parameter LLM-based embedder — much heavier inference, marginal quality wins.
- vs OpenAI text-embedding-3-large (API): API wins on best English quality; BGE-M3 wins on cost + multilingual + open-weight.
Run this yourself
- CPU-only: Functional via llama.cpp or SentenceTransformers. ~50-150 docs/sec on a modern CPU.
- Single GPU: Any modern GPU with 4+ GB VRAM. ~1000-3000 docs/sec on a consumer GPU.
- vLLM is not the right tool here; embeddings serve well via Hugging Face's Text Embeddings Inference (TEI).
- Production: TEI server + your favorite vector DB (Qdrant, pgvector, Weaviate).
- Vendor: BAAI. Hugging Face repository: BAAI/bge-m3.
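For the TEI route, a minimal standard-library client might look like the sketch below. It assumes a TEI server is already running and serving BAAI/bge-m3, that it exposes a `POST /embed` endpoint accepting `{"inputs": [...]}`, and that it listens at the hypothetical address `http://localhost:8080`; the function names are illustrative.

```python
import json
import urllib.request
from typing import List


def build_embed_request(base_url: str, texts: List[str]) -> urllib.request.Request:
    """Build the HTTP request for a TEI-style /embed endpoint."""
    payload = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url: str, texts: List[str], timeout: float = 30.0) -> List[List[float]]:
    """Send texts to the server and return one dense vector per text."""
    request = build_embed_request(base_url, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires a running server at this hypothetical address):
# vectors = embed("http://localhost:8080", ["What is BGE-M3?"])
```

The returned vectors can then be written to whichever vector database the pipeline uses.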
Strengths
- MIT license
- Multilingual
- Dense + sparse + multi-vector
Weaknesses
- No instruction-tuned variant
Quantization variants
Each quantization trades model quality for file size and VRAM. Only the FP16 weights are listed for this model.
| Quantization | File size | VRAM required |
|---|---|---|
| FP16 | 1.1 GB | 2 GB |
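The FP16 file size in the table follows from the parameter count. A quick check, assuming the ~568M figure quoted in this article and 2 bytes per FP16 parameter (the function name is illustrative):

```python
def weights_size_gb(num_parameters: int, bytes_per_parameter: float) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9


BGE_M3_PARAMETERS = 568_000_000  # approximate count quoted in this article

fp16_gb = weights_size_gb(BGE_M3_PARAMETERS, 2)  # FP16 stores 2 bytes per parameter
print(f"FP16 weights: ~{fp16_gb:.1f} GB")
```

The VRAM column is higher than the file size because activations and runtime overhead sit on top of the weights; the exact margin depends on batch size and sequence length.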
Get the model
HuggingFace (BAAI/bge-m3): original weights. This is the source repository, so you need to quantize the weights yourself.
Frequently asked
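What's the minimum VRAM to run BGE M3?
About 2 GB at FP16 according to the quantization table; the setup notes recommend a GPU with 4+ GB.
Can I use BGE M3 commercially?
Yes. It is released under the MIT license.
What's the context length of BGE M3?
8192 tokens.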
Source: huggingface.co/BAAI/bge-m3
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.