RUNLOCALAI · v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP·Fredoline Eruo
Embeddings & Retrieval · Open-weight · MIT

BGE (BAAI General Embedding)

by BAAI (Beijing Academy of Artificial Intelligence)

BAAI's open-weight embedding family. BGE-M3 is the canonical multilingual embedding model in 2026 (100+ languages, 8K context); BGE-Reranker-v2-m3 is its canonical companion cross-encoder reranker. Together they form the default open-weight RAG retrieval stack.

Best entry point for local use

Start with BGE-M3 via sentence-transformers on any GPU. It is the strongest open-weight multilingual embedding model, generating 1024-dim embeddings that support dense retrieval, sparse (lexical) retrieval, and multi-vector (ColBERT-style) retrieval from a single model. It covers 100+ languages and achieves the highest MTEB retrieval score among open-weight embedding models at its size (568M params). The model is small: FP16 needs ~1.1 GB VRAM, so it runs on any GPU with 2 GB+ VRAM, including integrated graphics. For English-only retrieval, BGE-large-en-v1.5 (335M params, 1024-dim) outperforms BGE-M3 on English MTEB by ~1.5 points at half the size. For re-ranking, BGE-Reranker-v2-m3 is the companion cross-encoder: pair M3 retrieval with M3 re-ranking for a two-stage pipeline. MIT license.
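The dense-retrieval path can be sketched in a few lines. This is a minimal illustration using toy vectors; in a real pipeline the 1024-dim embeddings would come from `SentenceTransformer("BAAI/bge-m3").encode(...)` (shown in a comment), and the corpus vectors would live in a vector index rather than a NumPy array.

```python
import numpy as np

# In practice the vectors come from BGE-M3, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("BAAI/bge-m3")
#   doc_vecs = model.encode(docs, normalize_embeddings=True)
# The toy 4-dim vectors below stand in for real 1024-dim embeddings.

def top_k_dense(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    """Return indices and cosine scores of the k nearest documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                   # cosine similarity per document
    order = np.argsort(-scores)[:k]  # highest-scoring first
    return order.tolist(), scores[order].tolist()

docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
idx, scores = top_k_dense(query, docs, k=2)
print(idx)  # → [0, 1]
```

With `normalize_embeddings=True` the dot product is already the cosine score, which is why most BGE serving setups store pre-normalized vectors.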

Deployment guidance

  • Single-user RAG: sentence-transformers + BGE-M3 FP16 on an RTX 3060 12 GB gives ~200 docs/second encoding throughput with an 8K-token max input.
  • Production serving: Infinity embedding server with BGE-M3 on an L4 24 GB serves ~500 embeddings/second at batch 32 with continuous batching.
  • CPU-only: llama.cpp embedding server with BGE-M3 GGUF Q8_0 on an Apple M3 reaches ~80 embeddings/second via Metal.
  • Multi-stage RAG: deploy BGE-M3 for first-pass dense retrieval, then BGE-Reranker-v2-m3 as a cross-encoder over the top-100 candidates; this two-stage combo gains +12% nDCG@10 over dense-only on BEIR.
  • Sparse retrieval: BGE-M3 outputs sparse token weights natively (no separate BM25 index needed); use the sparse_vector output for hybrid dense+sparse retrieval (another +5% nDCG@10).
  • Tokenization: BGE-M3 uses an XLM-RoBERTa tokenizer; input is truncated at 8192 tokens.
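The two-stage pattern above can be sketched generically: score the whole corpus with the cheap bi-encoder, then apply the expensive cross-encoder only to the shortlist. The toy scorers here are hypothetical stand-ins; in a real pipeline `dense_score` would come from BGE-M3 embeddings and `rerank_score` from bge-reranker-v2-m3 (e.g. FlagEmbedding's `FlagReranker.compute_score`).

```python
from typing import Callable, Sequence

def two_stage_retrieve(
    query: str,
    docs: Sequence[str],
    dense_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    first_pass_k: int = 100,
    final_k: int = 10,
) -> list:
    """Dense first pass over all docs, then cross-encoder re-rank of the shortlist."""
    # Stage 1: cheap bi-encoder scores over the whole corpus
    first = sorted(range(len(docs)),
                   key=lambda i: -dense_score(query, docs[i]))[:first_pass_k]
    # Stage 2: expensive cross-encoder only on the candidates
    return sorted(first, key=lambda i: -rerank_score(query, docs[i]))[:final_k]

# Hypothetical toy scorers for illustration only.
docs = ["cats purr", "dogs bark", "cats and dogs"]
dense = lambda q, d: sum(w in d for w in q.split())       # crude word overlap
rerank = lambda q, d: dense(q, d) - 0.5 * len(d.split())  # penalize long docs
print(two_stage_retrieve("cats purr", docs, dense, rerank,
                         first_pass_k=2, final_k=1))  # → [0]
```

The cost asymmetry is the point: the cross-encoder reads every (query, doc) pair jointly, so restricting it to the top-100 keeps latency bounded while capturing most of the nDCG gain.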

Featured models

Models in this family with our verdicts

  • BGE-M3
  • BGE-Reranker-v2-M3
