Best GPU for local RAG (retrieval-augmented generation)
An honest 2026 guide to picking GPUs for local RAG: running an embedding model and an LLM concurrently, VRAM math for hybrid workflows, and why most operators overspend on the embedding side.
The short answer
Local RAG = embedding model (small) + retrieval (CPU-bound) + LLM (the real GPU consumer). Most operators overspend assuming embeddings need a powerful GPU. They don't.
Single-GPU RAG: a used RTX 3090 24 GB at $800 runs the embedding model + a 70B-class LLM concurrently (low-bit quant fully in VRAM, or Q4 with partial CPU offload). The leverage pick.
Two-machine RAG: a cheap CPU box for embedding and indexing (no GPU needed; sentence-transformers runs fast on a Ryzen 7000-class CPU) + a dedicated GPU box for LLM serving. Often the smarter split.
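A minimal sketch of that split, assuming a small embedding model on the CPU (BGE-small here as a stand-in) and the LLM behind an OpenAI-compatible local server such as llama.cpp's llama-server; point the URL at localhost for the single-GPU setup or at the GPU box's LAN address for the two-machine split:

```python
# Embeddings on CPU, LLM on the GPU behind a local OpenAI-compatible server.
# Model names and the server URL are illustrative, not prescriptive.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cpu")  # <1 GB, CPU is fine

docs = [
    "The RTX 3090 has 24 GB of GDDR6X.",
    "KV cache grows linearly with context length.",
    "Sentence-transformers models run well on modern CPUs.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # (n_docs, dim)

def answer(query: str, top_k: int = 2) -> str:
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec                          # cosine similarity (vectors are normalized)
    context = "\n".join(docs[i] for i in np.argsort(-scores)[:top_k])
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",   # llama-server default; use the GPU box's IP for two machines
        json={
            "model": "local",
            "messages": [
                {"role": "system", "content": "Answer only from the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
            ],
        },
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

print(answer("How does KV cache scale with context?"))
```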
The picks, ranked by buyer-leverage
RTX 4060 Ti · 16 GB · $450-550 (2026 retail)
Embedding (<1 GB) + 13B LLM Q4 (~7 GB) + KV cache fits comfortably. Solo-user RAG default.
Good for:
- Solo / dev RAG workflows (<10K documents indexed)
- 13B-class LLM serving the RAG queries
- First-time RAG operators learning the stack
Not for:
- 70B LLM RAG (16 GB can't hold the LLM)
- Production multi-user RAG serving
- Document-heavy embedding workloads (CPU-only is often faster)
RTX 3090 · 24 GB · $700-1,000 (2026 used)
24 GB unlocks embedding + Llama 3.3 70B running concurrently (low-bit quant fully in VRAM, or Q4 with partial CPU offload). The leverage RAG pick.
Good for:
- RAG with a 70B-class LLM (low-bit quant) + embedding model running concurrently
- Multi-document context windows (long context KV)
- Production RAG serving (single-user / small team)
Not for:
- Embedding-only workloads (massively overspent; use CPU)
- Multi-user production RAG (vLLM tensor-parallel scales better)
- Buyers who hate used silicon
24 GB · $1,400-1,900 used / $1,800-2,200 new
Same 24 GB ceiling but Ada efficiency for sustained RAG serving. New + warranty.
Good for:
- Production RAG serving (1-5 concurrent users)
- RAG + image gen mixed workloads
- New + warranty preference for serious work
Not for:
- Multi-GPU operators (a used 3090 is cheaper for tensor-parallel)
- Single-user dev RAG (4060 Ti is enough)
- Embedding-only workloads
Honesty: why benchmark numbers on this page might not reflect your real experience
- tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
- Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as KV cache fills.
- Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
- Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
- Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
- Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
- Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.
We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via the contact page. See also our methodology and editorial philosophy.
How to think about VRAM tiers
RAG is hybrid: embedding model (~0.5-1 GB), LLM weights (the dominant consumer), KV cache (grows with retrieved-context length). Sizing should be LLM-driven; the embedding overhead is negligible.
- 4-8 GB — Embedding-only pipeline (no LLM). CPU + small embedding model.
- 12 GB — Embedding + 7B LLM. Solo / dev RAG.
- 16 GB — Embedding + 13B LLM Q4 with comfortable KV cache.
- 24 GB (RAG sweet spot) — Embedding + a 70B LLM (low-bit quant, or Q4 with partial offload) with retrieved-context headroom.
- 32 GB+ — Embedding + 70B LLM at very long retrieved context (32K+) or production multi-user.
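A back-of-envelope sizing sketch for picking a tier. Assumptions: GGUF-style quants where Q4_K_M averages roughly 4.5 bits per weight, ~1 GB for the embedding model, and a rough fixed runtime overhead; real usage varies by backend, quant format, and KV-cache settings.

```python
def rag_vram_gb(params_b: float, bits_per_weight: float, kv_cache_gb: float,
                embedding_gb: float = 1.0, overhead_gb: float = 1.5) -> float:
    """Rough total VRAM: LLM weights + KV cache + embedding model + runtime overhead."""
    weights_gb = params_b * bits_per_weight / 8   # e.g. 13B at ~4.5 bpw ≈ 7.3 GB
    return weights_gb + kv_cache_gb + embedding_gb + overhead_gb

print(round(rag_vram_gb(13, 4.5, kv_cache_gb=2.0), 1))   # ~11.8 -> fits a 16 GB card
print(round(rag_vram_gb(70, 2.3, kv_cache_gb=2.0), 1))   # ~24.6 -> tight on 24 GB (IQ2-class quant)
print(round(rag_vram_gb(70, 4.5, kv_cache_gb=4.0), 1))   # ~45.9 -> Q4 70B wants 48 GB or partial offload
```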
Frequently asked questions
Do I need a GPU for the embedding model?
Usually no. Sentence-transformers / E5 / BGE / Qwen3-Embedding all run fast on modern CPUs (~50-200 documents/sec on Ryzen 7000). GPU helps for very large batch indexing (>10K docs), but most operators don't need it. Save the GPU budget for the LLM.
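If in doubt, measure before buying: a quick throughput check with sentence-transformers on the CPU (the model name and corpus below are placeholders; swap in your own).

```python
# Rough CPU embedding throughput check.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cpu")

docs = ["A paragraph-sized chunk of text from your corpus."] * 1000

start = time.perf_counter()
model.encode(docs, batch_size=64, show_progress_bar=False)
elapsed = time.perf_counter() - start

print(f"{len(docs) / elapsed:.0f} docs/sec on CPU")
# If this already exceeds your indexing needs, the GPU budget belongs to the LLM.
```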
What's a sensible RAG hardware split?
Single-machine: a 24 GB GPU runs embedding + a 70B-class LLM (low-bit quant, or with partial offload). Two-machine: a cheap CPU box ($500) for indexing + retrieval, plus a dedicated GPU machine for the LLM. The latter often gives more total capability for the same budget.
Should I use a smaller LLM with bigger context vs larger LLM with smaller context?
Workload-dependent. For factual RAG (precise answers from retrieved docs), a 13B model with 16K context often outperforms 70B at 4K context. For analytical RAG (synthesis across many docs), larger models with longer context help more. Test on your specific corpus.
How does RAG VRAM math differ from chat?
RAG context windows are larger than chat (retrieved docs + query), and KV cache grows linearly with context. For a 70B-class GQA model (Llama-style: 80 layers, 8 KV heads, head dim 128), the FP16 KV cache costs roughly 0.3 MB per token: about 1.3 GB at 4K context and 10-11 GB at 32K, roughly half that with an 8-bit KV cache. Plan VRAM for max-context, not min-context.
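A small calculator makes that growth concrete, assuming Llama-3-70B-style shapes (80 layers, 8 KV heads via GQA, head dimension 128); adjust the parameters for your model and KV-cache precision.

```python
def kv_cache_gb(n_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per head-dim element, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / 1e9

print(kv_cache_gb(4_096))                       # ~1.3 GB at 4K context (FP16 KV)
print(kv_cache_gb(32_768))                      # ~10.7 GB at 32K context (FP16 KV)
print(kv_cache_gb(32_768, bytes_per_elem=1))    # ~5.4 GB at 32K with an 8-bit KV cache
```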
Go deeper
- Best GPU for local AI (pillar) — All workloads ranked
- Best GPU for Llama — Llama models are common RAG LLM picks
- Retrieval task — Full task page with workflow + tooling guidance
- LlamaIndex — Reference RAG framework
When it doesn't work
Hardware bought, set up correctly, still failing? The highest-volume local-AI errors and their fixes.
Common alternatives readers consider:
- If your budget is tighter → best budget GPU for local AI
- If you'd rather buy used → best used GPU for local AI
- If you're on Apple Silicon → best Mac for local AI
- If you're not sure what fits your build → the will-it-run checker
- If you don't want to buy anything yet → our editorial philosophy