Best GPU for Ollama
Honest 2026 GPU buyer guide for Ollama: 4060 Ti 16 GB, used 3090, 4090, 5090. Multi-model serving, FLASH_ATTENTION, real tok/s by tier.
The short answer
For most Ollama operators, a used RTX 3090 24 GB at $700-1,000 is the right answer. 24 GB unlocks 70B Q4 with comfortable context, and Ollama's auto-CUDA-detect just works.
If you want new with warranty and your daily workload caps at 13-32B Q4, the RTX 4060 Ti 16 GB at $450-550 is the value entry. The RTX 4090 is the buy-and-don't-look-back 24 GB pick.
Ollama-specific reframe: the 32 GB on an RTX 5090 isn't about running bigger models — it's about running multiple models concurrently with OLLAMA_KEEP_ALIVE + OLLAMA_NUM_PARALLEL without VRAM thrashing.
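If concurrent multi-model serving is the reason you're shopping a 24 GB or 32 GB card, these are the knobs in question. A minimal sketch for a systemd-managed Linux install; the values are illustrative, and OLLAMA_MAX_LOADED_MODELS (not mentioned above) is the companion variable that caps how many models may sit in VRAM at once:

```bash
# Persist serving knobs on the Ollama service (systemd-managed Linux install assumed)
sudo systemctl edit ollama
# In the override that opens, add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=1h"         # keep models resident after each request
#   Environment="OLLAMA_NUM_PARALLEL=2"        # concurrent requests per loaded model
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"   # how many models may occupy VRAM at once
sudo systemctl restart ollama
```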
The picks, ranked by buyer-leverage
RTX 4060 Ti · 16 GB · $450-550 (2026 retail)
The cheapest CUDA card Ollama can drive seriously. 13-32B Q4 comfortable; 70B Q4 fits at short context only.
Best for:
- First-time Ollama users wanting CUDA + warranty
- Single-model workflows (one 13-32B model loaded)
- Builds prioritizing efficiency (165W TDP, quiet operation)
Not for:
- Multi-model serving (16 GB blocks OLLAMA_KEEP_ALIVE on multiple 13B+ models)
- 70B inference at usable context (use 24 GB)
- Long agent loops (288 GB/s bandwidth bottlenecks decode)
RTX 3090 · 24 GB · $700-1,000 (2026 used)
The single highest-leverage Ollama buy in 2026. 24 GB unlocks 70B Q4 + room for parallel multi-model serving.
Best for:
- Ollama operators running 70B Q4 day-to-day
- Multi-model setups (OLLAMA_KEEP_ALIVE on 2-3 small models)
- Best $/GB-VRAM at the 24 GB tier
Not for:
- Buyers who hate used silicon and want a warranty
- Power-budget-constrained builds (350W TDP)
- Sustained training workloads (Ada is more efficient per watt)
RTX 4090 · 24 GB · $1,400-1,900 used / $1,800-2,200 new
The 'buy it and don't look back' 24 GB pick for Ollama. Mature CUDA stack, every Ollama feature works flawlessly.
Best for:
- Buyers wanting maximum 24 GB performance new with warranty
- Long-context agent loops (1008 GB/s bandwidth holds up)
- Single-card setups prioritizing ecosystem maturity
Not for:
- Buyers willing to accept used silicon (3090 saves $400-700)
- Multi-GPU rigs (used 3090×2 = 48 GB cheaper)
- Stretching budget toward 32 GB on 5090
RTX 5090 · 32 GB · $2,000-2,500 (2026 retail)
32 GB is overkill for any single Ollama model — but it's the right answer for parallel multi-model serving + 32K+ context windows.
Best for:
- Production Ollama servers (multi-model, multi-user)
- FP16 32B inference for evaluation workflows
- 32K+ context agent loops without flash-attention compromises
Not for:
- Solo Ollama users (4090 covers everything you need)
- Multi-GPU operators (4-slot form factor brutal in consumer cases)
- PSU-constrained builds (575W TDP needs 1000W+)
Honesty: Why benchmark numbers on this page might not reflect your real experience
- tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
- Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as KV cache fills.
- Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
- Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
- Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
- Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
- Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.
We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via the contact page. See also our methodology and editorial philosophy.
How to think about VRAM tiers
Ollama auto-detects GPU and falls back to CPU silently if the model doesn't fit — which is why most 'Ollama is slow' tickets are actually 'wrong VRAM tier' tickets. Pick the tier that fits the largest model you'll actually run, with headroom for KV cache.
- 8 GB — Ollama works but caps at 7B Q4 only. Model library is artificially limited.
- 12 GB — 13B Q4 territory. Tight for KV cache at long context.
- 16 GB — 13-32B Q4 comfortable; 70B Q4 short-context only. The minimum modern tier.
- 24 GB (the Ollama sweet spot) — 70B Q4 with 4-8K context. Multi-model serving with OLLAMA_KEEP_ALIVE on 2-3 small models. Most users land here.
- 32 GB+ — FP16 32B + 32K context windows. Production multi-model serving. Diminishing returns for solo workflows.
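Before committing to a tier, a quick sanity check is to compare a model's on-disk size (a rough floor for the VRAM its weights will occupy) against the VRAM you actually have free. A sketch assuming an NVIDIA card and a model already pulled:

```bash
# On-disk size is a rough floor for the VRAM the weights will need; leave headroom for KV cache
ollama list

# Free VRAM right now, in MiB (NVIDIA cards)
nvidia-smi --query-gpu=memory.free --format=csv,noheader
```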
Compare these picks head-to-head
Same 24 GB. Used vs new — when each wins for Ollama specifically.
When 32 GB matters for Ollama (hint: parallel multi-model).
Used 24 GB vs new 16 GB — Ollama's VRAM ceiling decides.
Both 16 GB — bandwidth advantage on Ollama decode.
Frequently asked questions
Will Ollama use my GPU automatically?
Yes on NVIDIA with current drivers (CUDA), AMD with ROCm + gfx-version override, and Apple Silicon (Metal). Verify with `ollama ps` while a model is loaded — the PROCESSOR column shows GPU/CPU split. If it shows '100% GPU' you're good. If it shows CPU at all, the model didn't fit and Ollama fell back.
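For reference, a sketch of what that check looks like; the model, ID, and size shown are illustrative, not measured:

```bash
ollama ps
# NAME           ID              SIZE      PROCESSOR    UNTIL
# llama3.1:8b    42182419e950    6.7 GB    100% GPU     4 minutes from now
#
# Anything other than "100% GPU" (e.g. "45%/55% CPU/GPU") means part of the model spilled to system RAM.
```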
What's a good tok/s on Ollama with a 3090 / 4090?
Realistic ranges on RTX 3090 / 4090: 7B Q4 ~80-120 tok/s, 13B Q4 ~50-70, 32B Q4 ~25-35, 70B Q4 ~12-18. If you're 5-10x below these, Ollama has fallen back to CPU — usually because the model doesn't fit VRAM. See /troubleshooting/ollama-slow.
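You can measure your own decode speed in one command; the --verbose flag prints timing stats after the response (the rate in the comment below is illustrative):

```bash
ollama run llama3.1:8b --verbose "Explain KV caching in two sentences."
# ...response...
# eval rate:            97.41 tokens/s    <- decode speed; compare against the ranges above
```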
Can I run multiple models simultaneously on one GPU?
Yes via OLLAMA_KEEP_ALIVE (keeps models in VRAM longer) + OLLAMA_NUM_PARALLEL (concurrent requests per model). On a 24 GB card, two 13B Q4 models fit comfortably; three is tight. On a 16 GB card, only one 13B+ model at a time works reliably.
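Those variables are set server-wide, but keep_alive can also be overridden per request through Ollama's HTTP API. A minimal sketch against a default local server (the model name is just an example):

```bash
# Ask the server to keep this model resident for 30 minutes after the request completes
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "warm-up",
  "keep_alive": "30m"
}'
```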
Why is Ollama slower than llama.cpp directly on my GPU?
Ollama's defaults are conservative: smaller batch, flash attention off by default, modest context. Set OLLAMA_FLASH_ATTENTION=1 (which also unlocks KV-cache quantization to shrink cache memory), set num_ctx to match your workload, and tune num_thread if any layers spill to CPU. That often closes 80%+ of the gap.
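A sketch of those tweaks for a manually started server; OLLAMA_KV_CACHE_TYPE isn't mentioned above, but it's the variable that actually shrinks the KV cache once flash attention is on, and the Modelfile route bakes a larger context into a named variant instead of passing num_ctx per request:

```bash
# Server side: enable flash attention, optionally quantize the KV cache (needs flash attention)
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve

# Client side: bake a larger context window into a model variant
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8k -f Modelfile
```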
Should I switch from Ollama to vLLM for production?
Yes if you're serving 10+ concurrent users or need maximum throughput. vLLM's paged KV cache + continuous batching outperforms Ollama at multi-user scale. Ollama wins for solo + dev workflows where simplicity matters more than peak throughput.
Does Ollama support AMD GPUs?
Yes via ROCm on Linux (and recent Windows ROCm). Set HSA_OVERRIDE_GFX_VERSION=11.0.0 for RDNA 3 cards (7900 XTX/XT). For older cards or Windows-native, llama.cpp's Vulkan backend is more reliable than Ollama's ROCm path.
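If your card does need the override, it has to be visible to the Ollama server process, not just your interactive shell. A minimal sketch for a manually started server (systemd installs would set it with an Environment= line in the service override instead):

```bash
# Make the gfx override visible to the server process, then start it
export HSA_OVERRIDE_GFX_VERSION=11.0.0
ollama serve
```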
Go deeper
- Best GPU for local AI (pillar) — All picks ranked across runtime ecosystems
- Best GPU for KoboldCpp — Sibling runtime — different VRAM/feature tradeoffs
- Best local AI setup for beginners — First-week walkthrough — model + runtime + hardware
- 16 GB vs 24 GB VRAM — Whether the extra 8 GB pays for Ollama
- Best used GPU — Used 3090 — the value pick that runs everything Ollama can throw at it
- Ollama running slow / on CPU — Diagnose the silent CPU fallback in 3 minutes
- Ollama port conflict — When Ollama won't start
- Will it run on my hardware? — Pre-purchase compatibility check
When it doesn't work
Hardware bought, set up correctly, still failing? The highest-volume local-AI errors and their fixes:
Common alternatives readers consider:
- If your budget is tighter → best budget GPU for local AI
- If you'd rather buy used → best used GPU for local AI
- If you're on Apple Silicon → best Mac for local AI
- If you're not sure what fits your build → the will-it-run checker
- If you don't want to buy anything yet → our editorial philosophy