Hardware buyer guide · 4 picks · Editorial · Reviewed May 2026

Best GPU for Ollama

Honest 2026 GPU buyer guide for Ollama: 4060 Ti 16 GB, used 3090, 4090, 5090. Multi-model serving, FLASH_ATTENTION, real tok/s by tier.

By Fredoline Eruo · Last reviewed 2026-05-08

The short answer

For most Ollama operators, a used RTX 3090 24 GB at $700-1,000 is the right answer. 24 GB unlocks 70B Q4 with comfortable context, and Ollama's auto-CUDA-detect just works.

If you want new with warranty and your daily workload caps at 13-32B Q4, the RTX 4060 Ti 16 GB at $450-550 is the value entry. The RTX 4090 is the buy-and-don't-look-back 24 GB pick.

Ollama-specific reframe: the 32 GB on an RTX 5090 isn't about running bigger models — it's about running multiple models concurrently with OLLAMA_KEEP_ALIVE + OLLAMA_NUM_PARALLEL without VRAM thrashing.

The picks, ranked by buyer-leverage

#1

RTX 4060 Ti 16 GB

full verdict →

16 GB · $450-550 (2026 retail)

The cheapest CUDA card Ollama can drive seriously. 13-32B Q4 comfortable; 70B Q4 fits at short context only.

Buy if
  • First-time Ollama users wanting CUDA + warranty
  • Single-model workflows (one 13-32B model loaded)
  • Builds prioritizing efficiency (165W TDP, quiet operation)
Skip if
  • Multi-model serving (16 GB blocks OLLAMA_KEEP_ALIVE on multiple 13B+ models)
  • 70B inference at usable context (use 24 GB)
  • Long agent loops (288 GB/s bandwidth bottlenecks decode)
Affiliate disclosure: we earn a small commission on purchases made through these links. The opinion comes first.
#2

RTX 3090 (used)

full verdict →

24 GB · $700-1,000 (2026 used)

The single highest-leverage Ollama buy in 2026. 24 GB unlocks 70B Q4 + room for parallel multi-model serving.

Buy if
  • Ollama operators running 70B Q4 day-to-day
  • Multi-model setups (OLLAMA_KEEP_ALIVE on 2-3 small models)
  • Best $/GB-VRAM at the 24 GB tier
Skip if
  • Buyers who hate used silicon and want a warranty
  • Power-budget-constrained builds (350W TDP)
  • Sustained training workloads (Ada is more efficient per watt)

#3

RTX 4090

24 GB · $1,400-1,900 used / $1,800-2,200 new

The 'buy it and don't look back' 24 GB pick for Ollama. Mature CUDA stack; every Ollama feature works out of the box.

Buy if
  • Buyers wanting maximum 24 GB performance new with warranty
  • Long-context agent loops (1008 GB/s bandwidth holds up)
  • Single-card setups prioritizing ecosystem maturity
Skip if
  • Buyers willing to accept used silicon (3090 saves $400-700)
  • Multi-GPU rigs (used 3090×2 = 48 GB cheaper)
  • Stretching budget toward 32 GB on 5090

#4

RTX 5090

32 GB · $2,000-2,500 (2026 retail)

32 GB is overkill for any single Ollama model — but it's the right answer for parallel multi-model serving + 32K+ context windows.

Buy if
  • Production Ollama servers (multi-model, multi-user)
  • FP16 32B inference for evaluation workflows
  • 32K+ context agent loops without flash-attention compromises
Skip if
  • Solo Ollama users (4090 covers everything you need)
  • Multi-GPU operators (4-slot form factor brutal in consumer cases)
  • PSU-constrained builds (575W TDP needs 1000W+)
Honesty

Why benchmark numbers on this page might not reflect your real experience
  • tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
  • Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as KV cache fills.
  • Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
  • Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
  • Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
  • Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
  • Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.

We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via contact. See also our methodology and editorial philosophy.

How to think about VRAM tiers

Ollama auto-detects GPU and falls back to CPU silently if the model doesn't fit — which is why most 'Ollama is slow' tickets are actually 'wrong VRAM tier' tickets. Pick the tier that fits the largest model you'll actually run, with headroom for KV cache.

  • 8 GB: Ollama works but caps at 7B Q4 only. Model library is artificially limited.
  • 12 GB: 13B Q4 territory. Tight for KV cache at long context.
  • 16 GB: 13-32B Q4 comfortable; 70B Q4 short-context only. The minimum modern tier.
  • 24 GB (the Ollama sweet spot): 70B Q4 with 4-8K context. Multi-model serving with OLLAMA_KEEP_ALIVE on 2-3 small models. Most users land here.
  • 32 GB+: FP16 32B + 32K context windows. Production multi-model serving. Diminishing returns for solo workflows.
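The "headroom for KV cache" point is concrete arithmetic. A rough sketch, assuming 70B-class model geometry (80 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache); these numbers are illustrative, not read from Ollama:

```shell
# Approximate KV cache size:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * context
layers=80; kv_heads=8; head_dim=128; fp16_bytes=2; ctx=8192
kv_bytes=$((2 * layers * kv_heads * head_dim * fp16_bytes * ctx))
echo "KV cache at ${ctx} ctx: $((kv_bytes / 1024 / 1024)) MiB"
```

At 8K context that is roughly 2.5 GiB on top of the model weights, and it scales linearly with context, so a 32K window costs around 10 GiB. That is the headroom the tier list is pricing in.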

Compare these picks head-to-head

Frequently asked questions

Will Ollama use my GPU automatically?

Yes on NVIDIA with current drivers (CUDA), AMD with ROCm + gfx-version override, and Apple Silicon (Metal). Verify with `ollama ps` while a model is loaded — the PROCESSOR column shows GPU/CPU split. If it shows '100% GPU' you're good. If it shows CPU at all, the model didn't fit and Ollama fell back.
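A scripted version of that check, as a sketch (assumes `ollama` is on PATH and a model is currently loaded):

```shell
# Any row containing "CPU" in `ollama ps` indicates partial offload:
# the model didn't fully fit in VRAM and decode will be much slower.
# (The header reads "PROCESSOR", which never matches the grep.)
if ollama ps | grep -q 'CPU'; then
  echo "WARNING: partial CPU offload detected - check your VRAM tier"
else
  echo "OK: all loaded models are 100% GPU"
fi
```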

What's a good tok/s on Ollama with a 3090 / 4090?

Realistic ranges on RTX 3090 / 4090: 7B Q4 ~80-120 tok/s, 13B Q4 ~50-70, 32B Q4 ~25-35, 70B Q4 ~12-18. If you're 5-10x below these, Ollama has fallen back to CPU — usually because the model doesn't fit VRAM. See /troubleshooting/ollama-slow.
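One way to get your own number instead of trusting ours, sketched under assumptions: `ollama serve` reachable on the default port, `jq` installed, and a model tag you have pulled (the tag below is an example, substitute your own):

```shell
# /api/generate (non-streaming) reports eval_count (tokens generated)
# and eval_duration (nanoseconds spent decoding); divide to get tok/s.
resp=$(curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:70b","prompt":"Say hello.","stream":false}')
echo "$resp" | jq -r '"\(.eval_count / (.eval_duration / 1e9)) tok/s"'
```

Run it with a prompt length representative of your real workload; a one-line prompt measures best-case decode, per the caveats above.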

Can I run multiple models simultaneously on one GPU?

Yes via OLLAMA_KEEP_ALIVE (keeps models in VRAM longer) + OLLAMA_NUM_PARALLEL (concurrent requests per model). On a 24 GB card, two 13B Q4 models fit comfortably; three is tight. On a 16 GB card, only one 13B+ model at a time works reliably.
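A minimal serving setup along those lines, as a sketch (the values are starting points, not tuned recommendations):

```shell
# Keep loaded models resident for 30 minutes instead of the 5-minute default.
export OLLAMA_KEEP_ALIVE=30m
# Allow 2 concurrent requests per loaded model. Note this multiplies the
# KV cache allocation per model, so it costs VRAM too.
export OLLAMA_NUM_PARALLEL=2
# Cap how many distinct models may be resident at once, so a third
# request can't evict your two hot models.
export OLLAMA_MAX_LOADED_MODELS=2
# Then restart the server so the settings take effect:
# ollama serve
```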

Why is Ollama slower than llama.cpp directly on my GPU?

Ollama defaults are conservative: smaller batch, no flash-attention by default, modest context. Set OLLAMA_FLASH_ATTENTION=1 (cuts KV cache ~30%), tune num_thread for prefill, set num_ctx appropriately for your workload. Often closes 80%+ of the gap.
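Those flags in one place, as a sketch (the model tag and the `num_ctx`/`num_thread` values are examples; size them to your workload):

```shell
# Flash attention shrinks KV cache memory use and speeds up prefill.
export OLLAMA_FLASH_ATTENTION=1
# Per-request options: set the context you actually need rather than
# relying on the default, and pin CPU threads for any offloaded prefill.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize flash attention in one sentence.",
  "options": { "num_ctx": 8192, "num_thread": 8 },
  "stream": false
}' | jq -r '.response'
```

Remember `num_ctx` trades VRAM for window size; the KV cache grows linearly with it.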

Should I switch from Ollama to vLLM for production?

Yes if you're serving 10+ concurrent users or need maximum throughput. vLLM's paged KV cache + continuous batching outperforms Ollama at multi-user scale. Ollama wins for solo + dev workflows where simplicity matters more than peak throughput.

Does Ollama support AMD GPUs?

Yes via ROCm on Linux (and recent Windows ROCm). Set HSA_OVERRIDE_GFX_VERSION=11.0.0 for RDNA 3 cards (7900 XTX/XT). For older cards or Windows-native, llama.cpp's Vulkan backend is more reliable than Ollama's ROCm path.
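The override from the answer above, as a sketch (assumes a 7900 XTX / XT on Linux ROCm; the gfx version differs by GPU generation):

```shell
# RDNA 3 (gfx1100) cards often need an explicit gfx target before
# Ollama's ROCm path will detect them.
export HSA_OVERRIDE_GFX_VERSION=11.0.0
# Optionally restrict which ROCm device Ollama sees:
export ROCR_VISIBLE_DEVICES=0
# Then restart the server:
# ollama serve
```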

Go deeper

When it doesn't work

Hardware bought, set up correctly, still failing? The highest-volume local-AI errors and their fixes:

If this isn't the right fit

Common alternatives readers consider: