Best GPU for Ollama
Honest 2026 GPU buyer guide for Ollama: 4060 Ti 16 GB, used 3090, 4090, 5090. Multi-model serving, FLASH_ATTENTION, real tok/s by tier.
The short answer
For most Ollama operators, a used RTX 3090 24 GB at $700-1,000 is the right answer. 24 GB unlocks 70B Q4 with comfortable context, and Ollama's auto-CUDA-detect just works.
If you want new with warranty and your daily workload caps at 13-32B Q4, the RTX 4060 Ti 16 GB at $450-550 is the value entry. The RTX 4090 is the buy-and-don't-look-back 24 GB pick.
Ollama-specific reframe: the 32 GB on an RTX 5090 isn't about running bigger models — it's about running multiple models concurrently with OLLAMA_KEEP_ALIVE + OLLAMA_NUM_PARALLEL without VRAM thrashing.
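If concurrent multi-model serving is the reason you're shopping a 24 GB or 32 GB card, these are the knobs in question. A minimal sketch for a systemd-managed Linux install; the values are illustrative, and OLLAMA_MAX_LOADED_MODELS (not mentioned above) is the companion variable that caps how many models may sit in VRAM at once:

```bash
# Persist serving knobs on the Ollama service (systemd-managed Linux install assumed)
sudo systemctl edit ollama
# In the override that opens, add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=1h"         # keep models resident after each request
#   Environment="OLLAMA_NUM_PARALLEL=2"        # concurrent requests per loaded model
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"   # how many models may occupy VRAM at once
sudo systemctl restart ollama
```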
The picks, ranked by buyer-leverage
RTX 4060 Ti · 16 GB · $450-550 (2026 retail)
The cheapest CUDA card Ollama can drive seriously. 13-32B Q4 comfortable; 70B Q4 fits at short context only.
Best for:
- First-time Ollama users wanting CUDA + warranty
- Single-model workflows (one 13-32B model loaded)
- Builds prioritizing efficiency (165W TDP, quiet operation)
Not for:
- Multi-model serving (16 GB blocks OLLAMA_KEEP_ALIVE on multiple 13B+ models)
- 70B inference at usable context (use 24 GB)
- Long agent loops (288 GB/s bandwidth bottlenecks decode)
RTX 3090 · 24 GB · $700-1,000 (2026 used)
The single highest-leverage Ollama buy in 2026. 24 GB unlocks 70B Q4 + room for parallel multi-model serving.
Best for:
- Ollama operators running 70B Q4 day-to-day
- Multi-model setups (OLLAMA_KEEP_ALIVE on 2-3 small models)
- Best $/GB-VRAM at the 24 GB tier
Not for:
- Buyers who hate used silicon and want a warranty
- Power-budget-constrained builds (350W TDP)
- Sustained training workloads (Ada is more efficient per watt)
RTX 4090 · 24 GB · $1,400-1,900 used / $1,800-2,200 new
The 'buy it and don't look back' 24 GB pick for Ollama. Mature CUDA stack, every Ollama feature works flawlessly.
Best for:
- Buyers wanting maximum 24 GB performance new with warranty
- Long-context agent loops (1008 GB/s bandwidth holds up)
- Single-card setups prioritizing ecosystem maturity
Not for:
- Buyers willing to accept used silicon (3090 saves $400-700)
- Multi-GPU rigs (used 3090×2 = 48 GB cheaper)
- Stretching budget toward 32 GB on 5090
RTX 5090 · 32 GB · $2,000-2,500 (2026 retail)
32 GB is overkill for any single Ollama model — but it's the right answer for parallel multi-model serving + 32K+ context windows.
Best for:
- Production Ollama servers (multi-model, multi-user)
- FP16 32B inference for evaluation workflows
- 32K+ context agent loops without flash-attention compromises
Not for:
- Solo Ollama users (4090 covers everything you need)
- Multi-GPU operators (4-slot form factor brutal in consumer cases)
- PSU-constrained builds (575W TDP needs 1000W+)
Honesty: Why benchmark numbers on this page might not reflect your real experience
- tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
- Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as KV cache fills.
- Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
- Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
- Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
- Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
- Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.
We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via the contact page. See also our methodology and editorial philosophy.
How to think about VRAM tiers
Ollama auto-detects GPU and falls back to CPU silently if the model doesn't fit — which is why most 'Ollama is slow' tickets are actually 'wrong VRAM tier' tickets. Pick the tier that fits the largest model you'll actually run, with headroom for KV cache.
- 8 GB — Ollama works but caps at 7B Q4 only. Model library is artificially limited.
- 12 GB — 13B Q4 territory. Tight for KV cache at long context.
- 16 GB — 13-32B Q4 comfortable; 70B Q4 short-context only. The minimum modern tier.
- 24 GB (the Ollama sweet spot) — 70B Q4 with 4-8K context. Multi-model serving with OLLAMA_KEEP_ALIVE on 2-3 small models. Most users land here.
- 32 GB+ — FP16 32B + 32K context windows. Production multi-model serving. Diminishing returns for solo workflows.
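Before committing to a tier, a quick sanity check is to compare a model's on-disk size (a rough floor for the VRAM its weights will occupy) against the VRAM you actually have free. A sketch assuming an NVIDIA card and a model already pulled:

```bash
# On-disk size is a rough floor for the VRAM the weights will need; leave headroom for KV cache
ollama list

# Free VRAM right now, in MiB (NVIDIA cards)
nvidia-smi --query-gpu=memory.free --format=csv,noheader
```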
Compare these picks head-to-head
Same 24 GB. Used vs new — when each wins for Ollama specifically.
When 32 GB matters for Ollama (hint: parallel multi-model).
Used 24 GB vs new 16 GB — Ollama's VRAM ceiling decides.
Both 16 GB — bandwidth advantage on Ollama decode.
Frequently asked questions
Will Ollama use my GPU automatically?
Yes on NVIDIA with current drivers (CUDA), AMD with ROCm + gfx-version override, and Apple Silicon (Metal). Verify with `ollama ps` while a model is loaded — the PROCESSOR column shows GPU/CPU split. If it shows '100% GPU' you're good. If it shows CPU at all, the model didn't fit and Ollama fell back.
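For reference, a sketch of what that check looks like; the model, ID, and size shown are illustrative, not measured:

```bash
ollama ps
# NAME           ID              SIZE      PROCESSOR    UNTIL
# llama3.1:8b    42182419e950    6.7 GB    100% GPU     4 minutes from now
#
# Anything other than "100% GPU" (e.g. "45%/55% CPU/GPU") means part of the model spilled to system RAM.
```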
What's a good tok/s on Ollama with a 3090 / 4090?
Realistic ranges on RTX 3090 / 4090: 7B Q4 ~80-120 tok/s, 13B Q4 ~50-70, 32B Q4 ~25-35, 70B Q4 ~12-18. If you're 5-10x below these, Ollama has fallen back to CPU — usually because the model doesn't fit VRAM. See /troubleshooting/ollama-slow.
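You can measure your own decode speed in one command; the --verbose flag prints timing stats after the response (the rate in the comment below is illustrative):

```bash
ollama run llama3.1:8b --verbose "Explain KV caching in two sentences."
# ...response...
# eval rate:            97.41 tokens/s    <- decode speed; compare against the ranges above
```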
Can I run multiple models simultaneously on one GPU?
Yes via OLLAMA_KEEP_ALIVE (keeps models in VRAM longer) + OLLAMA_NUM_PARALLEL (concurrent requests per model). On a 24 GB card, two 13B Q4 models fit comfortably; three is tight. On a 16 GB card, only one 13B+ model at a time works reliably.
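Those variables are set server-wide, but keep_alive can also be overridden per request through Ollama's HTTP API. A minimal sketch against a default local server (the model name is just an example):

```bash
# Ask the server to keep this model resident for 30 minutes after the request completes
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "warm-up",
  "keep_alive": "30m"
}'
```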
Why is Ollama slower than llama.cpp directly on my GPU?
Ollama's defaults are conservative: smaller batch, flash attention off by default, modest context. Set OLLAMA_FLASH_ATTENTION=1 (which also unlocks KV-cache quantization to shrink cache memory), set num_ctx to match your workload, and tune num_thread if any layers spill to CPU. That often closes 80%+ of the gap.
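A sketch of those tweaks for a manually started server; OLLAMA_KV_CACHE_TYPE isn't mentioned above, but it's the variable that actually shrinks the KV cache once flash attention is on, and the Modelfile route bakes a larger context into a named variant instead of passing num_ctx per request:

```bash
# Server side: enable flash attention, optionally quantize the KV cache (needs flash attention)
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve

# Client side: bake a larger context window into a model variant
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8k -f Modelfile
```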
Should I switch from Ollama to vLLM for production?
Yes if you're serving 10+ concurrent users or need maximum throughput. vLLM's paged KV cache + continuous batching outperforms Ollama at multi-user scale. Ollama wins for solo + dev workflows where simplicity matters more than peak throughput.
Does Ollama support AMD GPUs?
Yes via ROCm on Linux (and recent Windows ROCm). Set HSA_OVERRIDE_GFX_VERSION=11.0.0 for RDNA 3 cards (7900 XTX/XT). For older cards or Windows-native, llama.cpp's Vulkan backend is more reliable than Ollama's ROCm path.
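If your card does need the override, it has to be visible to the Ollama server process, not just your interactive shell. A minimal sketch for a manually started server (systemd installs would set it with an Environment= line in the service override instead):

```bash
# Make the gfx override visible to the server process, then start it
export HSA_OVERRIDE_GFX_VERSION=11.0.0
ollama serve
```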
Go deeper
- Best GPU for local AI (pillar) — All picks ranked across runtime ecosystems
- Best GPU for KoboldCpp — Sibling runtime — different VRAM/feature tradeoffs
- Best local AI setup for beginners — First-week walkthrough — model + runtime + hardware
- 16 GB vs 24 GB VRAM — Whether the extra 8 GB pays for Ollama
- Best used GPU — Used 3090 — the value pick that runs everything Ollama can throw at it
- Ollama running slow / on CPU — Diagnose the silent CPU fallback in 3 minutes
- Ollama port conflict — When Ollama won't start
- Will it run on my hardware? — Pre-purchase compatibility check
When it doesn't work
Hardware bought, set up correctly, still failing? The highest-volume local-AI errors and their fixes:
Common alternatives readers consider:
- If your budget is tighter → best budget GPU for local AI
- If you'd rather buy used → best used GPU for local AI
- If you're on Apple Silicon → best Mac for local AI
- If you're not sure what fits your build → the will-it-run checker
- If you don't want to buy anything yet → our editorial philosophy