What can NVIDIA GeForce RTX 5090 run for vision?

Build: NVIDIA GeForce RTX 5090 + 32 GB RAM (Windows)

Memory: 32 GB VRAM + 32 GB system RAM
Runner: llama.cpp / Ollama (CUDA)

Runs comfortably
9 models

Ranked by fit for the vision use case and predicted speed. Click a row for a VRAM breakdown.

#1 Gemma 4 E2B (Effective 2B)
2B · gemma · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 13.2 GB · Headroom: 18.8 GB · TTFT: instant
ollama run gemma4:e2b
548 tok/s (E)
VRAM breakdown: Weights 2.13 GB · KV cache 1.00 GB · Activations 8.30 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~41 ms (instant)
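The per-row VRAM figure is just the four breakdown components summed. A quick sanity check in Python, using the figures from the Gemma 4 E2B row:

```python
# VRAM estimate for a model row: the four components simply add up.
# Numbers are the Gemma 4 E2B row above.
components = {
    "weights_gb": 2.13,      # Q8_0 weight file resident in VRAM
    "kv_cache_gb": 1.00,     # key/value cache at 8,192-token context
    "activations_gb": 8.30,  # per-batch activation buffers (vision towers are large)
    "runtime_gb": 1.80,      # CUDA context + runtime overhead
}

total_vram_gb = round(sum(components.values()), 1)
headroom_gb = round(32.0 - sum(components.values()), 1)  # RTX 5090 has 32 GB VRAM
print(total_vram_gb, headroom_gb)  # 13.2 18.8, matching the row
```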
#2 Phi-3.5 Vision
4.2B · phi · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 14.8 GB · Headroom: 17.2 GB · TTFT: instant
459 tok/s (E)
VRAM breakdown: Weights 2.54 GB · KV cache 2.10 GB · Activations 8.32 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~86 ms (instant)
#3 Gemma 4 E4B (Effective 4B)
4B · gemma · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 16.5 GB · Headroom: 15.5 GB · TTFT: instant
ollama run gemma4:e4b
274 tok/s (E)
VRAM breakdown: Weights 4.25 GB · KV cache 2.00 GB · Activations 8.40 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~82 ms (instant)
#4 Gemma 3 4B
4B · gemma · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 16.5 GB · Headroom: 15.5 GB · TTFT: instant
ollama run gemma3:4b
274 tok/s (E)
VRAM breakdown: Weights 4.25 GB · KV cache 2.00 GB · Activations 8.40 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~82 ms (instant)
#5 Llama 3.2 11B Vision Instruct
11B · llama · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 27.8 GB · Headroom: 4.2 GB · TTFT: fast
ollama run llama3.2-vision:11b
100 tok/s (E)
VRAM breakdown: Weights 11.69 GB · KV cache 5.50 GB · Activations 8.78 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~225 ms (fast)
#6 Gemma 4 26B MoE
26B · gemma · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 23.6 GB · Headroom: 8.4 GB · TTFT: noticeable
ollama run gemma4:26b-moe
74 tok/s (E)
VRAM breakdown: Weights 15.70 GB · KV cache 3.25 GB · Activations 2.83 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~532 ms (noticeable)
#7 Gemma 3 27B
27B · gemma · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 24.3 GB · Headroom: 7.7 GB · TTFT: noticeable
ollama run gemma3:27b
71 tok/s (E)
VRAM breakdown: Weights 16.30 GB · KV cache 3.38 GB · Activations 2.86 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~553 ms (noticeable)
#8 MedGemma 27B
27B · gemma
Quant: Q4_K_M · Context: 2,048 · VRAM: 24.3 GB · Headroom: 7.7 GB · TTFT: noticeable
71 tok/s (E)
VRAM breakdown: Weights 16.30 GB · KV cache 3.38 GB · Activations 2.86 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~553 ms (noticeable)
#9 Gemma 4 31B Dense
31B · gemma · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 27.4 GB · Headroom: 4.6 GB · TTFT: noticeable
ollama run gemma4:31b
62 tok/s (E)
VRAM breakdown: Weights 18.72 GB · KV cache 3.88 GB · Activations 2.98 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~635 ms (noticeable)

Runs with tradeoffs
2 models

Tight VRAM, partial CPU offload, or context-limited.

Gemma 3 12B
12B · gemma · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 29.4 GB · Headroom: 2.6 GB · TTFT: fast
  • Tight VRAM fit: only 2.6 GB of headroom left for context growth
ollama run gemma3:12b
91 tok/s (E)
VRAM breakdown: Weights 12.75 GB · KV cache 6.00 GB · Activations 8.83 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~246 ms (fast)
Pixtral 12B
12B · mistral · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 29.4 GB · Headroom: 2.6 GB · TTFT: fast
  • Tight VRAM fit: only 2.6 GB of headroom left for context growth
ollama run pixtral:12b
91 tok/s (E)
VRAM breakdown: Weights 12.75 GB · KV cache 6.00 GB · Activations 8.83 GB · Runtime 1.80 GB
Time to first token (prefill, 512-token prompt): ~246 ms (fast)
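The "headroom left for context growth" warnings trace back to the KV cache, which grows linearly with context length. A sketch of the standard sizing formula; the architecture numbers here are hypothetical (roughly Llama-8B-class defaults), not the actual Gemma 3 12B or Pixtral 12B configs:

```python
def kv_cache_gb(context_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * KV heads * head dim * context * element size.

    Defaults are a hypothetical fp16 architecture, for illustration only.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

print(kv_cache_gb(8_192))   # ~1.07 GB at 8K context
print(kv_cache_gb(32_768))  # ~4.29 GB at 32K: 4x the context, 4x the cache
```

Quantizing the KV cache (e.g. `bytes_per_elem=1` for 8-bit) halves this, which is one way to stretch a tight fit.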

What if you upgraded?

Hypothetical scenarios. We re-ran the compatibility engine for each.

+32 GB system RAM

~$80–150

Doubles your CPU-offload working set. Helps when models don't quite fit in VRAM.

Unlocks: 25 new comfortable, 18 new tradeoff

  • Gemma 3 1B
  • Llama 3.2 1B Instruct
  • Llama 3.2 3B Instruct
  • Phi-3.5 Mini Instruct

Upgrade to NVIDIA A100 40GB

see current pricing

40 GB VRAM (vs your 32 GB). Note that the A100 40GB's memory bandwidth (~1,555 GB/s) is actually lower than the 5090's ~1,792 GB/s, so larger models fit but tokens won't generate faster.

Unlocks: 34 new comfortable

  • Gemma 3 1B
  • Llama 3.2 1B Instruct
  • Llama 3.2 3B Instruct
  • Phi-3.5 Mini Instruct

Add a second NVIDIA GeForce RTX 5090

~$2499

Tensor parallelism splits the model across both cards, effectively doubling VRAM. Bandwidth doesn't double — runs ~1.5× the single-card speed in practice.

Unlocks: 38 new comfortable

  • Gemma 3 1B
  • Llama 3.2 1B Instruct
  • Llama 3.2 3B Instruct
  • Phi-3.5 Mini Instruct
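Under the page's own assumptions (2× VRAM, ~1.5× throughput), the dual-card scenario works out like this for the slowest comfortable model above; a sketch, not a measured result:

```python
# Dual RTX 5090 under tensor parallelism: VRAM pools across both cards,
# but throughput scales sub-linearly (the page assumes ~1.5x, not 2x).
single_vram_gb, single_tok_s = 32, 62  # Gemma 4 31B Dense row above
dual_vram_gb = 2 * single_vram_gb      # 64 GB effective capacity
dual_tok_s = 1.5 * single_tok_s        # ~93 tok/s, not 124
print(dual_vram_gb, dual_tok_s)
```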

Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.

Won't run
top 5 popular models

Need more memory than you have. Shown for orientation.

Qwen 3 235B-A22B
235B · qwen · Commercial OK

Even with CPU offload, needs more memory than your VRAM (32 GB) + 60% of system RAM (19 GB) combined.

Llama 4 Scout
109B · llama · Commercial OK

Even with CPU offload, needs more memory than your VRAM (32 GB) + 60% of system RAM (19 GB) combined.

DeepSeek R1 (671B reasoning)
671B · deepseek · Commercial OK

Even with CPU offload, needs more memory than your VRAM (32 GB) + 60% of system RAM (19 GB) combined.

DeepSeek R1 Distill Llama 70B
70B · deepseek · Commercial OK

Even with CPU offload, needs more memory than your VRAM (32 GB) + 60% of system RAM (19 GB) combined.

GLM-5
200B · other · Commercial OK

Even with CPU offload, needs more memory than your VRAM (32 GB) + 60% of system RAM (19 GB) combined.
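The rejection rule above reduces to a one-line budget check. A sketch; the 130 GB figure below is a hypothetical requirement for a very large quantized model, not the site's number for any specific entry:

```python
def fits_with_offload(required_gb, vram_gb=32.0, system_ram_gb=32.0, ram_fraction=0.6):
    """The page's won't-run rule: a model must fit in VRAM plus 60% of system RAM."""
    budget_gb = vram_gb + ram_fraction * system_ram_gb
    return required_gb <= budget_gb, budget_gb

ok, budget = fits_with_offload(27.4)  # Gemma 4 31B Dense's total VRAM need: fits
print(ok, round(budget, 1))           # True 51.2
ok, _ = fits_with_offload(130.0)      # hypothetical ~130 GB requirement: rejected
print(ok)                             # False
```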

How to read these numbers

M (Measured): we ran this exact combo on owner hardware.
~ (Extrapolated): predicted from a measured benchmark on similar-bandwidth hardware.
E (Estimated): pure formula based on VRAM bandwidth and model architecture.
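For E-class rows, a bandwidth-based estimate typically divides memory bandwidth by the bytes read per generated token (decode is memory-bound, so every token re-reads the weights). A sketch; the 0.65 efficiency factor is an assumption that happens to reproduce the table's numbers, not necessarily the site's actual formula:

```python
def estimated_tok_s(weights_gb, bandwidth_gb_s=1792.0, efficiency=0.65):
    """Rough decode speed for a memory-bound model: each generated token
    re-reads the full weight file, discounted by an assumed efficiency factor."""
    return efficiency * bandwidth_gb_s / weights_gb

print(round(estimated_tok_s(2.13)))   # ~547 tok/s (table lists 548 for Gemma 4 E2B)
print(round(estimated_tok_s(18.72)))  # ~62 tok/s (table lists 62 for Gemma 4 31B Dense)
```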

Full methodology →

Want a specific benchmark we don't have? Email benchmarks@runlocalai.co and we'll prioritize it.