What can NVIDIA GeForce RTX 4090 run for agents?

Build: NVIDIA GeForce RTX 4090 + 32 GB RAM (Windows)

Memory: 24 GB VRAM + 32 GB system RAM
Runner: llama.cpp / Ollama (CUDA)

Runs comfortably
20 models

Ranked by fit for the agent use case plus predicted speed. Click a row for a VRAM breakdown.

#1 Gemma 3 1B
1B
gemma
Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 11.1 GB · Headroom: 12.9 GB · TTFT: instant
ollama run gemma3:1b
1085
tok/s
E
Weights
0.60 GB
KV cache
0.50 GB
Activations
8.22 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~31 ms (instant)
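The VRAM figure on each card is just the sum of the four breakdown components, with headroom being what remains of the 24 GB card. A minimal sketch using this entry's own numbers:

```python
# Rough VRAM budget for the #1 entry (Gemma 3 1B, Q4_K_M, 8,192-token context),
# summing the same four components the breakdown lists.
components = {
    "weights": 0.60,      # quantized weights, GB
    "kv_cache": 0.50,     # KV cache at 8,192-token context, GB
    "activations": 8.22,  # inference-time activation/compute buffers, GB
    "runtime": 1.80,      # CUDA context + runner overhead, GB
}
total = sum(components.values())
headroom = 24.0 - total   # 24 GB card
print(f"total {total:.1f} GB, headroom {headroom:.1f} GB")
```

The same arithmetic applies to every card below; only the component values change.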
#2 Llama 3.2 1B Instruct
1B
llama
Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 11.6 GB · Headroom: 12.4 GB · TTFT: instant
ollama run llama3.2:1b
617
tok/s
E
Weights
1.06 GB
KV cache
0.50 GB
Activations
8.25 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~31 ms (instant)
#3 Gemma 3n E2B (Effective 2B)
2B
gemma
Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 13.2 GB · Headroom: 10.8 GB · TTFT: instant
ollama run gemma3n:e2b
308
tok/s
E
Weights
2.13 GB
KV cache
1.00 GB
Activations
8.30 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~62 ms (instant)
#4 Phi-3.5 Vision
4.2B
phi
Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 14.8 GB · Headroom: 9.2 GB · TTFT: fast
258
tok/s
E
Weights
2.54 GB
KV cache
2.10 GB
Activations
8.32 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~130 ms (fast)
#5 Llama 3.2 3B Instruct
3B
llama
Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 14.8 GB · Headroom: 9.2 GB · TTFT: instant
ollama run llama3.2:3b
206
tok/s
E
Weights
3.19 GB
KV cache
1.50 GB
Activations
8.35 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~93 ms (instant)
#6 Mistral 7B Instruct v0.3
7B
mistral
Commercial OK
Quant: Q5_K_M · Context: 8,192 · VRAM: 18.5 GB · Headroom: 5.5 GB · TTFT: fast
ollama run mistral:7b
136
tok/s
E
Weights
4.81 GB
KV cache
3.50 GB
Activations
8.43 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~217 ms (fast)
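KV-cache size scales linearly with context length, so it is the component to watch when growing past 8,192 tokens. A generic sizing formula, sketched with Mistral 7B's published architecture (32 layers, 8 KV heads via GQA, head dim 128) and an assumed FP16 cache; note this page's estimator evidently uses different assumptions, since its 3.50 GB figure for this entry is larger:

```python
# KV cache = 2 (K and V) x layers x KV heads x head_dim x context x bytes/elem.
# Mistral 7B config: 32 layers, 8 KV heads (GQA), head_dim 128.
# FP16 cache entries (2 bytes/element) are an assumption; runners vary.
n_layers, n_kv_heads, head_dim = 32, 8, 128
context, bytes_per_elem = 8192, 2
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem
print(f"{kv_bytes / 2**30:.2f} GiB")
```

Whatever the estimator's margin, the linear dependence on `context` is what matters: halving the context halves this component.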
#7 Llama 3.1 Nemotron Nano 8B
8B
llama
Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 19.1 GB · Headroom: 4.9 GB · TTFT: fast
136
tok/s
E
Weights
4.83 GB
KV cache
4.00 GB
Activations
8.43 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~248 ms (fast)
#8 Qwen 3 4B
4B
qwen
Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 16.5 GB · Headroom: 7.5 GB · TTFT: fast
ollama run qwen3:4b
154
tok/s
E
Weights
4.25 GB
KV cache
2.00 GB
Activations
8.40 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~124 ms (fast)
#9 Phi-3.5 Mini Instruct
3.8B
phi
Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 16.1 GB · Headroom: 7.9 GB · TTFT: fast
ollama run phi3.5:3.8b
162
tok/s
E
Weights
4.04 GB
KV cache
1.90 GB
Activations
8.39 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~118 ms (fast)
#10 Qwen 2.5 7B Instruct
7B
qwen
Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 18.3 GB · Headroom: 5.7 GB · TTFT: fast
ollama run qwen2.5:7b
88
tok/s
E
Weights
7.44 GB
KV cache
0.47 GB
Activations
8.56 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~217 ms (fast)
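The TTFT figures are modeled as prompt-prefill time: tokens in the prompt divided by the prefill rate. Inverting this card's ~217 ms for a 512-token prompt gives the implied prefill throughput (a back-of-envelope reading of the page's numbers, not a measured value):

```python
# TTFT ≈ prompt_tokens / prefill_rate, so the listed figure implies a rate.
prompt_tokens = 512
ttft_s = 0.217            # ~217 ms from this card
prefill_tps = prompt_tokens / ttft_s
print(f"implied prefill rate ≈ {prefill_tps:.0f} tok/s")
```

Doubling the prompt roughly doubles TTFT under this model, which is why long agent system prompts push cards from "instant" toward "noticeable".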
#11 CodeGemma 7B
7B
gemma
Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 17.9 GB · Headroom: 6.1 GB · TTFT: fast
ollama run codegemma:7b
155
tok/s
E
Weights
4.23 GB
KV cache
3.50 GB
Activations
8.40 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~217 ms (fast)
#12 Gemma 3n E4B (Effective 4B)
4B
gemma
Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 16.5 GB · Headroom: 7.5 GB · TTFT: fast
ollama run gemma3n:e4b
154
tok/s
E
Weights
4.25 GB
KV cache
2.00 GB
Activations
8.40 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~124 ms (fast)

Runs with tradeoffs
26 models

Tight VRAM fit, partial CPU offload, or limited context.

Hermes 3 Llama 3.1 8B
8B
hermes
Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 22.9 GB · Headroom: 1.1 GB · TTFT: fast
  • Tight VRAM fit — only 1.1 GB headroom left for context growth
ollama run hermes3:8b
77
tok/s
E
Weights
8.50 GB
KV cache
4.00 GB
Activations
8.62 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~248 ms (fast)
Gemma 2 9B Instruct
9B
gemma
Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 20.2 GB · Headroom: 3.8 GB · TTFT: fast
  • Tight VRAM fit — only 3.8 GB headroom left for context growth
ollama run gemma2:9b
121
tok/s
E
Weights
5.43 GB
KV cache
4.50 GB
Activations
8.46 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~279 ms (fast)
Dolphin 3.0 Mistral 24B
24B
dolphin
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 22.1 GB · Headroom: 1.9 GB · TTFT: noticeable
  • Tight VRAM fit — only 1.9 GB headroom left for context growth
ollama run dolphin-mistral:24b
45
tok/s
E
Weights
14.49 GB
KV cache
3.00 GB
Activations
2.77 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~744 ms (noticeable)
Qwen 3 30B-A3B
30B
qwen
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 26.6 GB · Headroom: 16.6 GB · TTFT: noticeable
  • Partial CPU offload: ~10% of layers run on CPU
ollama run qwen3:30b
36
tok/s
E
Weights
18.11 GB
KV cache
3.75 GB
Activations
2.95 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~930 ms (noticeable)
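A ~10% CPU offload means keeping roughly 90% of the transformer layers on the GPU. A minimal sketch of the split; the layer count here is an assumed illustration, not this model's actual figure:

```python
import math

# If ~10% of layers run on CPU, the rest stay on the GPU.
n_layers = 48          # assumed layer count, for illustration only
cpu_fraction = 0.10    # from the card: ~10% of layers on CPU
gpu_layers = math.floor(n_layers * (1 - cpu_fraction))
print(gpu_layers)      # layers to keep on the GPU
```

With llama.cpp directly you would pass the result via `-ngl`/`--n-gpu-layers`; Ollama chooses the split automatically, and it can be overridden with the `num_gpu` model parameter.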
Qwen 2.5 Coder 32B Instruct
32B
qwen
Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 32.4 GB · Headroom: 10.8 GB · TTFT: noticeable
  • Partial CPU offload: ~26% of layers run on CPU
ollama run qwen2.5-coder:32b
34
tok/s
E
Weights
19.32 GB
KV cache
2.15 GB
Activations
9.16 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~992 ms (noticeable)
Qwen 3 32B
32B
qwen
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 28.1 GB · Headroom: 15.1 GB · TTFT: noticeable
  • Partial CPU offload: ~15% of layers run on CPU
ollama run qwen3:32b
34
tok/s
E
Weights
19.32 GB
KV cache
4.00 GB
Activations
3.01 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~992 ms (noticeable)
Qwen 2.5 32B Instruct
32B
qwen
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 28.1 GB · Headroom: 15.1 GB · TTFT: noticeable
  • Partial CPU offload: ~15% of layers run on CPU
ollama run qwen2.5:32b
34
tok/s
E
Weights
19.32 GB
KV cache
4.00 GB
Activations
3.01 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~992 ms (noticeable)
Nemotron 3 Nano (30B-A3B)
30B
other
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 26.6 GB · Headroom: 16.6 GB · TTFT: noticeable
  • Partial CPU offload: ~10% of layers run on CPU
ollama run nemotron3:nano
36
tok/s
E
Weights
18.11 GB
KV cache
3.75 GB
Activations
2.95 GB
Runtime
1.80 GB
Time to first token (prefill, 512-token prompt): ~930 ms (noticeable)

What if you upgraded?

Hypothetical scenarios. We re-ran the compatibility engine for each.

+32 GB system RAM

~$80–150

Doubles your CPU-offload working set. Helps when models don't quite fit in VRAM.

Unlocks: 32 new "runs with tradeoffs" models

  • Qwen 3 30B-A3B
  • Qwen 2.5 Coder 32B Instruct
  • Llama 3.3 70B Instruct
  • Qwen 3 32B
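The gain is easy to quantify with the 60%-of-RAM offload budget this page applies in its "won't run" rule (the 0.6 fraction is the page's stated rule, not a universal constant):

```python
# Offload budget = VRAM + 60% of system RAM (the rule this page applies).
def total_budget(ram_gb, vram_gb=24, ram_fraction=0.6):
    return vram_gb + ram_fraction * ram_gb

before = total_budget(32)   # current build
after = total_budget(64)    # after adding 32 GB
print(f"{before:.1f} GB -> {after:.1f} GB")
```

That extra ~19 GB of budget is what moves several 30B-class models from "won't run" into the tradeoff tier.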

Upgrade to NVIDIA RTX 5000 Ada Generation

see current pricing

32 GB VRAM (vs your 24 GB). Note that its memory bandwidth differs from the 4090's ~1008 GB/s, which affects decode speed.

Unlocks: 15 new "runs comfortably" models

  • DeepSeek R1 Distill Qwen 7B
  • Qwen 3 8B
  • Hermes 3 Llama 3.1 8B
  • Gemma 2 9B Instruct

Add a second NVIDIA GeForce RTX 4090

~$1899

Tensor parallelism splits the model across both cards, effectively doubling VRAM. Bandwidth doesn't double — runs ~1.5× the single-card speed in practice.

Unlocks: 16 new "runs comfortably" models

  • DeepSeek R1 Distill Qwen 7B
  • Qwen 3 8B
  • Hermes 3 Llama 3.1 8B
  • Gemma 2 9B Instruct
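The ~1.5× figure follows from a simple bandwidth-bound decode model: each generated token requires reading the weights, so tok/s ≈ bandwidth / bytes read per token, and with tensor parallelism each card reads half the weights but pays synchronization overhead. The 0.75 efficiency below is an assumed figure chosen to match the ~1.5× claim, not a measurement:

```python
# Bandwidth-bound decode model: tok/s ≈ memory bandwidth / weight bytes.
bandwidth_gbs = 1008          # RTX 4090 memory bandwidth, GB/s
weights_gb = 19.32            # e.g. a 32B model at Q4_K_M (from this page)
single = bandwidth_gbs / weights_gb
dual = 2 * single * 0.75      # 0.75 = assumed tensor-parallel efficiency
print(f"{single:.0f} -> {dual:.0f} tok/s ({dual / single:.2f}x)")
```

The same model explains why bandwidth, not compute, dominates single-stream generation speed on these cards.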

Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.

Won't run
top 5 popular models

Need more memory than you have. Shown for orientation.

Qwen 3 235B-A22B
235B
qwen
Commercial OK

Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.

Llama 4 Scout
109B
llama
Commercial OK

Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.

DeepSeek R1 (671B reasoning)
671B
deepseek
Commercial OK

Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.

Llama 3.3 70B Instruct
70B
llama
Commercial OK

Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.

DeepSeek R1 Distill Llama 70B
70B
deepseek
Commercial OK

Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
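The rule behind every entry above can be checked directly. The ~0.6 GB per billion parameters for Q4_K_M weights is derived from this page's own 32B entry (19.32 GB / 32 ≈ 0.60) and is an estimate, not a spec; it also ignores KV cache and runtime overhead, so it understates the real requirement:

```python
# "Won't run" rule: needed memory must fit in VRAM + 60% of system RAM.
def fits_with_offload(params_b, vram_gb=24, ram_gb=32,
                      gb_per_b=0.6, ram_fraction=0.6):
    needed = params_b * gb_per_b              # rough Q4_K_M weight size
    budget = vram_gb + ram_fraction * ram_gb  # 24 + 19.2 = ~43 GB here
    return needed <= budget

print(fits_with_offload(235))  # Qwen 3 235B-A22B
print(fits_with_offload(32))   # a 32B model
```

Note that MoE models like the 235B-A22B still need their full weights resident even though only some experts are active per token, which is why the total parameter count is what the rule budgets for.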

How to read these numbers

M
Measured — we ran this exact combo on owner hardware.

~
Extrapolated — predicted from a measured benchmark on similar-bandwidth hardware.

E
Estimated — pure formula based on VRAM bandwidth and model architecture.

Full methodology →

Want a specific benchmark we don't have? Email benchmarks@runlocalai.co and we'll prioritize it.