What can NVIDIA GeForce RTX 5090 run?
Build: RTX 5090 + Ryzen 9 9950X + 64GB DDR5
Runs comfortably: 34 models
Full-VRAM resident, with room for context. No compromises.
| Command | Quant | Context | VRAM | Headroom | TTFT | Speed |
|---|---|---|---|---|---|---|
| ollama run gemma3:1b | Q4_K_M | 8,192 | 11.1 GB | 20.9 GB | instant | 1,929 tok/s |
| ollama run llama3.2:1b | Q8_0 | 8,192 | 11.6 GB | 20.4 GB | instant | 1,096 tok/s |
| ollama run gemma4:e2b | Q8_0 | 8,192 | 13.2 GB | 18.8 GB | instant | 548 tok/s |
| ollama run llama3.2:3b | Q8_0 | 8,192 | 14.8 GB | 17.2 GB | instant | 365 tok/s |
| — | Q4_K_M | 8,192 | 14.8 GB | 17.2 GB | instant | 459 tok/s |
| ollama run phi3.5:3.8b | Q8_0 | 8,192 | 16.1 GB | 15.9 GB | instant | 288 tok/s |
| ollama run gemma4:e4b | Q8_0 | 8,192 | 16.5 GB | 15.5 GB | instant | 274 tok/s |
| ollama run qwen3:4b | Q8_0 | 8,192 | 16.5 GB | 15.5 GB | instant | 274 tok/s |
| ollama run gemma3:4b | Q8_0 | 8,192 | 16.5 GB | 15.5 GB | instant | 274 tok/s |
| — | Q4_K_M | 8,192 | 19.1 GB | 12.9 GB | fast | 241 tok/s |
| ollama run mistral:7b | Q5_K_M | 8,192 | 18.5 GB | 13.5 GB | fast | 242 tok/s |
| ollama run codegemma:7b | Q4_K_M | 8,192 | 17.9 GB | 14.1 GB | fast | 276 tok/s |
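"Comfortable" here means the quantized weights plus the KV cache for the listed context fit entirely in the card's 32 GB, with headroom to spare. A back-of-envelope version of that fit check can be sketched as follows; every constant (bits per weight, layer counts, head sizes, the fixed overhead) is an illustrative assumption, not the number this page's engine actually uses:

```python
# Rough VRAM estimate for a quantized model: weights + KV cache + overhead.
# All constants are illustrative assumptions, not this page's engine values.

def vram_estimate_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, ctx_len, overhead_gb=1.5):
    weights = params_b * 1e9 * bits_per_weight / 8          # bytes
    # K and V caches: 2 tensors x layers x context x kv_heads x head_dim,
    # stored here as fp16 (2 bytes per element).
    kv = 2 * n_layers * ctx_len * n_kv_heads * head_dim * 2
    return (weights + kv) / 1e9 + overhead_gb

# Hypothetical 8B model at ~4.5 bits/weight with an 8,192-token context:
est = vram_estimate_gb(8, 4.5, 32, 8, 128, 8192)
print(f"{est:.1f} GB")   # comfortably inside a 32 GB card
```

Doubling the context roughly doubles the KV-cache term, which is what the Headroom column is reserving room for.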
Runs with tradeoffs: 18 models
Tight VRAM, partial CPU offload, or context-limited.
| Command | Quant | Context | Memory | Headroom | TTFT | Speed | Notes |
|---|---|---|---|---|---|---|---|
| ollama run llama3.3:70b | Q5_K_M | 8,192 | 63.2 GB | 7.2 GB | noticeable | 1 tok/s | Partial CPU offload, ~49% of layers on CPU; CPU is the bottleneck, so more RAM bandwidth helps more than more VRAM |
| ollama run qwen3:32b | Q4_K_M | 2,048 | 28.1 GB | 3.9 GB | noticeable | 60 tok/s | Tight VRAM fit: only 3.9 GB of headroom for context growth |
| ollama run deepseek-r1:70b | Q4_K_M | 2,048 | 57.0 GB | 13.4 GB | noticeable | 2 tok/s | Partial CPU offload, ~44% of layers on CPU; CPU is the bottleneck, so more RAM bandwidth helps more than more VRAM |
| ollama run deepseek-r1:32b | Q4_K_M | 2,048 | 28.1 GB | 3.9 GB | noticeable | 60 tok/s | Tight VRAM fit: only 3.9 GB of headroom for context growth |
| ollama run llama3.1:70b | Q4_K_M | 2,048 | 57.0 GB | 13.4 GB | noticeable | 2 tok/s | Partial CPU offload, ~44% of layers on CPU; CPU is the bottleneck, so more RAM bandwidth helps more than more VRAM |
| ollama run mistral-nemo:12b | Q8_0 | 8,192 | 29.4 GB | 2.6 GB | fast | 91 tok/s | Tight VRAM fit: only 2.6 GB of headroom for context growth |
| ollama run qwen2.5:32b | Q4_K_M | 2,048 | 28.1 GB | 3.9 GB | noticeable | 60 tok/s | Tight VRAM fit: only 3.9 GB of headroom for context growth |
| ollama run qwq:32b | Q4_K_M | 2,048 | 28.1 GB | 3.9 GB | noticeable | 60 tok/s | Tight VRAM fit: only 3.9 GB of headroom for context growth |
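The gap between 60 tok/s and 1–2 tok/s in this tier comes from partial CPU offload: during decode, every active weight must be streamed once per generated token, so per-token time is roughly bytes-on-device divided by that device's memory bandwidth, summed over devices. A minimal sketch of that model, with both bandwidth figures as illustrative assumptions:

```python
# Why partial CPU offload dominates decode speed: per-token time is
# approximately bytes_on_device / device_bandwidth, summed over devices.
# Bandwidth numbers (GB/s) are illustrative assumptions.

def decode_tok_s(model_gb, gpu_frac, gpu_bw=1792, cpu_bw=80):
    gpu_time = model_gb * gpu_frac / gpu_bw        # seconds/token, GPU share
    cpu_time = model_gb * (1 - gpu_frac) / cpu_bw  # seconds/token, CPU share
    return 1 / (gpu_time + cpu_time)

# ~50 GB of weights for a hypothetical 70B quant, 51% resident on the GPU:
print(f"{decode_tok_s(50, 0.51):.1f} tok/s")  # the ~49% on CPU dominates
```

This is also why the notes say RAM bandwidth helps more than VRAM here: doubling `cpu_bw` in the sketch nearly doubles throughput, while the GPU term is already negligible.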
What if you upgraded?
Hypothetical scenarios. We re-ran the compatibility engine for each.
+32 GB system RAM
~$80–150
Doubles your CPU-offload working set. Helps when models don't quite fit in VRAM.
Unlocks 21 new tradeoff-tier models, including:
- Llama 4 Scout
- Llama 3.3 70B Instruct
- Qwen 3 32B
- DeepSeek R1 Distill Llama 70B
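The effect of a RAM upgrade follows directly from the memory budget this page uses for offload (stated in the "Won't run" section below as VRAM plus 60% of system RAM). A quick sketch of the before/after budget:

```python
# Memory budget for CPU-offloaded models: VRAM plus 60% of system RAM
# (the 60% fraction is this page's stated heuristic).

def offload_budget_gb(vram_gb, ram_gb, ram_frac=0.60):
    return vram_gb + ram_frac * ram_gb

print(f"{offload_budget_gb(32, 64):.1f} GB")   # today: 70.4 GB
print(f"{offload_budget_gb(32, 96):.1f} GB")   # with +32 GB RAM: 89.6 GB
```

That extra ~19 GB of budget is what moves 70B-class quants from "won't run" into the tradeoff tier.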
Upgrade to NVIDIA A100 40GB
see current pricing
40 GB of VRAM (vs. your 32 GB), plus a memory-bandwidth change from ~1792 GB/s to ~? GB/s.
Unlocks 11 new comfortable-tier models, including:
- DeepSeek Coder V2 Lite (16B)
- Mistral Nemo 12B Instruct
- Gemma 3 12B
- Pixtral 12B
Add a second NVIDIA GeForce RTX 5090
~$2499
Tensor parallelism splits each layer's weights across both cards, effectively doubling usable VRAM. Speed doesn't double, though: expect roughly 1.5× single-card throughput in practice.
Unlocks 13 new comfortable-tier models, including:
- DeepSeek Coder V2 Lite (16B)
- Mistral Nemo 12B Instruct
- Gemma 3 12B
- Pixtral 12B
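The "splits the model across both cards" part can be illustrated with a toy pure-Python sketch of one linear layer: the weight matrix is divided column-wise, each "card" computes its half of the output, and the halves are concatenated (an all-gather on real hardware). The shapes and values are made up for illustration:

```python
# Toy tensor-parallel linear layer: split the weight matrix column-wise
# across two "cards", compute partial outputs, then concatenate them.

def matvec(W, x):                       # y[j] = sum_i x[i] * W[i][j]
    return [sum(x[i] * row[j] for i, row in enumerate(W))
            for j in range(len(W[0]))]

x = [1.0, 2.0, 3.0]                     # one token's activations
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]                   # 3 inputs -> 4 outputs

W0 = [row[:2] for row in W]             # card 0 holds the first two columns
W1 = [row[2:] for row in W]             # card 1 holds the last two
y = matvec(W0, x) + matvec(W1, x)       # concatenate the partial outputs

assert y == matvec(W, x)                # same result, half the weights per card
```

Each card stores only half of every weight matrix, which is why usable VRAM roughly doubles; the per-layer gather step is communication overhead, which is why speed doesn't.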
Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.
Won't run: top 5 popular models
Need more memory than you have. Shown for orientation.
Even with CPU offload, needs more memory than your VRAM (32 GB) + 60% of system RAM (38 GB) combined.
How to read these numbers
Want a specific benchmark we don't have? Email benchmarks@runlocalai.co and we'll prioritize it.