What can NVIDIA GeForce RTX 3090 run for reasoning?
Build: NVIDIA GeForce RTX 3090 + — + 32 GB RAM (windows)
Runs comfortably20 models
Ranked by fit for reasoning use case + predicted speed. Click a row for VRAM breakdown.
Quant: Q4_K_MContext: 8,192VRAM: 11.1 GBHeadroom: 12.9 GBollama run gemma3:1b1023tok/sE
ollama run gemma3:1bQuant: Q8_0Context: 8,192VRAM: 11.6 GBHeadroom: 12.4 GBollama run llama3.2:1b581tok/sE
ollama run llama3.2:1bQuant: Q8_0Context: 8,192VRAM: 13.2 GBHeadroom: 10.8 GBollama run gemma4:e2b291tok/sE
ollama run gemma4:e2bQuant: Q4_K_MContext: 8,192VRAM: 14.8 GBHeadroom: 9.2 GB244tok/sE
Quant: Q4_K_MContext: 8,192VRAM: 19.1 GBHeadroom: 4.9 GB128tok/sE
Quant: Q8_0Context: 8,192VRAM: 14.8 GBHeadroom: 9.2 GBollama run llama3.2:3b194tok/sE
ollama run llama3.2:3bQuant: Q4_K_MContext: 2,048VRAM: 14.5 GBHeadroom: 9.5 GBollama run phi4-reasoning:14b73tok/sE
ollama run phi4-reasoning:14bQuant: Q4_K_MContext: 2,048VRAM: 14.5 GBHeadroom: 9.5 GBollama run deepseek-r1:14b73tok/sE
ollama run deepseek-r1:14bQuant: Q8_0Context: 8,192VRAM: 16.1 GBHeadroom: 7.9 GBollama run phi3.5:3.8b153tok/sE
ollama run phi3.5:3.8bQuant: Q4_K_MContext: 8,192VRAM: 17.9 GBHeadroom: 6.1 GBollama run codegemma:7b146tok/sE
ollama run codegemma:7bQuant: Q8_0Context: 8,192VRAM: 16.5 GBHeadroom: 7.5 GBollama run gemma4:e4b145tok/sE
ollama run gemma4:e4bQuant: Q8_0Context: 8,192VRAM: 16.5 GBHeadroom: 7.5 GBollama run qwen3:4b145tok/sE
ollama run qwen3:4bRuns with tradeoffs26 models
Tight VRAM, partial CPU offload, or context-limited.
Quant: Q8_0Context: 8,192VRAM: 21.3 GBHeadroom: 2.7 GB- • Tight VRAM fit — only 2.7 GB headroom left for context growth
ollama run deepseek-r1:7b83tok/sE
- • Tight VRAM fit — only 2.7 GB headroom left for context growth
ollama run deepseek-r1:7bQuant: Q4_K_MContext: 2,048VRAM: 28.1 GBHeadroom: 15.1 GB- • Partial CPU offload: ~15% of layers run on CPU
ollama run deepseek-r1:32b32tok/sE
- • Partial CPU offload: ~15% of layers run on CPU
ollama run deepseek-r1:32bQuant: Q4_K_MContext: 2,048VRAM: 28.1 GBHeadroom: 15.1 GB- • Partial CPU offload: ~15% of layers run on CPU
ollama run qwq:32b32tok/sE
- • Partial CPU offload: ~15% of layers run on CPU
ollama run qwq:32bQuant: Q4_K_MContext: 8,192VRAM: 20.2 GBHeadroom: 3.8 GB- • Tight VRAM fit — only 3.8 GB headroom left for context growth
ollama run gemma2:9b114tok/sE
- • Tight VRAM fit — only 3.8 GB headroom left for context growth
ollama run gemma2:9bQuant: Q4_K_MContext: 8,192VRAM: 22.5 GBHeadroom: 1.5 GB- • Tight VRAM fit — only 1.5 GB headroom left for context growth
ollama run llama3.2-vision:11b93tok/sE
- • Tight VRAM fit — only 1.5 GB headroom left for context growth
ollama run llama3.2-vision:11bQuant: Q4_K_MContext: 2,048VRAM: 26.6 GBHeadroom: 16.6 GB- • Partial CPU offload: ~10% of layers run on CPU
ollama run nemotron3:nano34tok/sE
- • Partial CPU offload: ~10% of layers run on CPU
ollama run nemotron3:nanoQuant: Q4_K_MContext: 8,192VRAM: 23.6 GBHeadroom: 0.4 GB- • Tight VRAM fit — only 0.4 GB headroom left for context growth
ollama run mistral-nemo:12b85tok/sE
- • Tight VRAM fit — only 0.4 GB headroom left for context growth
ollama run mistral-nemo:12bQuant: Q4_K_MContext: 8,192VRAM: 23.6 GBHeadroom: 0.4 GB- • Tight VRAM fit — only 0.4 GB headroom left for context growth
ollama run gemma3:12b85tok/sE
- • Tight VRAM fit — only 0.4 GB headroom left for context growth
ollama run gemma3:12bWhat if you upgraded?
Hypothetical scenarios. We re-ran the compatibility engine for each.
+32 GB system RAM
~$80–150
Doubles your CPU-offload working set. Helps when models don't quite fit in VRAM.
Unlocks: 32 new tradeoff
- • Qwen 3 30B-A3B
- • Qwen 2.5 Coder 32B Instruct
- • Llama 3.3 70B Instruct
- • Qwen 3 32B
Upgrade to NVIDIA RTX 5000 Ada Generation
see current pricing
32 GB VRAM (vs your 24 GB) plus a bandwidth jump from ~? GB/s to ~? GB/s.
Unlocks: 15 new comfortable
- • DeepSeek R1 Distill Qwen 7B
- • Qwen 3 8B
- • Hermes 3 Llama 3.1 8B
- • Gemma 2 9B Instruct
Add a second NVIDIA GeForce RTX 3090
~$899
Tensor parallelism splits the model across both cards, effectively doubling VRAM. Bandwidth doesn't double — runs ~1.5× the single-card speed in practice.
Unlocks: 16 new comfortable
- • DeepSeek R1 Distill Qwen 7B
- • Qwen 3 8B
- • Hermes 3 Llama 3.1 8B
- • Gemma 2 9B Instruct
Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.
Won't runtop 5 popular models
Need more memory than you have. Shown for orientation.
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
—
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
—
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
—
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
—
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
—
Even with CPU offload, needs more memory than your VRAM (24 GB) + 60% of system RAM (19 GB) combined.
How to read these numbers
Want a specific benchmark we don't have? Email benchmarks@runlocalai.co and we'll prioritize it.