What can an NVIDIA GeForce RTX 3060 12GB run?

Build: RTX 3060 12GB + Ryzen 5 5600 + 32GB DDR4 (cheapest path)

Memory: 12 GB VRAM + 32 GB system RAM
Runner: llama.cpp / Ollama (CUDA)

Runs comfortably
7 models

Fully resident in VRAM, with room for context growth. No compromises.

#1 Gemma 4 E2B (Effective 2B)
2B · gemma · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 5.4 GB · Headroom: 6.6 GB · TTFT: fast
ollama run gemma4:e2b
194 tok/s (E) · Weights: 1.21 GB · KV cache: 0.25 GB · Activations: 2.11 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~403 ms (fast)
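Where these figures come from: the VRAM number on each card is just the four components on that card summed, and headroom is what's left of the 12 GB. A minimal sketch of that additive model, using card #1's values (the breakdown is from this page; the helper itself is our reconstruction):

    def vram_fit(weights_gb, kv_gb, act_gb, runtime_gb, total_vram_gb=12.0):
        # Additive model: footprint = weights + KV cache + activations + runtime
        used = weights_gb + kv_gb + act_gb + runtime_gb
        return used, total_vram_gb - used

    used, headroom = vram_fit(1.21, 0.25, 2.11, 1.80)
    print(f"{used:.1f} GB used, {headroom:.1f} GB headroom")  # 5.4 GB used, 6.6 GB headroom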
#2 Llama 3.2 3B Instruct
3B · llama · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.1 GB · Headroom: 5.9 GB · TTFT: noticeable
ollama run llama3.2:3b
129 tok/s (E) · Weights: 1.81 GB · KV cache: 0.38 GB · Activations: 2.14 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~605 ms (noticeable)
#3 Phi-3.5 Mini Instruct
3.8B · phi · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.7 GB · Headroom: 5.3 GB · TTFT: noticeable
ollama run phi3.5:3.8b
102 tok/s (E) · Weights: 2.29 GB · KV cache: 0.47 GB · Activations: 2.16 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~766 ms (noticeable)
#4 Gemma 4 E4B (Effective 4B)
4B · gemma · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.9 GB · Headroom: 5.1 GB · TTFT: noticeable
ollama run gemma4:e4b
97 tok/s (E) · Weights: 2.42 GB · KV cache: 0.50 GB · Activations: 2.17 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~806 ms (noticeable)
#5 Qwen 3 4B
4B · qwen · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.9 GB · Headroom: 5.1 GB · TTFT: noticeable
ollama run qwen3:4b
97 tok/s (E) · Weights: 2.42 GB · KV cache: 0.50 GB · Activations: 2.17 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~806 ms (noticeable)
#6 Gemma 3 4B
4B · gemma · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.9 GB · Headroom: 5.1 GB · TTFT: noticeable
ollama run gemma3:4b
97 tok/s (E) · Weights: 2.42 GB · KV cache: 0.50 GB · Activations: 2.17 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~806 ms (noticeable)
#7 Phi-3.5 Vision
4.2B · phi · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 7.0 GB · Headroom: 5.0 GB · TTFT: noticeable
92 tok/s (E) · Weights: 2.54 GB · KV cache: 0.53 GB · Activations: 2.17 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~847 ms (noticeable)

Runs with tradeoffs
38 models (8 shown below)

Tight VRAM fit, partial CPU offload, or limited context.
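Context-limited fits are the easiest to work around: shrink the context window and the KV cache shrinks with it. A sketch against Ollama's local HTTP API (port 11434 is Ollama's default and num_ctx is Ollama's option; the surrounding script is illustrative):

    import requests  # assumes a local Ollama server is running

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",
            "prompt": "Summarize KV-cache growth in one sentence.",
            "options": {"num_ctx": 2048},  # smaller context -> smaller KV cache -> more headroom
            "stream": False,
        },
    )
    print(resp.json()["response"])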

Llama 3.1 8B Instruct
8B · llama · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 9.2 GB · Headroom: 2.8 GB · TTFT: noticeable
  • Tight VRAM fit — only 2.8 GB headroom left for context growth
ollama run llama3.1:8b
48 tok/s (E) · Weights: 4.83 GB · KV cache: 0.27 GB · Activations: 2.29 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~1613 ms (noticeable)
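The KV-cache line is the one component you control directly, via context length. For grouped-query-attention models it follows the standard formula, and with Llama 3.1 8B's published shape (32 layers, 8 KV heads, 128-dim heads) and an assumed fp16 cache it reproduces the 0.27 GB above:

    def kv_cache_gb(layers, kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
        # Keys and values: 2 tensors x layers x KV heads x head_dim x tokens x fp16
        return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

    print(f"{kv_cache_gb(32, 8, 128, 2048):.2f} GB")  # 0.27 GB, matching the card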
Qwen 3 30B-A3B
30B · qwen · Commercial OK
Quant: Q4_K_M · Context: 2,048 · Memory: 26.6 GB (VRAM + offloaded RAM) · Headroom: 4.6 GB · TTFT: slow
  • Partial CPU offload: ~55% of layers run on CPU
  • CPU is the bottleneck — upgrading RAM bandwidth helps more than VRAM here
ollama run qwen3:30b
1 tok/s (E) · Weights: 18.11 GB · KV cache: 3.75 GB · Activations: 2.95 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~6047 ms (slow)
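If you drive llama.cpp directly instead of through Ollama, the GPU/CPU split shown on this card becomes an explicit flag: -ngl sets how many layers land on the GPU, and the rest run on CPU. A sketch (the GGUF filename and the layer count are illustrative, not exact):

    llama-cli -m qwen3-30b-a3b-q4_k_m.gguf -c 2048 -ngl 20 -p "Why is the sky blue?"

Raising -ngl until the card runs out of memory, then backing off a step, is the usual way to find the best split.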
Qwen 2.5 Coder 32B Instruct
32B · qwen · Commercial OK
Quant: Q4_K_M · Context: 2,048 · Memory: 24.7 GB (VRAM + offloaded RAM) · Headroom: 6.5 GB · TTFT: slow
  • Partial CPU offload: ~51% of layers run on CPU
  • CPU is the bottleneck — upgrading RAM bandwidth helps more than VRAM here
ollama run qwen2.5-coder:32b
1 tok/s (E) · Weights: 19.32 GB · KV cache: 0.54 GB · Activations: 3.01 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~6450 ms (slow)
Qwen 3 32B
32B · qwen · Commercial OK
Quant: Q4_K_M · Context: 2,048 · Memory: 28.1 GB (VRAM + offloaded RAM) · Headroom: 3.1 GB · TTFT: slow
  • Partial CPU offload: ~57% of layers run on CPU
  • CPU is the bottleneck — upgrading RAM bandwidth helps more than VRAM here
ollama run qwen3:32b
1 tok/s (E) · Weights: 19.32 GB · KV cache: 4.00 GB · Activations: 3.01 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~6450 ms (slow)
Gemma 4 31B Dense
31B · gemma · Commercial OK
Quant: Q4_K_M · Context: 2,048 · Memory: 27.4 GB (VRAM + offloaded RAM) · Headroom: 3.8 GB · TTFT: slow
  • Partial CPU offload: ~56% of layers run on CPU
  • CPU is the bottleneck — upgrading RAM bandwidth helps more than VRAM here
ollama run gemma4:31b
1 tok/s (E) · Weights: 18.72 GB · KV cache: 3.88 GB · Activations: 2.98 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~6249 ms (slow)
Qwen 3 8B
8B · qwen · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 9.9 GB · Headroom: 2.1 GB · TTFT: noticeable
  • Tight VRAM fit — only 2.1 GB headroom left for context growth
ollama run qwen3:8b
48 tok/s (E) · Weights: 4.83 GB · KV cache: 1.00 GB · Activations: 2.29 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~1613 ms (noticeable)
DeepSeek R1 Distill Qwen 32B
32B · deepseek · Commercial OK
Quant: Q4_K_M · Context: 2,048 · Memory: 28.1 GB (VRAM + offloaded RAM) · Headroom: 3.1 GB · TTFT: slow
  • Partial CPU offload: ~57% of layers run on CPU
  • CPU is the bottleneck — upgrading RAM bandwidth helps more than VRAM here
ollama run deepseek-r1:32b
1 tok/s (E) · Weights: 19.32 GB · KV cache: 4.00 GB · Activations: 3.01 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~6450 ms (slow)
Gemma 4 26B MoE
26B · gemma · Commercial OK
Quant: Q4_K_M · Context: 2,048 · Memory: 23.6 GB (VRAM + offloaded RAM) · Headroom: 7.6 GB · TTFT: slow
  • Partial CPU offload: ~49% of layers run on CPU
  • CPU is the bottleneck — upgrading RAM bandwidth helps more than VRAM here
ollama run gemma4:26b-moe
1 tok/s (E) · Weights: 15.70 GB · KV cache: 3.25 GB · Activations: 2.83 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~5241 ms (slow)

What if you upgraded?

Hypothetical scenarios. We re-ran the compatibility engine for each.

+32 GB system RAM

~$80–150

Doubles your CPU-offload working set. Helps when models don't quite fit in VRAM.
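The arithmetic behind this scenario follows the budget rule this page uses in the "Won't run" notes: all of VRAM plus 60% of system RAM. A sketch (the 60% rule is from this page; the helper is ours):

    def offload_budget_gb(vram_gb, ram_gb, ram_fraction=0.60):
        # Usable memory for a partially offloaded model
        return vram_gb + ram_fraction * ram_gb

    print(offload_budget_gb(12, 32))  # 31.2 GB today
    print(offload_budget_gb(12, 64))  # 50.4 GB after the upgrade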

Unlocks 40 new models in the tradeoffs tier, including:

  • Llama 3.1 8B Instruct
  • Qwen 3 30B-A3B
  • Qwen 2.5 Coder 32B Instruct
  • Llama 3.3 70B Instruct

Upgrade to NVIDIA GeForce RTX 4080 Super

~$1099

16 GB VRAM (vs your 12 GB) plus a bandwidth jump from ~360 GB/s to ~736 GB/s.

Unlocks 10 new models in the comfortable tier, including:

  • Gemma 3 1B
  • Llama 3.2 1B Instruct
  • DeepSeek R1 Distill Qwen 7B
  • Llama 3.1 8B Instruct

Add a second NVIDIA GeForce RTX 3060 12GB

~$249

Tensor parallelism splits the model across both cards, effectively doubling VRAM. Bandwidth doesn't double — runs ~1.5× the single-card speed in practice.
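With llama.cpp, the two-card split is a flag change rather than a new runner. A sketch (filename illustrative; --tensor-split sets the per-GPU ratio, and --split-mode row is the tensor-parallel style this scenario describes, versus the default per-layer split):

    llama-cli -m model-q4_k_m.gguf -ngl 99 --split-mode row --tensor-split 1,1 -p "Hello"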

Unlocks 13 new models in the comfortable tier, including:

  • Gemma 3 1B
  • Llama 3.2 1B Instruct
  • Llama 3.1 Nemotron Nano 8B
  • Mistral 7B Instruct v0.3

Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.

Won't run
Top 5 popular models

Need more memory than you have. Shown for orientation.

Qwen 3 235B-A22B
235B
qwen
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.

Llama 4 Scout
109B
llama
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.

DeepSeek R1 (671B reasoning)
671B
deepseek
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.

Llama 3.3 70B Instruct
70B
llama
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.

DeepSeek R1 Distill Llama 70B
70B
deepseek
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.

How to read these numbers

M: Measured — we ran this exact combo on owner hardware.
~: Extrapolated — predicted from a measured benchmark on similar-bandwidth hardware.
E: Estimated — pure formula based on VRAM bandwidth and model architecture.
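For the E rows on this page, the published numbers are internally consistent with two simple fits. Note the constants below are fitted to this page's cards, not taken from the methodology: decode speed is roughly 0.65 × memory bandwidth ÷ weight size, and 512-token TTFT is roughly 334 ms per GB of weights.

    BANDWIDTH_GBPS = 360     # RTX 3060 12GB, as quoted in the upgrade section above
    EFFICIENCY = 0.65        # fitted to this page's cards, not an official constant
    MS_PER_GB_PREFILL = 334  # fitted likewise

    def estimate(weights_gb):
        tok_s = EFFICIENCY * BANDWIDTH_GBPS / weights_gb
        ttft_ms = MS_PER_GB_PREFILL * weights_gb
        return tok_s, ttft_ms

    print(estimate(1.21))  # ~(193, 404): card #1 shows 194 tok/s, ~403 ms
    print(estimate(4.83))  # ~(48, 1613): matches the Llama 3.1 8B card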

Full methodology →

Want a specific benchmark we don't have? Email benchmarks@runlocalai.co and we'll prioritize it.