What can NVIDIA GeForce RTX 3060 12GB run for coding?

Build: NVIDIA GeForce RTX 3060 12GB + 32 GB RAM (Windows)

Memory: 12 GB VRAM + 32 GB system RAM
Runner: llama.cpp / Ollama (CUDA)

Runs comfortably
7 models

Ranked by fit for the coding use case and predicted speed. Each entry includes a VRAM breakdown.

#1 Gemma 3n E2B (Effective 2B)
2B
gemma
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 5.4 GB · Headroom: 6.6 GB · TTFT: fast
ollama run gemma3n:e2b
194 tok/s (E)
Weights: 1.21 GB · KV cache: 0.25 GB · Activations: 2.11 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~403 ms (fast)
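The four breakdown components sum to the reported VRAM figure. A quick sanity check in Python, using the per-component sizes from the row above and this build's 12 GB card:

```python
# Sanity-check the VRAM math for the top-ranked row, using the figures above.
components_gb = {
    "weights": 1.21,      # Q4_K_M quantized weights
    "kv_cache": 0.25,     # key/value cache at 2,048 context
    "activations": 2.11,  # inference-time buffers
    "runtime": 1.80,      # CUDA context + runner overhead
}
total = sum(components_gb.values())  # ≈ 5.4 GB, as reported
headroom = 12.0 - total              # ≈ 6.6 GB free on a 12 GB card
print(f"total {total:.1f} GB, headroom {headroom:.1f} GB")
```

The same arithmetic applies to every row below: VRAM = weights + KV cache + activations + runtime, and headroom is whatever the 12 GB card has left.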
#2 Llama 3.2 3B Instruct
3B
llama
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.1 GB · Headroom: 5.9 GB · TTFT: noticeable
ollama run llama3.2:3b
129 tok/s (E)
Weights: 1.81 GB · KV cache: 0.38 GB · Activations: 2.14 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~605 ms (noticeable)
#3 Phi-3.5 Mini Instruct
3.8B
phi
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.7 GB · Headroom: 5.3 GB · TTFT: noticeable
ollama run phi3.5:3.8b
102 tok/s (E)
Weights: 2.29 GB · KV cache: 0.47 GB · Activations: 2.16 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~766 ms (noticeable)
#4 Gemma 3n E4B (Effective 4B)
4B
gemma
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.9 GB · Headroom: 5.1 GB · TTFT: noticeable
ollama run gemma3n:e4b
97 tok/s (E)
Weights: 2.42 GB · KV cache: 0.50 GB · Activations: 2.17 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~806 ms (noticeable)
#5 Qwen 3 4B
4B
qwen
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.9 GB · Headroom: 5.1 GB · TTFT: noticeable
ollama run qwen3:4b
97 tok/s (E)
Weights: 2.42 GB · KV cache: 0.50 GB · Activations: 2.17 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~806 ms (noticeable)
#6 Gemma 3 4B
4B
gemma
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 6.9 GB · Headroom: 5.1 GB · TTFT: noticeable
ollama run gemma3:4b
97 tok/s (E)
Weights: 2.42 GB · KV cache: 0.50 GB · Activations: 2.17 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~806 ms (noticeable)
#7 Phi-3.5 Vision
4.2B
phi
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 7.0 GB · Headroom: 5.0 GB · TTFT: noticeable
92 tok/s (E)
Weights: 2.54 GB · KV cache: 0.53 GB · Activations: 2.17 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~847 ms (noticeable)
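The estimated tok/s figures above are consistent with a simple memory-bandwidth model: decode is memory-bound, so each generated token streams the full quantized weights once at some fraction of the card's peak bandwidth. A sketch (the 0.65 efficiency factor and the ~334 ms/GB prefill fit are our assumptions, inferred by fitting the table above, not the site's published formula):

```python
# Memory-bound decode model for the RTX 3060 12GB (peak ~360 GB/s).
# EFFICIENCY and MS_PER_GB are assumptions fitted to the table above.
BANDWIDTH_GBPS = 360.0

def est_decode_tps(weights_gb, efficiency=0.65):
    # Each decoded token reads the full weight tensor once from VRAM.
    return efficiency * BANDWIDTH_GBPS / weights_gb

def est_ttft_ms(weights_gb, ms_per_gb=334.0):
    # Rough linear fit: prefill time scales with weight size on this card.
    return weights_gb * ms_per_gb

print(round(est_decode_tps(1.81)))  # Llama 3.2 3B: 129 tok/s, matching above
```

With these two constants the sketch reproduces the estimated rows above to within a token or two per second; a bigger card mostly changes `BANDWIDTH_GBPS`, which is why the upgrade scenarios below emphasize bandwidth.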

Runs with tradeoffs
38 models

Tight VRAM, partial CPU offload, or context-limited.

Gemma 3 1B
1B
gemma
Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 11.1 GB · Headroom: 0.9 GB · TTFT: fast
  • Tight VRAM fit — only 0.9 GB headroom left for context growth
ollama run gemma3:1b
388 tok/s (E)
Weights: 0.60 GB · KV cache: 0.50 GB · Activations: 8.22 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~202 ms (fast)
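Headroom matters because the KV cache grows linearly with context length: every extra context token adds one K and one V vector per layer. A generic estimate (the layer count, KV-head count, and head dimension below are illustrative 7B-class values, not any specific model's architecture):

```python
# KV cache size: K and V vectors for every layer at every context position.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
    # 2 tensors (K and V) * layers * KV heads * head_dim * positions, fp16.
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

# Illustrative 7B-class shape: quadrupling context quadruples the cache.
print(kv_cache_gb(32, 8, 128, 2048))  # ≈ 0.27 GB
print(kv_cache_gb(32, 8, 128, 8192))  # ≈ 1.07 GB
```

This is why the same model can sit in "comfortable" at 2,048 context but become a tight fit at 8,192.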
Llama 3.2 1B Instruct
1B
llama
Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 11.6 GB · Headroom: 0.4 GB · TTFT: fast
  • Tight VRAM fit — only 0.4 GB headroom left for context growth
ollama run llama3.2:1b
220 tok/s (E)
Weights: 1.06 GB · KV cache: 0.50 GB · Activations: 8.25 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~202 ms (fast)
CodeGemma 7B
7B
gemma
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 9.2 GB · Headroom: 2.8 GB · TTFT: noticeable
  • Tight VRAM fit — only 2.8 GB headroom left for context growth
ollama run codegemma:7b
55 tok/s (E)
Weights: 4.23 GB · KV cache: 0.88 GB · Activations: 2.26 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~1,411 ms (noticeable)
Qwen 2.5 7B Instruct
7B
qwen
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 8.4 GB · Headroom: 3.6 GB · TTFT: noticeable
  • Tight VRAM fit — only 3.6 GB headroom left for context growth
ollama run qwen2.5:7b
55 tok/s (E)
Weights: 4.23 GB · KV cache: 0.12 GB · Activations: 2.26 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~1,411 ms (noticeable)
DeepSeek Coder V2 Lite (16B)
16B
deepseek
Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 28.1 GB · Headroom: 3.1 GB · TTFT: slow
  • Partial CPU offload: ~57% of layers run on CPU
ollama run deepseek-coder-v2:16b
24 tok/s (E)
Weights: 9.66 GB · KV cache: 8.00 GB · Activations: 8.68 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~3,225 ms (slow)
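The offload percentage follows from the memory shortfall: whatever portion of the model's footprint doesn't fit in VRAM has to run from system RAM. A sketch using this model's numbers (the proportional-split assumption is ours; the real engine allocates whole layers):

```python
# Estimate the CPU-offload fraction from the memory shortfall.
required_gb = 28.1  # total footprint from the breakdown above
vram_gb = 12.0      # RTX 3060 12GB

offload = max(0.0, (required_gb - vram_gb) / required_gb)
print(f"~{offload:.0%} of layers on CPU")  # ~57%, matching the note above
```

Layers served from system RAM run at CPU memory bandwidth, which is why offloaded models drop into the low tens of tok/s.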
Llama 3.1 8B Instruct
8B
llama
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 9.2 GB · Headroom: 2.8 GB · TTFT: noticeable
  • Tight VRAM fit — only 2.8 GB headroom left for context growth
ollama run llama3.1:8b
48 tok/s (E)
Weights: 4.83 GB · KV cache: 0.27 GB · Activations: 2.29 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~1,613 ms (noticeable)
Qwen 3 8B
8B
qwen
Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 9.9 GB · Headroom: 2.1 GB · TTFT: noticeable
  • Tight VRAM fit — only 2.1 GB headroom left for context growth
ollama run qwen3:8b
48 tok/s (E)
Weights: 4.83 GB · KV cache: 1.00 GB · Activations: 2.29 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~1,613 ms (noticeable)
Codestral 22B
22B
mistral
Quant: Q4_K_M · Context: 2,048 · VRAM: 20.5 GB · Headroom: 10.7 GB · TTFT: slow
  • Partial CPU offload: ~42% of layers run on CPU
ollama run codestral:22b
18 tok/s (E)
Weights: 13.28 GB · KV cache: 2.75 GB · Activations: 2.71 GB · Runtime: 1.80 GB
Time to first token (prefill, 512-token prompt): ~4,435 ms (slow)

What if you upgraded?

Hypothetical scenarios. We re-ran the compatibility engine for each.

+32 GB system RAM

~$80–150

Doubles your CPU-offload working set. Helps when models don't quite fit in VRAM.

Unlocks: 40 new tradeoff-tier models

  • Llama 3.1 8B Instruct
  • Qwen 3 30B-A3B
  • Qwen 2.5 Coder 32B Instruct
  • Llama 3.3 70B Instruct

Upgrade to NVIDIA GeForce RTX 4080 Super

~$1099

16 GB VRAM (vs your 12 GB) plus a bandwidth jump from ~360 GB/s to ~736 GB/s.

Unlocks: 10 new comfortable-tier models

  • Gemma 3 1B
  • Llama 3.2 1B Instruct
  • DeepSeek R1 Distill Qwen 7B
  • Llama 3.1 8B Instruct

Add a second NVIDIA GeForce RTX 3060 12GB

~$249

Tensor parallelism splits the model across both cards, effectively doubling VRAM. Throughput doesn't double, though; expect ~1.5× single-card speed in practice because of inter-GPU synchronization overhead.

Unlocks: 13 new comfortable-tier models

  • Gemma 3 1B
  • Llama 3.2 1B Instruct
  • Llama 3.1 Nemotron Nano 8B
  • Mistral 7B Instruct v0.3

Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.

Won't run
top 5 popular models

These need more memory than this build has. Shown for orientation.

Qwen 3 235B-A22B
235B
qwen
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.

Llama 4 Scout
109B
llama
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.

DeepSeek R1 (671B reasoning)
671B
deepseek
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.

Llama 3.3 70B Instruct
70B
llama
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.

DeepSeek R1 Distill Llama 70B
70B
deepseek
Commercial OK

Even with CPU offload, needs more memory than your VRAM (12 GB) + 60% of system RAM (19 GB) combined.
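The cutoff applied above reduces to one rule: a model can run (possibly with CPU offload) only if its total footprint fits in VRAM plus a 60% slice of system RAM, with the rest reserved for the OS and other apps. A sketch (the ~40 GB figure for a Q4-quantized 70B model is a rough assumption of ours):

```python
# "Won't run" rule from above: the total footprint must fit within
# VRAM + 60% of system RAM (the remainder is reserved for the OS).
def fits_with_offload(model_gb, vram_gb=12.0, ram_gb=32.0, ram_fraction=0.60):
    return model_gb <= vram_gb + ram_fraction * ram_gb  # 12 + 19.2 = 31.2 GB

print(fits_with_offload(28.1))  # True:  DeepSeek Coder V2 Lite runs with offload
print(fits_with_offload(40.0))  # False: a ~40 GB Q4 70B model won't run
```

This also explains the RAM upgrade scenario above: +32 GB of system RAM raises the budget from 31.2 GB to 50.4 GB, moving several of these models into the tradeoff tier.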

How to read these numbers

M: Measured — we ran this exact combo on owner hardware.
~: Extrapolated — predicted from a measured benchmark on similar-bandwidth hardware.
E: Estimated — pure formula based on VRAM bandwidth and model architecture.

Full methodology →

Want a specific benchmark we don't have? Email benchmarks@runlocalai.co and we'll prioritize it.