What can Apple Mac Studio (M3 Ultra) run for reasoning?

Build: Apple Mac Studio (M3 Ultra) · 32 GB RAM (macOS)

Memory: 32 GB unified memory
Runner: llama.cpp (Metal)

Runs comfortably
21 models

Ranked by fit for the reasoning use case plus predicted speed. Each entry includes its VRAM breakdown.

#1 Gemma 3 1B
1B · gemma · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 10.0 GB · Headroom: 14.0 GB
ollama run gemma3:1b
976 tok/s (E)
Weights 0.60 GB · KV cache 0.50 GB · Activations 8.22 GB · Runtime 0.70 GB
#2 Llama 3.2 1B Instruct
1B · llama · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 10.5 GB · Headroom: 13.5 GB
ollama run llama3.2:1b
554 tok/s (E)
Weights 1.06 GB · KV cache 0.50 GB · Activations 8.25 GB · Runtime 0.70 GB
#3 Gemma 3n E2B (Effective 2B)
2B · gemma · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 12.1 GB · Headroom: 11.9 GB
ollama run gemma3n:e2b
277 tok/s (E)
Weights 2.13 GB · KV cache 1.00 GB · Activations 8.30 GB · Runtime 0.70 GB
#4 Phi-3.5 Vision
4.2B · phi · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 13.7 GB · Headroom: 10.3 GB
232 tok/s (E)
Weights 2.54 GB · KV cache 2.10 GB · Activations 8.32 GB · Runtime 0.70 GB
#5 Llama 3.1 Nemotron Nano 8B
8B · llama · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 18.0 GB · Headroom: 6.0 GB
122 tok/s (E)
Weights 4.83 GB · KV cache 4.00 GB · Activations 8.43 GB · Runtime 0.70 GB
#6 Llama 3.2 3B Instruct
3B · llama · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 13.7 GB · Headroom: 10.3 GB
ollama run llama3.2:3b
185 tok/s (E)
Weights 3.19 GB · KV cache 1.50 GB · Activations 8.35 GB · Runtime 0.70 GB
#7 Phi-4 Reasoning 14B
14B · phi · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 13.4 GB · Headroom: 10.6 GB
ollama run phi4-reasoning:14b
70 tok/s (E)
Weights 8.45 GB · KV cache 1.75 GB · Activations 2.47 GB · Runtime 0.70 GB
#8 DeepSeek R1 Distill Qwen 14B
14B · deepseek · Commercial OK
Quant: Q4_K_M · Context: 2,048 · VRAM: 13.4 GB · Headroom: 10.6 GB
ollama run deepseek-r1:14b
70 tok/s (E)
Weights 8.45 GB · KV cache 1.75 GB · Activations 2.47 GB · Runtime 0.70 GB
#9 Phi-3.5 Mini Instruct
3.8B · phi · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 15.0 GB · Headroom: 9.0 GB
ollama run phi3.5:3.8b
146 tok/s (E)
Weights 4.04 GB · KV cache 1.90 GB · Activations 8.39 GB · Runtime 0.70 GB
#10 CodeGemma 7B
7B · gemma · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 16.8 GB · Headroom: 7.2 GB
ollama run codegemma:7b
139 tok/s (E)
Weights 4.23 GB · KV cache 3.50 GB · Activations 8.40 GB · Runtime 0.70 GB
#11 Gemma 3n E4B (Effective 4B)
4B · gemma · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 15.4 GB · Headroom: 8.6 GB
ollama run gemma3n:e4b
139 tok/s (E)
Weights 4.25 GB · KV cache 2.00 GB · Activations 8.40 GB · Runtime 0.70 GB
#12 Qwen 3 4B
4B · qwen · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 15.4 GB · Headroom: 8.6 GB
ollama run qwen3:4b
139 tok/s (E)
Weights 4.25 GB · KV cache 2.00 GB · Activations 8.40 GB · Runtime 0.70 GB
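Every comfortable-fit entry above follows the same arithmetic: total VRAM is weights plus KV cache plus activations plus runtime overhead, and headroom is the usable budget minus that total. A minimal sketch, assuming the 24 GB budget (32 GB minus ~8 GB OS overhead) implied by the "won't run" notes below; this mirrors the table's numbers, not the site's actual engine:

```python
def vram_fit(weights_gb, kv_cache_gb, activations_gb,
             runtime_gb=0.70, budget_gb=24.0):
    """Total VRAM use and remaining headroom against the usable budget."""
    total = weights_gb + kv_cache_gb + activations_gb + runtime_gb
    return total, budget_gb - total

# Gemma 3 1B at Q4_K_M, 8,192-token context (values from the #1 entry):
total, headroom = vram_fit(0.60, 0.50, 8.22)
print(f"{total:.1f} GB VRAM, {headroom:.1f} GB headroom")  # 10.0 GB, 14.0 GB
```

Plugging in any other entry's breakdown reproduces its listed VRAM and headroom figures.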

Runs with tradeoffs
14 models

Tight VRAM, partial CPU offload, or context-limited.
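One lever for a tight fit is shrinking the context window. The KV cache values in this report scale linearly with context length, and every listed figure matches roughly 0.5 GB per billion parameters at an 8,192-token context. A sketch of that apparent rule (an observation about the table's numbers, not the site's published formula):

```python
def kv_cache_gb(params_b, context, gb_per_b_at_8k=0.5):
    """KV cache estimate: linear in both parameter count and context length."""
    return params_b * gb_per_b_at_8k * (context / 8192)

print(kv_cache_gb(14, 2048))   # Phi-4 Reasoning 14B at 2,048 ctx -> 1.75 GB
print(kv_cache_gb(12, 8192))   # Gemma 3 12B at 8,192 ctx -> 6.0 GB
```

This is why the two 14B entries in the ranked list are run at a 2,048-token context: quartering the context quarters their KV cache.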

DeepSeek R1 Distill Qwen 7B
7B · deepseek · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 20.2 GB · Headroom: 3.8 GB
  • Tight VRAM fit — only 3.8 GB headroom left for context growth
ollama run deepseek-r1:7b
79 tok/s (E)
Weights 7.44 GB · KV cache 3.50 GB · Activations 8.56 GB · Runtime 0.70 GB
Llama 3.2 11B Vision Instruct
11B · llama · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 21.4 GB · Headroom: 2.6 GB
  • Tight VRAM fit — only 2.6 GB headroom left for context growth
ollama run llama3.2-vision:11b
89 tok/s (E)
Weights 6.64 GB · KV cache 5.50 GB · Activations 8.52 GB · Runtime 0.70 GB
Gemma 3 12B
12B · gemma · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 22.5 GB · Headroom: 1.5 GB
  • Tight VRAM fit — only 1.5 GB headroom left for context growth
ollama run gemma3:12b
81 tok/s (E)
Weights 7.25 GB · KV cache 6.00 GB · Activations 8.55 GB · Runtime 0.70 GB
Pixtral 12B
12B · mistral · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 22.5 GB · Headroom: 1.5 GB
  • Tight VRAM fit — only 1.5 GB headroom left for context growth
ollama run pixtral:12b
81 tok/s (E)
Weights 7.25 GB · KV cache 6.00 GB · Activations 8.55 GB · Runtime 0.70 GB
Mistral Nemo 12B Instruct
12B · mistral · Commercial OK
Quant: Q5_K_M · Context: 8,192 · VRAM: 23.6 GB · Headroom: 0.4 GB
  • Tight VRAM fit — only 0.4 GB headroom left for context growth
ollama run mistral-nemo:12b
71 tok/s (E)
Weights 8.25 GB · KV cache 6.00 GB · Activations 8.60 GB · Runtime 0.70 GB
Qwen 3 8B
8B · qwen · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 21.8 GB · Headroom: 2.2 GB
  • Tight VRAM fit — only 2.2 GB headroom left for context growth
ollama run qwen3:8b
69 tok/s (E)
Weights 8.50 GB · KV cache 4.00 GB · Activations 8.62 GB · Runtime 0.70 GB
Hermes 3 Llama 3.1 8B
8B · hermes · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 21.8 GB · Headroom: 2.2 GB
  • Tight VRAM fit — only 2.2 GB headroom left for context growth
ollama run hermes3:8b
69 tok/s (E)
Weights 8.50 GB · KV cache 4.00 GB · Activations 8.62 GB · Runtime 0.70 GB
Gemma 2 9B Instruct
9B · gemma · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 23.4 GB · Headroom: 0.6 GB
  • Tight VRAM fit — only 0.6 GB headroom left for context growth
ollama run gemma2:9b
62 tok/s (E)
Weights 9.56 GB · KV cache 4.50 GB · Activations 8.67 GB · Runtime 0.70 GB

What if you upgraded?

Hypothetical scenarios. We re-ran the compatibility engine for each.

Move up an Apple memory tier

~$200–400 over base

On Apple Silicon, more unified memory is the only path forward — VRAM and system RAM are the same pool.


Won't run
top 5 popular models

Need more memory than you have. Shown for orientation.

Qwen 3 235B-A22B
235B
qwen
Commercial OK

Needs ~160 GB unified memory minimum at smallest quant; you have 24 GB available after OS overhead.

Llama 4 Scout
109B
llama
Commercial OK

Needs ~80 GB unified memory minimum at smallest quant; you have 24 GB available after OS overhead.

DeepSeek R1 (671B reasoning)
671B
deepseek
Commercial OK

Needs ~420 GB unified memory minimum at smallest quant; you have 24 GB available after OS overhead.

Qwen 3 30B-A3B
30B
qwen
Commercial OK

Needs ~22 GB unified memory for the weights alone at the smallest quant; with KV cache, activations, and runtime overhead it exceeds the 24 GB available after OS overhead.

Llama 3.3 70B Instruct
70B
llama
Commercial OK

Needs ~48 GB unified memory minimum at smallest quant; you have 24 GB available after OS overhead.
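The "needs ~X GB minimum" figures above are consistent with simple weight-size arithmetic: parameter count times bits per weight, divided by 8, plus some fixed overhead. The 5-bit quant and 2 GB overhead below are illustrative assumptions, not the engine's actual settings:

```python
def min_memory_gb(params_b, bits_per_weight=5.0, overhead_gb=2.0):
    """Back-of-envelope minimum memory: weight bytes plus fixed overhead."""
    return params_b * bits_per_weight / 8 + overhead_gb

# Compare against the listed minimums (~48 GB and ~420 GB respectively):
for name, params in [("Llama 3.3 70B", 70), ("DeepSeek R1 671B", 671)]:
    print(f"{name}: ~{min_memory_gb(params):.0f} GB")
```

Mixture-of-experts entries such as Qwen 3 235B-A22B still need the full parameter count resident in memory, which is why the minimum tracks 235B rather than the 22B active parameters.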

How to read these numbers

M
Measured — we ran this exact combo on owner hardware.

~
Extrapolated — predicted from a measured benchmark on similar-bandwidth hardware.

E
Estimated — pure formula based on VRAM bandwidth and model architecture.
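For memory-bound token generation, that kind of formula is typically speed ≈ effective memory bandwidth divided by the weight bytes read per token. Dividing the estimated speeds in this report by each model's weight size gives an effective bandwidth near 590 GB/s (the M3 Ultra's nominal ~819 GB/s times an efficiency factor); the constant below is inferred from the table, not taken from the published methodology:

```python
def estimated_tok_s(weights_gb, effective_bw_gbs=588.0):
    """Memory-bound decode estimate: bandwidth / bytes of weights per token."""
    return effective_bw_gbs / weights_gb

print(round(estimated_tok_s(0.60)))  # Gemma 3 1B: table lists 976 tok/s
print(round(estimated_tok_s(8.45)))  # Phi-4 Reasoning 14B: table lists 70 tok/s
```

This is also why the Q8_0 variants are roughly half the speed of Q4_K_M at the same parameter count: twice the weight bytes per token means half the tokens per second.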

Full methodology →

Want a specific benchmark we don't have? Email benchmarks@runlocalai.co and we'll prioritize it.