RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Will it run?
  4. /Mac Studio M3 Ultra 192GB
Apple unifiedUnified memorybeginner
Pooled memory

What runs on Mac Studio M3 Ultra 192GB?

Apple Silicon flagship with 192 GB unified memory. Genuinely pools — total VRAM ≈ effective VRAM. Trades NVIDIA throughput for the largest model envelope at any reasonable power budget.

At a glance
Effective VRAM
140 / 192 GB
Genuinely pooled
Speed penalty
~0%
vs ideal single-card
Recommended runtime
mlx-lm
none
Setup difficulty
beginner
~370W peak
24
Models fit
6
Borderline
8
Not practical
Deployment recipe
Apple Silicon AI →

Single-Mac deployment recipe with MLX-LM + Ollama. The canonical Apple Silicon path.

Memory budget
Total VRAM
192 GB
Effective for inference
140 GB
73% of total
Genuinely pooled

Apple unified memory genuinely pools — there is no separate GPU VRAM. The CPU, GPU, and Neural Engine all share the same 192 GB pool with ~800 GB/s memory bandwidth. Effective ceiling for inference is ~140 GB because macOS reserves system memory and you need headroom for KV cache and activations. Concretely: a 200B-class model at Q4 (~110 GB weights) fits comfortably with 25-30 GB of context budget. This is the rare case where 'pooled VRAM' is genuine, not marketing. The tradeoff: 800 GB/s bandwidth is 25-30% of an RTX 4090, so tokens-per-second scale lower even though the model fits.

Why total VRAM is not the whole story

Genuinely pooled — 192 GB of unified memory shared between CPU/GPU. Effective ceiling is ~140 GB after OS reservations and KV-cache budget.

See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.

Topology

Topology
apple-unified
Interconnect
unified-memory~800 GB/s
Component count
1 unit
Components
  • 1×mac-studio-m3-ultra
Recommended runtime
mlx-lm
Also: llama-cpp, ollama, lm-studio
Recommended split strategy
none
Setup difficulty
beginner
~370W peak

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.

GLM-5 Pro
Fits
144B·AWQ-INT4 → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal
Mixtral 8x22B Instruct
Fits
141B·Q4_K_M → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal
WizardLM-2 8x22B
Fits
141B·Q4_K_M → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal
DBRX Base
Fits
132B·Q4_K_M → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal
DBRX Instruct
Fits
132B·AWQ-INT4 → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal
Mistral Large 2 (123B)
Fits
123B·Q4_K_M → 88 GB·63% of effective VRAM·~0% speed penalty vs ideal
Nemotron 3 Super (120B-A12B)
Fits
120B·Q4_K_M → 84 GB·60% of effective VRAM·~0% speed penalty vs ideal
Llama 4 Scout
Fits
109B·Q4_K_M → 80 GB·57% of effective VRAM·~0% speed penalty vs ideal
Sarvam 105B
Fits
105B·Q4_K_M → 74 GB·53% of effective VRAM·~0% speed penalty vs ideal
Sarvam 105B FP8
Fits
105B·Q4_K_M → 74 GB·53% of effective VRAM·~0% speed penalty vs ideal
Command R+ (Aug 2024)
Fits
104B·AWQ-INT4 → 72 GB·51% of effective VRAM·~0% speed penalty vs ideal
Command R+ 104B
Fits
104B·Q4_K_M → 70 GB·50% of effective VRAM·~0% speed penalty vs ideal
Llama 3.2 90B Vision Instruct
Fits
90B·Q4_K_M → 60 GB·43% of effective VRAM·~0% speed penalty vs ideal
Llama 3.2 90B Vision
Fits
90B·AWQ-INT4 → 64 GB·46% of effective VRAM·~0% speed penalty vs ideal
InternVL 2.5 78B
Fits
78B·Q4_K_M → 52 GB·37% of effective VRAM·~0% speed penalty vs ideal
Molmo 72B
Fits
72B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal
Qwen 2.5 Math 72B
Fits
72B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal
Qwen 2.5 72B Instruct
Fits
72B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal
Qwen 2.5-VL 72B
Fits
72B·AWQ-INT4 → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal
Llama 4 70B
Fits
70B·AWQ-INT4 → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal
Tulu 3 70B
Fits
70B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal
Dolphin 3 Llama 3.3 70B
Fits
70B·AWQ-INT4 → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal
EVA Llama 3.3 70B
Fits
70B·AWQ-INT4 → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal
OpenBioLLM Llama 3 70B
Fits
70B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

Borderline (6)

Fits but with little headroom. KV cache for long context may not fit; verify before deployment.

Llama 3.1 Nemotron Ultra 253B
Borderline
253B·Q4_K_M → 160 GB·114% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization >114% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

DeepSeek V2.5 236B
Borderline
236B·Q4_K_M → 160 GB·114% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization >114% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

DeepSeek Coder V2 236B
Borderline
236B·Q4_K_M → 160 GB·114% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization >114% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Qwen 3 235B-A22B
Borderline
235B·Q4_K_M → 160 GB·114% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization >114% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

GLM-5
Borderline
200B·Q4_K_M → 140 GB·100% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization >100% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Kimi K1.5
Borderline
200B·AWQ-INT4 → 140 GB·100% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization >100% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Not practical (8)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.

DeepSeek V4 Pro (1.6T MoE)
Not practical
1600B·Q4_K_M → 1024 GB·731% of effective VRAM·~0% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Step-3
Not practical
1000B·AWQ-INT4 → 640 GB·457% of effective VRAM·~0% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Kimi K2.6
Not practical
1000B·Q4_K_M → 700 GB·500% of effective VRAM·~0% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek V4
Not practical
745B·AWQ-INT4 → 480 GB·343% of effective VRAM·~0% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Mistral Medium 3.5 (675B MoE)
Not practical
675B·Q4_K_M → 448 GB·320% of effective VRAM·~0% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek R1 (671B reasoning)
Not practical
671B·Q4_K_M → 420 GB·300% of effective VRAM·~0% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek V3 (671B MoE)
Not practical
671B·Q4_K_M → 420 GB·300% of effective VRAM·~0% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Llama 4 405B
Not practical
405B·AWQ-INT4 → 280 GB·200% of effective VRAM·~0% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Benchmark opportunities

estimates, not measurements

Pending benchmark targets for this combo. Once measured, results land in the catalog as benchmarks.

Mac Studio M3 Ultra 192GB + Qwen 3.5 235B-A17B (MLX-4bit)
pending
Estimate: 8-14 tok/s decode (single stream)

Apple Silicon at the frontier-MoE envelope. 17B-active makes this fit comfortably in 192GB unified memory. Bandwidth-bound; expect ~25-30% of NVIDIA tok/s but largest fittable model wins.

Going deeper

  • Full combo detail page — operational review with failure modes and runtime matrix.
  • Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
  • RunLocalAI Will-It-Run Framework — citable effective-VRAM, working-set, fit-tier, and evidence-tier method.
  • Will-it-run home — single-card check + custom builds.