What runs on Mac Studio M3 Ultra 192GB?
Apple Silicon flagship with 192 GB unified memory. Genuinely pools — total VRAM ≈ effective VRAM. Trades NVIDIA throughput for the largest model envelope at any reasonable power budget.
Single-Mac deployment recipe with MLX-LM + Ollama. The canonical Apple Silicon path.
Apple unified memory genuinely pools — there is no separate GPU VRAM. The CPU, GPU, and Neural Engine all share the same 192 GB pool with ~800 GB/s memory bandwidth. Effective ceiling for inference is ~140 GB because macOS reserves system memory and you need headroom for KV cache and activations. Concretely: a 200B-class model at Q4 (~110 GB weights) fits comfortably with 25-30 GB of context budget. This is the rare case where 'pooled VRAM' is genuine, not marketing. The tradeoff: 800 GB/s bandwidth is 25-30% of an RTX 4090, so tokens-per-second scale lower even though the model fits.
Genuinely pooled — 192 GB of unified memory shared between CPU/GPU. Effective ceiling is ~140 GB after OS reservations and KV-cache budget.
See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.
Topology
Models that fit comfortably (24)
Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.
Borderline (6)
Fits but with little headroom. KV cache for long context may not fit; verify before deployment.
Effective VRAM utilization >114% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Effective VRAM utilization >114% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Effective VRAM utilization >114% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Effective VRAM utilization >114% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Effective VRAM utilization >100% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Effective VRAM utilization >100% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Not practical (8)
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.
Benchmark opportunities
estimates, not measurementsPending benchmark targets for this combo. Once measured, results land in the catalog as benchmarks.
Apple Silicon at the frontier-MoE envelope. 17B-active makes this fit comfortably in 192GB unified memory. Bandwidth-bound; expect ~25-30% of NVIDIA tok/s but largest fittable model wins.
Going deeper
- Full combo detail page — operational review with failure modes and runtime matrix.
- Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
- RunLocalAI Will-It-Run Framework — citable effective-VRAM, working-set, fit-tier, and evidence-tier method.
- Will-it-run home — single-card check + custom builds.