Apple unified · Unified memory · Beginner

Pooled memory

What runs on Mac Studio M3 Ultra 192GB?

Apple Silicon flagship with 192 GB of unified memory. The memory genuinely pools: the GPU addresses the same pool as the CPU, and about 140 GB of the 192 GB is usable for inference. Trades NVIDIA throughput for the largest model envelope at a reasonable power budget (~370 W peak).

At a glance

Effective VRAM: 140 / 192 GB (genuinely pooled)

Speed penalty: ~0% vs ideal single-card

Recommended runtime: mlx-lm

Recommended split strategy: none

Setup difficulty: beginner

Power: ~370W peak

Models fit: 24

Borderline: 6

Not practical: 8

Deployment recipe

Apple Silicon AI: single-Mac deployment recipe with MLX-LM + Ollama. The canonical Apple Silicon path.

Memory budget

Total VRAM: 192 GB

Effective for inference: 140 GB (73% of total, genuinely pooled)

Apple unified memory genuinely pools: there is no separate GPU VRAM. The CPU, GPU, and Neural Engine all share the same 192 GB pool with ~800 GB/s of memory bandwidth. The effective ceiling for inference is ~140 GB because macOS reserves system memory and you need headroom for KV cache and activations. Concretely: a 200B-class model at Q4 (~110 GB of weights) fits with 25-30 GB left as context budget, which lands right at the ceiling. This is the rare case where 'pooled VRAM' is genuine, not marketing. The tradeoff is throughput: ~800 GB/s is roughly 80% of a single RTX 4090's ~1 TB/s, and decode on large models is bandwidth-bound, so tokens per second fall as the model grows even though it fits.
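A rough way to see the bandwidth tradeoff: during decode, each generated token has to stream the active weights from memory once, so bandwidth divided by the weight footprint gives an upper bound on tokens per second. The sketch below applies that rule of thumb to figures from this page. The function name is illustrative, and real throughput will be lower because of compute, KV-cache reads, and runtime overhead. For mixture-of-experts models only the active experts are read per token, so pass the active footprint rather than the full one.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on decode tokens/s when generation is memory-bandwidth-bound.

    weights_read_gb is the amount of weight data streamed per token
    (the full footprint for a dense model, the active experts for MoE).
    """
    if weights_read_gb <= 0:
        raise ValueError("weights_read_gb must be positive")
    return bandwidth_gb_s / weights_read_gb


# Figures from this page: ~800 GB/s of unified-memory bandwidth.
BANDWIDTH_GB_S = 800

# ~110 GB of Q4 weights for a 200B-class dense model.
print(f"200B-class at Q4: at most ~{decode_ceiling_tok_s(BANDWIDTH_GB_S, 110):.1f} tok/s")

# ~48 GB footprint listed for 70B-class models at 4-bit.
print(f"70B-class at Q4:  at most ~{decode_ceiling_tok_s(BANDWIDTH_GB_S, 48):.1f} tok/s")
```

Treat the result as a ceiling for sanity-checking benchmarks, not as a prediction.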

Why total VRAM is not the whole story

Genuinely pooled — 192 GB of unified memory shared between CPU/GPU. Effective ceiling is ~140 GB after OS reservations and KV-cache budget.
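The budget arithmetic can be sketched as a small helper using this page's figures (192 GB total, ~140 GB effective). The names are illustrative, and the 140 GB ceiling is taken as given rather than derived from macOS internals.

```python
TOTAL_GB = 192      # physical unified memory
EFFECTIVE_GB = 140  # this page's figure after OS reservations and headroom


def context_budget_gb(weights_gb: float, effective_gb: float = EFFECTIVE_GB) -> float:
    """GB left for KV cache and activations once the weights are loaded.

    A negative result means the weights alone exceed the effective ceiling.
    """
    return effective_gb - weights_gb


print(f"effective share of total: {EFFECTIVE_GB / TOTAL_GB:.0%}")
print(f"200B-class at Q4 (~110 GB weights): {context_budget_gb(110)} GB for context")
print(f"160 GB footprint: {context_budget_gb(160)} GB (negative means over the ceiling)")
```

With ~110 GB of weights the helper returns 30 GB, which matches the 25-30 GB context budget quoted above.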

See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.

Topology: apple-unified

Interconnect: unified-memory, ~800 GB/s

Component count: 1 unit

Components: 1× mac-studio-m3-ultra

Recommended runtime: mlx-lm (also: llama-cpp, ollama, lm-studio)

Recommended split strategy: none

Setup difficulty: beginner

Power: ~370W peak

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.
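The utilization figure on each entry is the model's footprint divided by the 140 GB effective ceiling. A minimal sketch of that check, assuming the 85% threshold stated above; the function names are illustrative and not part of any RunLocalAI tool.

```python
EFFECTIVE_GB = 140      # effective ceiling used on this page
COMFORTABLE_MAX = 0.85  # "fits comfortably" threshold stated above


def utilization(footprint_gb: float, effective_gb: float = EFFECTIVE_GB) -> float:
    """Fraction of effective VRAM taken by the model footprint."""
    return footprint_gb / effective_gb


def fits_comfortably(footprint_gb: float) -> bool:
    """True when utilization is at or below the comfortable threshold."""
    return utilization(footprint_gb) <= COMFORTABLE_MAX


# Footprints taken from entries on this page.
for name, gb in [
    ("Mixtral 8x22B Instruct", 96),
    ("Qwen 2.5 72B Instruct", 48),
    ("GLM-5", 140),
]:
    print(f"{name}: {utilization(gb):.0%}, comfortable={fits_comfortably(gb)}")
```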

GLM-5 Pro

Fits

144B·AWQ-INT4 → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal

Mixtral 8x22B Instruct

Fits

141B·Q4_K_M → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal

WizardLM-2 8x22B

Fits

141B·Q4_K_M → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal

DBRX Base

Fits

132B·Q4_K_M → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal

DBRX Instruct

Fits

132B·AWQ-INT4 → 96 GB·69% of effective VRAM·~0% speed penalty vs ideal

Mistral Large 2 (123B)

Fits

123B·Q4_K_M → 88 GB·63% of effective VRAM·~0% speed penalty vs ideal

Nemotron 3 Super (120B-A12B)

Fits

120B·Q4_K_M → 84 GB·60% of effective VRAM·~0% speed penalty vs ideal

Llama 4 Scout

Fits

109B·Q4_K_M → 80 GB·57% of effective VRAM·~0% speed penalty vs ideal

Sarvam 105B

Fits

105B·Q4_K_M → 74 GB·53% of effective VRAM·~0% speed penalty vs ideal

Sarvam 105B FP8

Fits

105B·Q4_K_M → 74 GB·53% of effective VRAM·~0% speed penalty vs ideal

Command R+ (Aug 2024)

Fits

104B·AWQ-INT4 → 72 GB·51% of effective VRAM·~0% speed penalty vs ideal

Command R+ 104B

Fits

104B·Q4_K_M → 70 GB·50% of effective VRAM·~0% speed penalty vs ideal

Llama 3.2 90B Vision Instruct

Fits

90B·Q4_K_M → 60 GB·43% of effective VRAM·~0% speed penalty vs ideal

Llama 3.2 90B Vision

Fits

90B·AWQ-INT4 → 64 GB·46% of effective VRAM·~0% speed penalty vs ideal

InternVL 2.5 78B

Fits

78B·Q4_K_M → 52 GB·37% of effective VRAM·~0% speed penalty vs ideal

Molmo 72B

Fits

72B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

Qwen 2.5 Math 72B

Fits

72B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

Qwen 2.5 72B Instruct

Fits

72B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

Qwen 2.5-VL 72B

Fits

72B·AWQ-INT4 → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

Llama 4 70B

Fits

70B·AWQ-INT4 → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

Tulu 3 70B

Fits

70B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

Dolphin 3 Llama 3.3 70B

Fits

70B·AWQ-INT4 → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

EVA Llama 3.3 70B

Fits

70B·AWQ-INT4 → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

OpenBioLLM Llama 3 70B

Fits

70B·Q4_K_M → 48 GB·34% of effective VRAM·~0% speed penalty vs ideal

Borderline (6)

Fits in the 192 GB pool but sits at or above the ~140 GB effective ceiling, so there is little or no headroom. KV cache for long context may not fit; verify before deployment.
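To verify long-context headroom, the KV cache can be estimated from the model's shape: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below uses that standard formula for grouped-query attention with a 16-bit cache. The layer and head counts are hypothetical placeholders, so substitute the values from the model's own config; architectures that compress the cache (for example latent attention) need a different formula.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 16-bit cache entries
    batch: int = 1,
) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2 covers the separate key and value tensors in every layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim
        * context_tokens * bytes_per_value * batch
    )
    return total_bytes / 1e9


# Hypothetical shape for illustration only: 80 layers, 8 KV heads, head_dim 128.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(80, 8, 128, ctx):.1f} GB")
```

Compare the result against whatever is left of the effective ceiling after the weights are loaded.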

Llama 3.1 Nemotron Ultra 253B

Borderline

253B·Q4_K_M → 160 GB·114% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization is 114%, so KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

DeepSeek V2.5 236B

Borderline

236B·Q4_K_M → 160 GB·114% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization is 114%, so KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

DeepSeek Coder V2 236B

Borderline

236B·Q4_K_M → 160 GB·114% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization is 114%, so KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Qwen 3 235B-A22B

Borderline

235B·Q4_K_M → 160 GB·114% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization is 114%, so KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

GLM-5

Borderline

200B·Q4_K_M → 140 GB·100% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization is 100%, so KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Kimi K1.5

Borderline

200B·AWQ-INT4 → 140 GB·100% of effective VRAM·~0% speed penalty vs ideal

Effective VRAM utilization is 100%, so KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Not practical (8)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.

DeepSeek V4 Pro (1.6T MoE)

Not practical

1600B·Q4_K_M → 1024 GB·731% of effective VRAM·~0% speed penalty vs ideal