Apple clusterThunderboltexpert

What runs on 4× Mac Mini M4 Pro Exo cluster (256 GB total)?

Four Mac Mini M4 Pro nodes with 64 GB unified memory each, connected via Thunderbolt 5. Exo distributes layers across machines. 256 GB total / ~180 GB effective for inference.

At a glance

Effective VRAM

180 / 256 GB

Not pooled

Speed penalty

~60%

vs ideal single-card

Recommended runtime

exo

pipeline parallel

Setup difficulty

expert

~600W peak

Models fit

Borderline

Not practical

Deployment recipe

Multi-machine Apple cluster →

Exo-based multi-Mac sharding over Thunderbolt 5 — the cluster recipe for >192GB unified memory targets.

Memory budget

Total VRAM

256 GB

Effective for inference

180 GB

70% of total

Not pooled

Exo clusters 4 Macs into a single inference target by sharding model layers across machines. Total memory is 4× 64 = 256 GB, but each node reserves OS overhead and KV-cache buffers, and inter-node communication costs ~10-15% effective capacity. Concretely: a 200B-class model at Q4 (~110 GB) distributes across 4 nodes with ~25-30 GB per node, leaving comfortable headroom on each. Thunderbolt 5 (80 Gbps bidirectional, 120 Gbps in display mode) is the communication path — meaningfully faster than 10 GbE but ~10× slower than NVLink. Layer-split via Exo is the only practical strategy; tensor parallelism over Thunderbolt is too latency-bound to be useful.

Why total VRAM is not the whole story

Distributed across multiple Macs over Thunderbolt. Each node runs Exo's layer shard — total 256 GB of memory but inter-node latency caps single-stream speed. Effective ~180 GB after cluster overhead.

See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.

Topology

apple-cluster

Interconnect

thunderbolt~80 GB/s

Component count

4 units

Components

4×apple-m4-pro

Recommended runtime

exo

Also: mlx-lm

Recommended split strategy

pipeline-parallel

Also: layer-split

Setup difficulty

expert

~600W peak

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.

GLM-5

Fits

200B·Q4_K_M → 140 GB·78% of effective VRAM·~60% speed penalty vs ideal

Kimi K1.5

Fits

200B·AWQ-INT4 → 140 GB·78% of effective VRAM·~60% speed penalty vs ideal

GLM-5 Pro

Fits

144B·AWQ-INT4 → 96 GB·53% of effective VRAM·~60% speed penalty vs ideal

Mixtral 8x22B Instruct

Fits

141B·Q4_K_M → 96 GB·53% of effective VRAM·~60% speed penalty vs ideal

WizardLM-2 8x22B

Fits

141B·Q4_K_M → 96 GB·53% of effective VRAM·~60% speed penalty vs ideal

DBRX Base

Fits

132B·Q4_K_M → 96 GB·53% of effective VRAM·~60% speed penalty vs ideal

DBRX Instruct

Fits

132B·AWQ-INT4 → 96 GB·53% of effective VRAM·~60% speed penalty vs ideal

Mistral Large 2 (123B)

Fits

123B·Q4_K_M → 88 GB·49% of effective VRAM·~60% speed penalty vs ideal

Nemotron 3 Super (120B-A12B)

Fits

120B·Q4_K_M → 84 GB·47% of effective VRAM·~60% speed penalty vs ideal

Llama 4 Scout

Fits

109B·Q4_K_M → 80 GB·44% of effective VRAM·~60% speed penalty vs ideal

Sarvam 105B

Fits

105B·Q4_K_M → 74 GB·41% of effective VRAM·~60% speed penalty vs ideal

Sarvam 105B FP8

Fits

105B·Q4_K_M → 74 GB·41% of effective VRAM·~60% speed penalty vs ideal

Command R+ (Aug 2024)

Fits

104B·AWQ-INT4 → 72 GB·40% of effective VRAM·~60% speed penalty vs ideal

Command R+ 104B

Fits

104B·Q4_K_M → 70 GB·39% of effective VRAM·~60% speed penalty vs ideal

Llama 3.2 90B Vision Instruct

Fits

90B·Q4_K_M → 60 GB·33% of effective VRAM·~60% speed penalty vs ideal

Llama 3.2 90B Vision

Fits

90B·AWQ-INT4 → 64 GB·36% of effective VRAM·~60% speed penalty vs ideal

InternVL 2.5 78B

Fits

78B·Q4_K_M → 52 GB·29% of effective VRAM·~60% speed penalty vs ideal

Molmo 72B

Fits

72B·Q4_K_M → 48 GB·27% of effective VRAM·~60% speed penalty vs ideal

Qwen 2.5 Math 72B

Fits

72B·Q4_K_M → 48 GB·27% of effective VRAM·~60% speed penalty vs ideal

Qwen 2.5 72B Instruct

Fits

72B·Q4_K_M → 48 GB·27% of effective VRAM·~60% speed penalty vs ideal

Qwen 2.5-VL 72B

Fits

72B·AWQ-INT4 → 48 GB·27% of effective VRAM·~60% speed penalty vs ideal

Llama 4 70B

Fits

70B·AWQ-INT4 → 48 GB·27% of effective VRAM·~60% speed penalty vs ideal

Tulu 3 70B

Fits

70B·Q4_K_M → 48 GB·27% of effective VRAM·~60% speed penalty vs ideal

Dolphin 3 Llama 3.3 70B

Fits

70B·AWQ-INT4 → 48 GB·27% of effective VRAM·~60% speed penalty vs ideal

Borderline (6)

Fits but with little headroom. KV cache for long context may not fit; verify before deployment.

DeepSeek V4 Flash (284B MoE)

Borderline

284B·Q4_K_M → 192 GB·107% of effective VRAM·~60% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Llama 3.1 Nemotron Ultra 253B

Borderline

253B·Q4_K_M → 160 GB·89% of effective VRAM·~60% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

DeepSeek V2.5 236B

Borderline

236B·Q4_K_M → 160 GB·89% of effective VRAM·~60% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

DeepSeek Coder V2 236B

Borderline

236B·Q4_K_M → 160 GB·89% of effective VRAM·~60% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

K-EXAONE 236B A23B

Borderline

236B·Q4_K_M → 166 GB·92% of effective VRAM·~60% speed penalty vs ideal

Effective VRAM utilization >92% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Qwen 3 235B-A22B

Borderline

235B·Q4_K_M → 160 GB·89% of effective VRAM·~60% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

Not practical (8)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.

DeepSeek V4 Pro (1.6T MoE)

Not practical

1600B·Q4_K_M → 1024 GB·569% of effective VRAM·~60% speed penalty vs ideal