RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Apple cluster · Thunderbolt · expert

What runs on 4× Mac Mini M4 Pro Exo cluster (256 GB total)?

Four Mac Mini M4 Pro nodes with 64 GB unified memory each, connected via Thunderbolt 5. Exo distributes layers across machines. 256 GB total / ~180 GB effective for inference.

At a glance
Effective VRAM: 180 / 256 GB (not pooled)
Speed penalty: ~60% vs ideal single-card
Recommended runtime: exo (pipeline parallel)
Setup difficulty: expert · ~600W peak
Models fit: 24
Borderline: 6
Not practical: 8
Deployment recipe
Multi-machine Apple cluster →

Exo-based multi-Mac sharding over Thunderbolt 5 — the cluster recipe for >192GB unified memory targets.

Memory budget
Total VRAM: 256 GB
Effective for inference: 180 GB (70% of total, not pooled)

Exo clusters 4 Macs into a single inference target by sharding model layers across machines. Total memory is 4 × 64 = 256 GB, but each node reserves OS overhead and KV-cache buffers, and inter-node communication costs a further ~10-15% of effective capacity, which leaves ~180 GB. Concretely: the weights of a 200B-class model at Q4 (~110 GB) distribute across 4 nodes at ~25-30 GB per node, leaving comfortable headroom on each 64 GB node. Thunderbolt 5 (80 Gbps bidirectional, up to 120 Gbps one-way in its display-oriented boost mode) is the communication path: meaningfully faster than 10 GbE but at least ~10× slower than NVLink. Layer-split via Exo is the only practical strategy; tensor parallelism over Thunderbolt is too latency-bound to be useful.
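The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not the site's calculator: the 70% usable fraction and the ~110 GB weight figure come from this page, while the function names are hypothetical.

```python
# Minimal sketch of the memory-budget arithmetic described above.
# The 0.70 usable fraction is the page's figure; function names are illustrative.

def effective_memory_gb(nodes: int, gb_per_node: float, usable_fraction: float = 0.70) -> float:
    """Total unified memory scaled by what survives OS, KV-cache, and cluster overhead."""
    return nodes * gb_per_node * usable_fraction


def per_node_share_gb(model_gb: float, nodes: int) -> float:
    """Even layer split: each node holds roughly an equal slice of the weights."""
    return model_gb / nodes


if __name__ == "__main__":
    total = 4 * 64
    effective = effective_memory_gb(4, 64)
    share = per_node_share_gb(110, 4)
    print(f"total={total} GB, effective≈{effective:.0f} GB, per-node weights≈{share:.1f} GB")
```

With the page's numbers, the per-node share of ~27.5 GB sits inside the ~25-30 GB range quoted above and well under each node's 64 GB.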

Why total VRAM is not the whole story

The model is distributed across multiple Macs over Thunderbolt, and each node runs its own Exo layer shard. The cluster holds 256 GB of memory in total, but inter-node latency caps single-stream speed, and cluster overhead leaves ~180 GB effective.

See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.

Topology

Topology: apple-cluster
Interconnect: thunderbolt, ~80 Gbps (≈10 GB/s)
Component count: 4 units
Components: 4× apple-m4-pro
Recommended runtime: exo (also: mlx-lm)
Recommended split strategy: pipeline-parallel (also: layer-split)
Setup difficulty: expert · ~600W peak
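To make the pipeline-parallel split concrete, here is a small sketch that assigns contiguous layer ranges to nodes in proportion to their memory. It is an illustration of the general idea only, not Exo's actual partitioning code; `partition_layers` and the 80-layer model are hypothetical.

```python
# Illustrative memory-weighted layer split (an assumption-level sketch,
# not Exo's implementation). Each node gets a contiguous range of layers.

def partition_layers(num_layers: int, node_memory_gb: list[float]) -> list[range]:
    """Split layers 0..num_layers-1 into contiguous ranges sized by node memory."""
    total = sum(node_memory_gb)
    bounds = [0]
    cumulative = 0.0
    for mem in node_memory_gb:
        cumulative += mem
        # Cumulative rounding keeps the ranges contiguous and covers every layer.
        bounds.append(round(num_layers * cumulative / total))
    return [range(bounds[i], bounds[i + 1]) for i in range(len(node_memory_gb))]


if __name__ == "__main__":
    # Four identical 64 GB nodes and a hypothetical 80-layer model.
    for node, layers in enumerate(partition_layers(80, [64, 64, 64, 64])):
        print(f"node {node}: layers {layers.start}-{layers.stop - 1}")
```

With four identical nodes the split is even; the memory weighting only matters when nodes differ. Activations cross the interconnect once per range boundary, which is why single-stream speed is sensitive to Thunderbolt latency.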

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.

GLM-5: Fits · 200B · Q4_K_M → 140 GB · 78% of effective VRAM · ~60% speed penalty vs ideal
Kimi K1.5: Fits · 200B · AWQ-INT4 → 140 GB · 78% of effective VRAM · ~60% speed penalty vs ideal
GLM-5 Pro: Fits · 144B · AWQ-INT4 → 96 GB · 53% of effective VRAM · ~60% speed penalty vs ideal
Mixtral 8x22B Instruct: Fits · 141B · Q4_K_M → 96 GB · 53% of effective VRAM · ~60% speed penalty vs ideal
WizardLM-2 8x22B: Fits · 141B · Q4_K_M → 96 GB · 53% of effective VRAM · ~60% speed penalty vs ideal
DBRX Base: Fits · 132B · Q4_K_M → 96 GB · 53% of effective VRAM · ~60% speed penalty vs ideal
DBRX Instruct: Fits · 132B · AWQ-INT4 → 96 GB · 53% of effective VRAM · ~60% speed penalty vs ideal
Mistral Large 2 (123B): Fits · 123B · Q4_K_M → 88 GB · 49% of effective VRAM · ~60% speed penalty vs ideal
Nemotron 3 Super (120B-A12B): Fits · 120B · Q4_K_M → 84 GB · 47% of effective VRAM · ~60% speed penalty vs ideal
Llama 4 Scout: Fits · 109B · Q4_K_M → 80 GB · 44% of effective VRAM · ~60% speed penalty vs ideal
Sarvam 105B: Fits · 105B · Q4_K_M → 74 GB · 41% of effective VRAM · ~60% speed penalty vs ideal
Sarvam 105B FP8: Fits · 105B · Q4_K_M → 74 GB · 41% of effective VRAM · ~60% speed penalty vs ideal
Command R+ (Aug 2024): Fits · 104B · AWQ-INT4 → 72 GB · 40% of effective VRAM · ~60% speed penalty vs ideal
Command R+ 104B: Fits · 104B · Q4_K_M → 70 GB · 39% of effective VRAM · ~60% speed penalty vs ideal
Llama 3.2 90B Vision Instruct: Fits · 90B · Q4_K_M → 60 GB · 33% of effective VRAM · ~60% speed penalty vs ideal
Llama 3.2 90B Vision: Fits · 90B · AWQ-INT4 → 64 GB · 36% of effective VRAM · ~60% speed penalty vs ideal
InternVL 2.5 78B: Fits · 78B · Q4_K_M → 52 GB · 29% of effective VRAM · ~60% speed penalty vs ideal
Molmo 72B: Fits · 72B · Q4_K_M → 48 GB · 27% of effective VRAM · ~60% speed penalty vs ideal
Qwen 2.5 Math 72B: Fits · 72B · Q4_K_M → 48 GB · 27% of effective VRAM · ~60% speed penalty vs ideal
Qwen 2.5 72B Instruct: Fits · 72B · Q4_K_M → 48 GB · 27% of effective VRAM · ~60% speed penalty vs ideal
Qwen 2.5-VL 72B: Fits · 72B · AWQ-INT4 → 48 GB · 27% of effective VRAM · ~60% speed penalty vs ideal
Llama 4 70B: Fits · 70B · AWQ-INT4 → 48 GB · 27% of effective VRAM · ~60% speed penalty vs ideal
Tulu 3 70B: Fits · 70B · Q4_K_M → 48 GB · 27% of effective VRAM · ~60% speed penalty vs ideal
Dolphin 3 Llama 3.3 70B: Fits · 70B · AWQ-INT4 → 48 GB · 27% of effective VRAM · ~60% speed penalty vs ideal

Borderline (6)

Fits but with little headroom. KV cache for long context may not fit; verify before deployment.

DeepSeek V4 Flash (284B MoE): Borderline · 284B · Q4_K_M → 192 GB · 107% of effective VRAM · ~60% speed penalty vs ideal
Effective VRAM utilization is ~107%: the weights exceed the 180 GB effective budget and fit only within the 256 GB total, so KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Llama 3.1 Nemotron Ultra 253B: Borderline · 253B · Q4_K_M → 160 GB · 89% of effective VRAM · ~60% speed penalty vs ideal
Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

DeepSeek V2.5 236B: Borderline · 236B · Q4_K_M → 160 GB · 89% of effective VRAM · ~60% speed penalty vs ideal
Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

DeepSeek Coder V2 236B: Borderline · 236B · Q4_K_M → 160 GB · 89% of effective VRAM · ~60% speed penalty vs ideal
Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

K-EXAONE 236B A23B: Borderline · 236B · Q4_K_M → 166 GB · 92% of effective VRAM · ~60% speed penalty vs ideal
Effective VRAM utilization is ~92%, so KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Qwen 3 235B-A22B: Borderline · 235B · Q4_K_M → 160 GB · 89% of effective VRAM · ~60% speed penalty vs ideal
Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.
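One way to verify the KV cache budget is the standard per-token formula for grouped-query attention: 2 (keys and values) × layers × KV heads × head dimension × bytes per value × context length. The sketch below is a rough check under stated assumptions: the dimensions are those of a Llama-3-70B-style model and are used purely for illustration, `kv_cache_gb` is a hypothetical helper, and models with compressed attention caches (for example multi-head latent attention) need a different formula.

```python
# Rough KV-cache size estimate for a grouped-query-attention transformer.
# Model dimensions below are illustrative assumptions, not values from this page.

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2, batch: int = 1) -> float:
    """Keys + values for every layer and every cached token, in decimal GB."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * batch / 1e9


def headroom_gb(model_gb: float, effective_gb: float = 180.0) -> float:
    """Memory left for KV cache and activations after the weights are loaded."""
    return effective_gb - model_gb


if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens: {kv_cache_gb(80, 8, 128, ctx):.1f} GB of fp16 KV cache")
    print(f"headroom for a 160 GB model: {headroom_gb(160):.0f} GB across the cluster")
```

Because layers are sharded, each node stores the KV cache only for its own layers, so the headroom has to hold per node as well as in aggregate.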

Not practical (8)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.

DeepSeek V4 Pro (1.6T MoE): Not practical · 1600B · Q4_K_M → 1024 GB · 569% of effective VRAM · ~60% speed penalty vs ideal
Step-3: Not practical · 1000B · AWQ-INT4 → 640 GB · 356% of effective VRAM · ~60% speed penalty vs ideal
Kimi K2.6: Not practical · 1000B · Q4_K_M → 700 GB · 389% of effective VRAM · ~60% speed penalty vs ideal
DeepSeek V4: Not practical · 745B · AWQ-INT4 → 480 GB · 267% of effective VRAM · ~60% speed penalty vs ideal
Mistral Medium 3.5 (675B MoE): Not practical · 675B · Q4_K_M → 448 GB · 249% of effective VRAM · ~60% speed penalty vs ideal
DeepSeek R1 (671B reasoning): Not practical · 671B · Q4_K_M → 420 GB · 233% of effective VRAM · ~60% speed penalty vs ideal
DeepSeek V3 (671B MoE): Not practical · 671B · Q4_K_M → 420 GB · 233% of effective VRAM · ~60% speed penalty vs ideal
Llama 4 405B: Not practical · 405B · AWQ-INT4 → 280 GB · 156% of effective VRAM · ~60% speed penalty vs ideal
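The three tiers can be reproduced with a small classifier. The 85% "fits" threshold is stated on this page; the boundary between "Borderline" and "Not practical" is not, so the sketch assumes it is the cluster's 256 GB total memory, which happens to match every entry listed here. Treat `fit_tier` as a hypothetical helper, not the Will-It-Run Framework's actual rule.

```python
# Sketch of a fit-tier classifier consistent with the lists on this page.
# Assumption: "Not practical" starts where the model no longer fits in total memory.

def utilization_pct(model_gb: float, effective_gb: float = 180.0) -> int:
    """Share of effective VRAM the model occupies, rounded to a whole percent."""
    return round(100 * model_gb / effective_gb)


def fit_tier(model_gb: float, effective_gb: float = 180.0, total_gb: float = 256.0) -> str:
    if model_gb / effective_gb <= 0.85:   # threshold stated on the page
        return "Fits"
    if model_gb <= total_gb:              # assumed boundary, not from the page
        return "Borderline"
    return "Not practical"


if __name__ == "__main__":
    for name, size_gb in [("GLM-5", 140), ("Qwen 3 235B-A22B", 160),
                          ("DeepSeek V4 Flash", 192), ("Llama 4 405B", 280)]:
        print(f"{name}: {utilization_pct(size_gb)}% of effective VRAM, {fit_tier(size_gb)}")
```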

Benchmark opportunities

estimates, not measurements

Pending benchmark targets for this combo. Once measured, results land in the catalog as benchmarks.

4× Mac Mini M4 Pro Exo cluster + Llama 3.1 70B (MLX-4bit)
pending
Estimate: 4-9 tok/s decode (Thunderbolt 5 inter-node)

Multi-Mac Exo cluster. A 70B model at MLX-4bit (~40 GB) shards across 4 nodes at ~10 GB each, and Thunderbolt 5 latency dominates. Compare against a single Mac Studio M3 Ultra to quantify cluster overhead.
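A back-of-envelope check on the estimate: in single-stream pipeline-parallel decode, each token has to read every node's weight shard in turn, so memory bandwidth sets a rough pace before any interconnect cost. The sketch below assumes 273 GB/s of memory bandwidth per M4 Pro (a vendor figure, not from this page) and a hypothetical per-hop latency parameter; it is an estimate, not a measurement.

```python
# Bandwidth-bound decode estimate for single-stream pipeline parallelism.
# Inputs are assumptions: 273 GB/s per node and a placeholder per-hop latency.

def decode_tokens_per_second(model_gb: float, mem_bandwidth_gb_s: float,
                             nodes: int, hop_latency_s: float = 0.0) -> float:
    """Nodes work one after another, so shard read times and hop latencies add up."""
    shard_read_s = (model_gb / nodes) / mem_bandwidth_gb_s
    per_token_s = nodes * (shard_read_s + hop_latency_s)
    return 1.0 / per_token_s


if __name__ == "__main__":
    no_latency = decode_tokens_per_second(40, 273, nodes=4)
    with_latency = decode_tokens_per_second(40, 273, nodes=4, hop_latency_s=0.001)
    print(f"no interconnect cost: {no_latency:.1f} tok/s")
    print(f"1 ms per hop (hypothetical): {with_latency:.1f} tok/s")
```

Under these assumptions the bandwidth-only figure is about 6.8 tok/s, inside the 4-9 tok/s range above. Adding nodes does not speed up a single stream in this model, because the total bytes read per token stay the same while hop latency grows.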

Going deeper

  • Full combo detail page — operational review with failure modes and runtime matrix.
  • Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
  • RunLocalAI Will-It-Run Framework — citable effective-VRAM, working-set, fit-tier, and evidence-tier method.
  • Will-it-run home — single-card check + custom builds.