Stack · L3 execution · Workstation tier

Build a Mac-native AI stack (May 2026)

A Mac-native local AI stack that takes full advantage of unified memory and (optionally) scales across multiple Macs via Thunderbolt 5 — runs 32B-class models comfortably on a single Mac, frontier-class models across a cluster.

By Fredoline Eruo · Last reviewed 2026-05-06 · ~12 min read
The stack
  1. Hardware · Compute (Apple Silicon GPU + unified memory)
    apple-m3-max

    M3 Max 64GB is the single-Mac sweet spot in May 2026 — 40 GPU cores, 400 GB/s memory bandwidth, all 64GB addressable as VRAM. M4 Pro / M4 Max win on Thunderbolt 5 RDMA for clustering; for single-Mac use, M3 Max delivers 90% of M4 Max throughput at lower price.

  2. Tool · Inference engine (Apple-native)
    mlx-lm

    MLX-LM over llama.cpp on M-series silicon: matched throughput on short context, ~15-25% faster on long context (32K+), and the path that pairs with Exo for cluster scaling. Use llama.cpp when you need GGUF quants MLX hasn't picked up yet.

  3. Tool · Model-swap layer (ad-hoc experimentation)
    ollama

    Ollama on Mac uses llama.cpp under the hood — runs alongside MLX-LM for the 'pull a new model right now' workflow. It plays a different role than MLX-LM (Ollama wraps llama.cpp; MLX-LM is the Apple-native path), and the two stay resident on different ports.

  4. Tool · Distributed serving (multi-Mac cluster)
    exo

    Exo is what makes multi-Mac credible in 2026: auto-discovers nearby Apple Silicon devices on the LAN, shards models across them via pipeline parallel on top of MLX. Thunderbolt 5 + macOS 26.2 RDMA cuts inter-device latency by ~99%, turning consumer-Mac clusters into a real serving option.

  5. Model · Coding model (single-Mac primary)
    qwen-2.5-coder-32b-instruct

    Qwen 2.5 Coder 32B in MLX-4bit quant runs comfortably on a 64GB M3 Max with room for 32K context. Beats DeepSeek Coder V2 Lite on coding benchmarks at the same memory footprint.

  6. Tool · Chat frontend
    openwebui

    Open WebUI runs in Docker Desktop or directly via pip, and talks to MLX-LM's OpenAI-compatible bridge. Same multi-user ergonomics as on Linux/Windows; native Apple Silicon container performance is now within 5% of bare metal.

  7. Tool · MCP host (agent workflows)
    claude-desktop

    Claude Desktop is the native macOS MCP host with the strictest spec implementation. Pairs with MCP servers (filesystem, git, search) to give agentic workflows a polished native UI. Use Claude Desktop alongside Open WebUI — different roles.

Why Apple Silicon is no longer second-class

Three things changed through 2025-2026 that make this stack a serious option for the first time:

MLX-LM caught up. Through 2024 the consensus was “llama.cpp Metal beats MLX on Apple Silicon for everything except long-context.” Through 2025 MLX closed the throughput gap on short context and extended its long-context lead. As of May 2026, MLX-LM matches or exceeds llama.cpp Metal across the workloads most users care about.

Thunderbolt 5 + macOS 26.2 RDMA shipped. On M4 Pro+ hardware running macOS 26.2, Thunderbolt 5 cables carry RDMA — Remote Direct Memory Access — between Macs at near-PCIe speeds. Inter-device latency for tensor parallel dropped by ~99% compared to the pre-RDMA path. That single change made consumer-Mac clusters credible for serving frontier-class models.

Exo matured. The auto-discovery LAN cluster tool sits on top of MLX; as of the May 2026 release, DeepSeek V3 671B runs at 5.37 tok/s on 8x M4 Pro Mac Minis — slower than a datacenter cluster, but on hardware most serious developers can actually buy.

Step-by-step setup (single Mac)

1. Install MLX-LM as the inference engine

# Install MLX-LM (uv is fastest if you have it; plain pip works too)
uv tool install mlx-lm
# or: pip install mlx-lm

# Pull and serve a coding model in MLX 4-bit quant
mlx_lm.server \
  --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit \
  --port 8000 \
  --host 127.0.0.1

The MLX server exposes an OpenAI-compatible /v1 endpoint on the port you pick. First load downloads the model (~18GB for the 4-bit quant) and warms the Metal kernels — expect 30-60 seconds before first token on cold start.
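
A quick smoke test once the server is up. The route and model name match the serve command above; if your mlx-lm version reports a different model ID, check /v1/models first:

# Ask the served model for a one-liner; expect a JSON chat completion back
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Reverse a string in Python, one line."}],
    "max_tokens": 64
  }'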

2. Add Ollama for ad-hoc model swaps

# Install Ollama natively (uses llama.cpp under the hood)
brew install ollama

# Pull a smaller chat model that complements the MLX coding model
ollama serve &
ollama pull qwen3:14b

# Verify both runtimes alive on different ports
curl http://localhost:8000/v1/models   # MLX-LM (Qwen Coder 32B)
curl http://localhost:11434/api/tags   # Ollama (Qwen 3 14B)

Ollama and MLX-LM coexist on the same Mac: each process loads its own model weights into unified memory, and Metal schedules GPU work across both. Total memory under load is ~26GB unified; once Ollama unloads its idle model, that drops back to ~14GB. The 64GB M3 Max has comfortable headroom for both plus your normal workflow.
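
One caveat if you push toward the full 64GB: macOS caps how much unified memory the GPU may wire at once, typically around three quarters of RAM by default. If a larger model refuses to load even though memory looks free, the cap can be raised. A hedged sketch: the sysctl below is the one commonly cited for recent macOS releases, the value is in MB (here ~56GB on a 64GB machine), and it resets on reboot:

# Raise the GPU wired-memory limit (example value for a 64GB Mac; not persistent across reboots)
sudo sysctl iogpu.wired_limit_mb=57344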

3. Wire Open WebUI as the chat frontend

# Run Open WebUI in Docker Desktop on Apple Silicon (native ARM)
docker run -d --name open-webui \
  -p 3000:8080 \
  --restart unless-stopped \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
  -e OPENAI_API_KEYS="any-string" \
  -e ENABLE_OLLAMA_API=true \
  -e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
  ghcr.io/open-webui/open-webui:latest

Native ARM containers on Apple Silicon Docker Desktop run within ~5% of bare metal performance now. Open WebUI sees both backends and the model switcher works the same as on Linux/Windows.
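
If you'd rather skip Docker entirely, Open WebUI also ships as a Python package. A minimal sketch, assuming the upstream PyPI package and its bundled serve command (needs Python 3.11):

# Native install without Docker; point it at the same backends from the admin settings or env vars
pip install open-webui
open-webui serve --port 3000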

4. Add Claude Desktop with MCP for agentic workflows

# Install Claude Desktop (via Mac App Store or direct download)
# Then edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/you/projects"]
    },
    "git": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-git", "--repository", "/Users/you/projects/main-repo"]
    }
  }
}

Claude Desktop is the native macOS MCP host with the strictest spec implementation — different role from Open WebUI, used for agentic workflows that need filesystem and git access. Restart Claude Desktop after editing config; the MCP servers launch on app startup.
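
Two quick checks before blaming Claude Desktop itself, both assuming the paths from the example config above:

# 1. Confirm the config file is well-formed JSON
python3 -m json.tool "$HOME/Library/Application Support/Claude/claude_desktop_config.json"

# 2. Confirm the MCP server launches on its own (it speaks MCP over stdio, so it sits silently; Ctrl-C to quit)
npx -y @modelcontextprotocol/server-filesystem /Users/you/projects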

Multi-Mac clustering with Exo

The single-Mac stack runs comfortably to ~70B models in MLX-4bit. To go larger — DeepSeek V3, Llama 3.1 405B class — you need a cluster. Exo turns 2-8 Apple Silicon devices on the same LAN into one logical inference target.
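
How big a cluster? A back-of-envelope check, assuming 4-bit weights at roughly 0.5 bytes per parameter and 64GB per node; KV cache, activations, and the OS eat into whatever is left:

# DeepSeek V3 671B at 4-bit, weights only:
#   671B params x 0.5 bytes/param ≈ 335 GB
# 8x M4 Pro Mac Minis at 64GB each:
#   8 x 64 GB = 512 GB of unified memory, ~175 GB of headroom for KV cache and the OS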

# Install on every Mac in the cluster
brew install exo

# On the Mac you'll use as the entry point, just run:
exo

# Exo auto-discovers other Macs on the LAN running exo and
# shards model layers via pipeline parallel. With Thunderbolt 5
# RDMA enabled (macOS 26.2 + M4 Pro+), the inter-device latency
# is near-PCIe — 8x M4 Pro Mac Minis run DeepSeek V3 671B at
# 5.37 tok/s, which is genuinely usable.

See /systems/distributed-inference for the architectural depth on what's actually happening when Exo shards a model across machines, and the conditions under which Thunderbolt 5 RDMA pays for the extra hardware (it usually does for 70B+ models that don't fit a single Mac; it usually doesn't for smaller models that already fit).

Failure modes you'll hit

  1. Metal kernel cold start. First inference after a fresh model load takes 30-60 seconds longer than expected. The Metal compiler is JIT-compiling kernels on first dispatch. Subsequent calls are fast. Pre-warm by sending a 10-token prompt at server startup (a sketch follows this list).
  2. Activity Monitor shows GPU at 100% but tok/s is low. Almost always thermal throttling under sustained load on a laptop chassis (a MacBook Pro running inference for minutes at a stretch). Plug into power, lift the laptop off the desk for airflow, or move to a Mac Studio / Mac Mini for sustained inference.
  3. Exo doesn't auto-discover other Macs. Multicast DNS is blocked by some routers. Fix: point Exo at peer IPs explicitly with exo --discovery=manual --peers=192.168.1.5,192.168.1.6.
  4. Thunderbolt 5 RDMA falls back to non-RDMA. One node on macOS 26.1 silently downgrades the cluster. Verify that system_profiler SPThunderboltDataType | grep RDMA shows it enabled on every node.
  5. MLX-LM doesn't support a quant format you need. MLX has its own quant format (mlx-community/*-4bit); GGUF support is limited to specific architectures. If a model is only available as GGUF, run it via Ollama instead.
  6. Open WebUI Docker Desktop high CPU on idle. Apple Silicon Docker Desktop can pin a CPU core at 10-20% even with no containers running. In Docker Desktop, go to Preferences → Resources → Advanced and limit it to 4 CPU cores; the savings on battery are noticeable.
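
For failure mode 1, a minimal pre-warm sketch. It assumes the MLX-LM server and model name from step 1; all it does is wait for the server and burn ten tokens to trigger kernel compilation:

# Wait until the server answers, then send a tiny prompt to force Metal kernel compilation
until curl -s http://127.0.0.1:8000/v1/models > /dev/null; do sleep 2; done
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit", "messages": [{"role": "user", "content": "warm up"}], "max_tokens": 10}' \
  > /dev/null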

Variations and alternatives

llama.cpp instead of MLX-LM. If you need GGUF compatibility (sharing models with Linux/Windows users) or a model MLX hasn't picked up, swap the inference engine. The rest of the stack (Ollama, Open WebUI, Claude Desktop, Exo) stays the same — Exo can drive llama.cpp via its OpenAI-compatible bridge.
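
A minimal swap, assuming the Homebrew llama.cpp formula and a GGUF file you've already downloaded (the model path is a placeholder):

# llama-server exposes an OpenAI-compatible /v1 endpoint, so the frontends above don't change
brew install llama.cpp
llama-server \
  -m ~/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8000 -c 32768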

M4 Pro / M4 Max instead of M3 Max. Pick M4 Pro / M4 Max if you plan to cluster — Thunderbolt 5 RDMA only works on those generations. For single-Mac use, M3 Max delivers ~90% of the throughput at lower cost.

Cross-platform homelab variation. If your stack mixes Apple Silicon and a Linux GPU box, see the RTX 4090 workstation stack for the GPU side. Both stacks expose OpenAI-compatible endpoints; a single Open WebUI instance can show models from both as siblings.
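
Open WebUI accepts multiple OpenAI-compatible backends as a semicolon-separated list, so the docker run from step 3 only needs its two OPENAI_* lines swapped; the Linux box's address and port here are placeholders:

# Point one Open WebUI at both the Mac (MLX-LM) and the Linux GPU box
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1;http://192.168.1.20:8000/v1" \
  -e OPENAI_API_KEYS="any-string;any-string" \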

Coding-agent specialisation. If your workload is mostly autonomous coding, see the dedicated local coding-agent stack — same MLX-LM engine but specialised around OpenHands + Mem0 + git-MCP rather than a generalist Open WebUI surface.

Going deeper

  • Apple M3 Max catalog entry — unified-memory characteristics, GPU core scaling, thermal envelope under sustained load.
  • MLX-LM catalog entry — the Apple-native inference path with quant format details and architecture coverage.
  • Exo catalog entry — the multi-Mac clustering layer, including the Thunderbolt 5 RDMA prerequisite and how to verify it's active.
  • /systems/distributed-inference — protocol-engineering depth on what happens when Exo shards a model across machines, and the latency math that determines whether the cluster pays for itself.
  • Inference runtime ecosystem map — where MLX-LM and Ollama sit relative to vLLM / SGLang / llama.cpp and the broader landscape.