
Apple Silicon: M2 / M3 / M4 local AI

For: Mac owners on M2, M3, or M4 silicon with 16GB or more unified memory. By the end: a 14B-32B-class model running on your Mac at usable speed, with the right runtime + quant for your chip and memory size.

By Fredoline Eruo · 7 milestones · Last reviewed 2026-05-07

Apple Silicon is genuinely good at local AI inference and genuinely different from running on NVIDIA. Unified memory means a 64GB M-series chip behaves more like a "64GB GPU" than a "16GB GPU + 48GB RAM" — but memory bandwidth, not capacity, becomes the constraint at large model sizes. This path moves you from a clean Ollama install to MLX-LM as the production runtime, with realistic expectations per chip tier.

Identify your chip and memory honestly

Base M2 / M3 / M4 chips have roughly 100-120GB/s of memory bandwidth. Pro parts sit around 150-275GB/s, Max around 300-550GB/s, and Ultra at 800GB/s or more. Tokens per second on a memory-bound LLM scales roughly with bandwidth, so an M4 Max at 64GB can outpace an M2 at 96GB on the same model. Memory size sets your model ceiling; bandwidth sets your speed.

Heuristic: a chip with X GB of unified memory can comfortably load models up to about (X − 8)GB. The OS needs the rest.
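
If you want the terminal's answer rather than About This Mac, the standard macOS sysctl keys report both values; bandwidth still has to be looked up for your chip tier. A minimal sketch:

  # Chip name, e.g. "Apple M3 Pro"
  sysctl -n machdep.cpu.brand_string

  # Unified memory in GB
  echo "$(($(sysctl -n hw.memsize) / 1073741824)) GB installed"

  # Rough model ceiling per the heuristic above: total memory minus ~8GB for the OS
  echo "$(($(sysctl -n hw.memsize) / 1073741824 - 8)) GB model ceiling"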

When this is done you should have
Recorded: chip tier (M2 / M3 / M4 Pro / Max / Ultra), memory size, and memory bandwidth. Realistic model-size ceiling identified.

Install Ollama with the Metal backend

Ollama on macOS uses Metal automatically: no driver hunting, no CUDA install, no kernel pin. The starter installation is literally one binary. Pull a 7B Q4 model, run it, and check that it generates faster than 15 tok/s. If it doesn't, you have a thermal or background-process problem, not a configuration one.
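
A minimal sketch of that loop, assuming Homebrew and using llama3.1:8b as a stand-in for any 7B-class Q4 model (the Ollama app from ollama.com works just as well):

  # Install and start Ollama (or download the app from ollama.com)
  brew install ollama
  brew services start ollama

  # Pull and run a 7-8B model; --verbose prints timing stats after each response
  ollama run --verbose llama3.1:8b "Explain unified memory in two sentences."

  # Check the "eval rate" line in the output: anything under ~15 tok/s on an M2
  # or newer points at thermals or background load, not configuration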

When this is done you should have
Ollama running on macOS with a 7B-class model, using the Metal GPU backend, generating tokens at expected speed for your chip.
Read next: Ollama Setup

Move up to a 14B-32B model

This is the milestone where Apple Silicon starts to look different from a 4090. A 32B model in Q4 is roughly 18-20GB; an M2/M3/M4 Pro or Max with 32GB+ unified memory runs it comfortably, and 24GB is possible but tight against the (X − 8) heuristic. A 4090's 24GB of VRAM can't hold it plus a useful context window without offloading to system RAM. This is the "unified memory makes big models accessible" advantage in practice.

Throughput will be lower than a 4090 on the same model. That's fine. The relevant question is whether the model fits and runs at usable speed for your tasks.
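
One way to do it with the same Ollama setup (qwen2.5:32b is one example tag in the Ollama library; the default pull is a Q4 quant of roughly 20GB):

  # ~20GB download; check free memory before pulling
  ollama pull qwen2.5:32b

  # Run with --verbose and write down the eval rate it reports
  ollama run --verbose qwen2.5:32b "Summarize the tradeoffs of 4-bit quantization."

  # Keep Activity Monitor open (GPU History plus the Memory tab) while it generates;
  # the GPU should saturate and memory pressure should stay out of the red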

When this is done you should have
A 14B or 32B-class model running on your machine. Tokens per second written down. Activity Monitor showing both GPU and Memory usage.

Switch to MLX-LM for the production path

MLX is Apple's own ML framework. MLX-LM is the LLM-specific runtime built on top of it. The same model in MLX-4bit (or MLX-8bit) is typically 10-30% faster than the GGUF equivalent through Ollama, and it uses Apple's compute framework rather than llama.cpp's. This is the upgrade from "easy mode" to "the best inference Apple Silicon can give you today."

Convert or download MLX-quantized weights. The mlx-community organization on Hugging Face hosts MLX variants of most popular models; if yours isn't there, mlx-lm includes a one-line conversion tool that works from the original safetensors.
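
A sketch of the MLX-LM flow, assuming a working Python environment; the mlx-community repo below is illustrative, so substitute whichever model you ran in the last milestone:

  # Install the runtime
  pip install mlx-lm

  # If a pre-quantized MLX variant exists, generate with it directly;
  # mlx_lm.generate prints a generation tokens/s figure to compare against Ollama's eval rate
  mlx_lm.generate --model mlx-community/Qwen2.5-32B-Instruct-4bit \
    --prompt "Summarize the tradeoffs of 4-bit quantization." --max-tokens 256

  # If it doesn't, convert and quantize from the original safetensors (4-bit by default)
  mlx_lm.convert --hf-path Qwen/Qwen2.5-32B-Instruct -q

  # Serve an OpenAI-compatible endpoint for the later milestones
  mlx_lm.server --model mlx-community/Qwen2.5-32B-Instruct-4bit --port 8080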

When this is done you should have
MLX-LM serving the same model with a measurable speedup over Ollama Metal on your chip. The OpenAI-compatible mlx-lm server running.

Pick the right MLX quant for your chip

MLX uses its own quant format (mlx-4bit, mlx-8bit). The heuristic is the same as for GGUF: 4-bit is the default, 8-bit for accuracy-critical workloads, and FP16 only for development work where you're comparing against a ground-truth baseline. Don't run FP16 in production on an M-series; the bandwidth math doesn't favor it.
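
If you're converting weights yourself, the bit width is a flag on the same conversion tool (repo name illustrative):

  # 4-bit: the default when -q is passed, and the speed/quality sweet spot
  mlx_lm.convert --hf-path Qwen/Qwen2.5-14B-Instruct -q

  # 8-bit: accuracy-critical work, if the memory headroom is there
  mlx_lm.convert --hf-path Qwen/Qwen2.5-14B-Instruct -q --q-bits 8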

When this is done you should have
Working knowledge: 4-bit MLX is the default speed/quality sweet spot, 8-bit if accuracy matters and you have memory headroom.

Front the local server with a real client

LM Studio is a Mac-native app and the easiest path. Open WebUI runs in a Docker container and gives you the ChatGPT-style web UI. Either one points at the mlx-lm OpenAI-compatible server. Pick one.
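
A sketch of the Open WebUI route, assuming the mlx-lm server from the previous milestone is listening on port 8080 (host.docker.internal lets the container reach a server running on the Mac itself):

  # Sanity-check the endpoint first
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mlx-community/Qwen2.5-32B-Instruct-4bit", "messages": [{"role": "user", "content": "hello"}]}'

  # Point Open WebUI at it, then open http://localhost:3000
  # (if the container can't reach the server, restart mlx_lm.server with --host 0.0.0.0)
  docker run -d -p 3000:8080 \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
    -v open-webui:/app/backend/data \
    ghcr.io/open-webui/open-webui:main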

When this is done you should have
LM Studio, Open WebUI, or your editor talking to your local mlx-lm endpoint. End-to-end local AI working from a UI.

Decide if you want to push to the chip's limit

If you have an M2/M3/M4 Ultra with 128GB+ unified memory, 70B-class models in MLX-4bit are practical. The Apple Silicon AI stack page documents the recipe end-to-end. For multi-Mac setups (rare, but possible), the multi-machine cluster stack walks layer-sharding via Exo over Thunderbolt.
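
For the Ultra tier, the flow is the same as the earlier MLX-LM milestone, just with bigger weights (repo name illustrative; a 70B model in 4-bit occupies roughly 40GB of unified memory before context):

  mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit \
    --prompt "Which model are you, and how many parameters do you have?" --max-tokens 128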

For everyone else: stop at 32B-class. The Mac is a beautiful daily driver for that size; pushing past it costs more in waiting than it gives in capability.

When this is done you should have
A clear answer: stay at 32B-class for daily driver, or step up to 70B-class if you have an Ultra (M2/M3/M4 Ultra).

Next recommended step

The reference recipe: M-series + MLX-LM + Open WebUI + agent layer. Setup commands, expected outcome, and failure modes.