Metal Allocator: out of memory on Apple Silicon
Cause
Environment: Apple Silicon (M1/M2/M3/M4 family) running MLX or llama.cpp with Metal backend.
Severity: medium — recoverable with smaller model / context.
- macOS reserves a soft ceiling on GPU memory (~70-75% of physical RAM); Metal allocator OOMs even when
vm_statshows "free" pages because those are wired to other processes - Unified memory is shared with CPU + Neural Engine; Chrome/Xcode/Spotify can claim 6-10 GB unnoticed
- KV cache scales linearly with context — 32K context on a 32B model can double total footprint
- MLX's default memory limit is conservative; llama.cpp's Metal backend doesn't auto-back-off
Solution
1. Drop to a smaller quant (the biggest single win):
# 32B at Q8 ~ 35 GB. Q4_K_M ~ 19 GB. On a 32 GB Mac, Q4 fits, Q8 doesn't.
mlx_lm.convert --hf-path Qwen/Qwen2.5-32B-Instruct -q --q-bits 4
2. Reduce context — KV cache cost is linear in context length:
mlx_lm.generate --model qwen2.5-32b-mlx-4bit --max-tokens 512 \
--max-kv-size 8192
3. Raise MLX's GPU memory ceiling explicitly:
import mlx.core as mx
mx.metal.set_memory_limit(int(0.85 * 32 * 1024**3)) # 85% of 32 GB
mx.metal.set_cache_limit(0) # disable cache, free more for tensors
4. Close memory hogs. Activity Monitor → Memory → sort by Memory. Chrome (4-8 GB), Xcode indexing (2-4 GB), Slack (500 MB). Quit them.
5. Prefer llama.cpp Metal vs MLX if MLX OOMs on a workload llama.cpp tolerates — they have different allocator strategies:
./llama-cli -m model.gguf --n-gpu-layers 999 --ctx-size 8192
Alternative solutions
Caveat: unlike CUDA, Metal gives no clear OOM error during long-running generation — you may instead see "command buffer execution failed" mid-stream. Treat that as the same root cause and apply the same fixes.
Related errors
Did this fix it?
If your case was different, email Contact support with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.