What causes "Metal Allocator: out of memory on Apple Silicon"?

**Environment:** Apple Silicon (M1/M2/M3/M4 family) running [MLX](/tools/mlx-lm) or [llama.cpp](/tools/llama-cpp) with Metal backend. **Severity: medium** — recoverable with smaller model / context. - macOS reserves a soft ceiling on GPU memory (~70-75% of physical RAM); Metal allocator OOMs even when `vm_stat` shows "free" pages because those are wired to other processes - Unified memory is shared with CPU + Neural Engine; Chrome/Xcode/Spotify can claim 6-10 GB unnoticed - KV cache scales linearly with context — 32K context on a 32B model can double total footprint - MLX's default memory limit is conservative; llama.cpp's Metal backend doesn't auto-back-off

Metal / Apple Silicon

Verified by owner

Metal Allocator: out of memory on Apple Silicon

Q: How do you fix "Metal Allocator: out of memory on Apple Silicon"?

**1. Drop to a smaller quant** (the biggest single win): ```bash # 32B at Q8 ~ 35 GB. Q4_K_M ~ 19 GB. On a 32 GB Mac, Q4 fits, Q8 doesn't. mlx_lm.convert --hf-path Qwen/Qwen2.5-32B-Instruct -q --q-bits 4 ``` **2. Reduce context** — KV cache cost is linear in context length: ```bash mlx_lm.generate --model qwen2.5-32b-mlx-4bit --max-tokens 512 \ --max-kv-size 8192 ``` **3. Raise MLX's GPU memory ceiling explicitly:** ```python import mlx.core as mx mx.metal.set_memory_limit(int(0.85 * 32 * 1024**3)) # 85% of 32 GB mx.metal.set_cache_limit(0) # disable cache, free more for tensors ``` **4. Close memory hogs.** `Activity Monitor → Memory → sort by Memory`. Chrome (4-8 GB), Xcode indexing (2-4 GB), Slack (500 MB). Quit them. **5. Prefer llama.cpp Metal vs MLX** if MLX OOMs on a workload llama.cpp tolerates — they have different allocator strategies: ```bash ./llama-cli -m model.gguf --n-gpu-layers 999 --ctx-size 8192 ```

[METAL] Metal Allocator: out of memory (Allocation size X exceeds available)

By Fredoline Eruo · Last verified Jun 12, 2026

Cause

Environment: Apple Silicon (M1/M2/M3/M4 family) running MLX or llama.cpp with Metal backend.

Severity: medium — recoverable with smaller model / context.

macOS reserves a soft ceiling on GPU memory (~70-75% of physical RAM); Metal allocator OOMs even when vm_stat shows "free" pages because those are wired to other processes
Unified memory is shared with CPU + Neural Engine; Chrome/Xcode/Spotify can claim 6-10 GB unnoticed
KV cache scales linearly with context — 32K context on a 32B model can double total footprint
MLX's default memory limit is conservative; llama.cpp's Metal backend doesn't auto-back-off

Solution

1. Drop to a smaller quant (the biggest single win):

# 32B at Q8 ~ 35 GB. Q4_K_M ~ 19 GB. On a 32 GB Mac, Q4 fits, Q8 doesn't.
mlx_lm.convert --hf-path Qwen/Qwen2.5-32B-Instruct -q --q-bits 4

2. Reduce context — KV cache cost is linear in context length:

mlx_lm.generate --model qwen2.5-32b-mlx-4bit --max-tokens 512 \
  --max-kv-size 8192

3. Raise MLX's GPU memory ceiling explicitly:

import mlx.core as mx
mx.metal.set_memory_limit(int(0.85 * 32 * 1024**3))  # 85% of 32 GB
mx.metal.set_cache_limit(0)  # disable cache, free more for tensors

4. Close memory hogs. Activity Monitor → Memory → sort by Memory. Chrome (4-8 GB), Xcode indexing (2-4 GB), Slack (500 MB). Quit them.

5. Prefer llama.cpp Metal vs MLX if MLX OOMs on a workload llama.cpp tolerates — they have different allocator strategies:

./llama-cli -m model.gguf --n-gpu-layers 999 --ctx-size 8192

Alternative solutions

Caveat: unlike CUDA, Metal gives no clear OOM error during long-running generation — you may instead see "command buffer execution failed" mid-stream. Treat that as the same root cause and apply the same fixes.

Related errors

Did this fix it?

If your case was different, email Contact support with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.