RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.



Metal / Apple Silicon
Verified by owner

Metal Allocator: out of memory on Apple Silicon

[METAL] Metal Allocator: out of memory (Allocation size X exceeds available)
By Fredoline Eruo · Last verified Jun 12, 2026

Cause

Environment: Apple Silicon (M1/M2/M3/M4 family) running MLX or llama.cpp with Metal backend.

Severity: medium — recoverable with smaller model / context.

  • macOS puts a soft ceiling on GPU memory (~70-75% of physical RAM); the Metal allocator can OOM even when vm_stat still shows "free" pages, because it is bounded by that ceiling and by memory already wired to other processes, not by the free-page count
  • Unified memory is shared with CPU + Neural Engine; Chrome/Xcode/Spotify can claim 6-10 GB unnoticed
  • KV cache scales linearly with context: 32K context on a 32B model can add many GB on top of the weights, and without grouped-query attention it can double the total footprint
  • MLX's default memory limit is conservative; llama.cpp's Metal backend doesn't auto-back-off
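The first two causes can be turned into a back-of-the-envelope budget. The sketch below is a rough model, not how macOS actually accounts for memory: it assumes the usable figure is the smaller of the soft ceiling and whatever other apps have left over. The function name `gpu_budget_gb` and the default 0.70 fraction are illustrative choices taken from the range quoted above.

```python
def gpu_budget_gb(ram_gb: float, other_apps_gb: float = 0.0,
                  ceiling_fraction: float = 0.70) -> float:
    """Rough GB available to the model under the simplifying assumptions above."""
    ceiling = ceiling_fraction * ram_gb        # soft GPU working-set ceiling
    left_after_apps = ram_gb - other_apps_gb   # unified memory not held by other apps
    return max(0.0, min(ceiling, left_after_apps))


if __name__ == "__main__":
    # Hypothetical 32 GB machine with increasing background load.
    for apps in (0, 6, 10, 14):
        budget = gpu_budget_gb(32, apps)
        print(f"32 GB Mac, {apps:>2} GB in other apps: ~{budget:.1f} GB for the model")
```

With nothing else running, the ceiling is the binding limit; once background apps hold more than about 30% of RAM, they become the limit instead.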

Solution

1. Drop to a smaller quant (the biggest single win):

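# 32B at Q8 ~ 35 GB. Q4_K_M ~ 19 GB. On a 32 GB Mac, Q4 fits, Q8 doesn't.
# --mlx-path names the output directory so later commands can reference it.
mlx_lm.convert --hf-path Qwen/Qwen2.5-32B-Instruct -q --q-bits 4 \
  --mlx-path qwen2.5-32b-mlx-4bit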
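The sizes quoted in the comment above follow from parameter count times bits per weight. Here is a minimal sketch of that arithmetic, assuming roughly 32.5B parameters and approximate effective bit widths (about 8.5 for Q8_0, about 4.85 for Q4_K_M, about 4.5 for MLX 4-bit with per-group scales). These figures are approximations for illustration, not measured file sizes, and the 0.70 ceiling is taken from the Cause section.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 32.5e9          # assumed parameter count for a 32B-class model
    budget_gb = 0.70 * 32      # assumed soft GPU ceiling on a 32 GB Mac
    # Effective bits per weight are approximate and include quantization metadata.
    for name, bits in (("Q8_0", 8.5), ("Q4_K_M", 4.85), ("MLX 4-bit", 4.5)):
        size = weights_gb(n_params, bits)
        verdict = "fits" if size < budget_gb else "does not fit"
        print(f"{name:>9}: ~{size:.1f} GB, {verdict} in ~{budget_gb:.1f} GB")
```

The estimate covers weights only; the KV cache and runtime buffers still need room on top of it.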

2. Reduce context — KV cache cost is linear in context length:

mlx_lm.generate --model qwen2.5-32b-mlx-4bit --max-tokens 512 \
  --max-kv-size 8192
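To see what the context reduction buys, the KV cache size can be estimated from the model's attention shape. The sketch below assumes an fp16 cache and a hypothetical 32B-class configuration (64 layers, 8 KV heads, head dimension 128); read the real values from your model's config.json before trusting the numbers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for context_len tokens."""
    # Factor of 2: one tensor for keys, one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumed shape of a 32B-class model with grouped-query attention.
    layers, kv_heads, head_dim = 64, 8, 128
    for ctx in (8192, 16384, 32768):
        size = kv_cache_bytes(layers, kv_heads, head_dim, ctx)
        print(f"context {ctx:>6}: {size / GIB:.1f} GiB of KV cache")
    # Same model shape without GQA (40 KV heads instead of 8), for comparison.
    no_gqa = kv_cache_bytes(layers, 40, head_dim, 32768)
    print(f"context  32768 without GQA: {no_gqa / GIB:.1f} GiB")
```

Under these assumptions, going from 32K to 8K context frees about 6 GiB, which is often the difference between fitting and not fitting on a 32 GB machine.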

3. Raise MLX's GPU memory ceiling explicitly:

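import mlx.core as mx

# Newer MLX releases expose these as mx.set_memory_limit / mx.set_cache_limit;
# the mx.metal.* names below are the older spelling.
mx.metal.set_memory_limit(int(0.85 * 32 * 1024**3))  # 85% of 32 GB, in bytes
mx.metal.set_cache_limit(0)  # disable the buffer cache, free more for tensors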
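The snippet above hard-codes 32 GB. A variant that derives the limit from the machine's actual RAM might look like the sketch below. It assumes `os.sysconf` reports `SC_PHYS_PAGES` on your system, which it commonly does on macOS and Linux. The helper names are illustrative, and the MLX call only runs if MLX is installed.

```python
import os


def physical_ram_bytes() -> int:
    """Total physical RAM as reported by the OS."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def memory_limit_bytes(total_bytes: int, fraction: float = 0.85) -> int:
    """Byte limit equal to the given fraction of total RAM."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(fraction * total_bytes)


if __name__ == "__main__":
    limit = memory_limit_bytes(physical_ram_bytes())
    print(f"requested MLX memory limit: {limit / 1024**3:.1f} GiB")
    try:
        import mlx.core as mx
    except ImportError:
        mx = None  # MLX not installed; nothing to configure
    if mx is not None:
        # Use the top-level function when present, else the older mx.metal one.
        setter = getattr(mx, "set_memory_limit", None) or mx.metal.set_memory_limit
        setter(limit)
```

Leaving some headroom below 100% matters here: the OS and other apps still need unified memory while the model runs.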

4. Close memory hogs. Activity Monitor → Memory → sort by Memory. Chrome (4-8 GB), Xcode indexing (2-4 GB), Slack (500 MB). Quit them.

5. Try llama.cpp's Metal backend if MLX OOMs on your workload. The two use different allocator strategies, so one may fit where the other does not:

./llama-cli -m model.gguf --n-gpu-layers 999 --ctx-size 8192

Alternative solutions

Caveat: unlike CUDA, Metal gives no clear OOM error during long-running generation — you may instead see "command buffer execution failed" mid-stream. Treat that as the same root cause and apply the same fixes.
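Since neither error message triggers an automatic retry, a wrapper script can treat both as the same signal and back off on context size. The sketch below is one way to do that. `generate` is a hypothetical callable standing in for whatever runs your model, and the assumption that failures surface as `RuntimeError` with these substrings may not hold for every runtime.

```python
OOM_MARKERS = ("out of memory", "command buffer execution failed")


def looks_like_metal_oom(message: str) -> bool:
    """True if the message matches one of the OOM-style errors described above."""
    text = message.lower()
    return any(marker in text for marker in OOM_MARKERS)


def run_with_backoff(generate, ctx_size: int, min_ctx: int = 1024):
    """Call generate(ctx_size), halving ctx_size after each OOM-like failure.

    Returns (ctx_size_used, result). Re-raises errors that are not OOM-like,
    and the last OOM once halving would drop below min_ctx.
    """
    while True:
        try:
            return ctx_size, generate(ctx_size)
        except RuntimeError as err:
            if not looks_like_metal_oom(str(err)) or ctx_size // 2 < min_ctx:
                raise
            ctx_size //= 2


if __name__ == "__main__":
    def fake_generate(ctx):
        # Stand-in for a real model call: pretend anything above 8K runs out of memory.
        if ctx > 8192:
            raise RuntimeError("[METAL] command buffer execution failed")
        return "ok"

    print(run_with_backoff(fake_generate, 32768))
```

Halving is a blunt policy, but it converges in a few attempts and keeps the context at a size that maps cleanly onto the `--ctx-size` and `--max-kv-size` values used earlier.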

Related errors

  • MLX / Metal: command buffer execution failed
  • Apple Silicon: RuntimeError: MPS backend out of memory
  • Metal allocation failed — Apple Silicon OOM under unified memory pressure
  • MLX: Memory pressure detected — consider reducing batch size

Did this fix it?

If your case was different, contact support with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.