llama.cpp Metal crash — when Apple Silicon inference fails
Most Metal crashes in llama.cpp on Apple Silicon trace to an over-aggressive context size, an outdated GGUF format, or a model whose tensor shapes have no Metal kernel support yet. Diagnostic and fix order below.
Diagnostic order — most likely first
Context window exceeds usable unified memory
Crash fires only at long contexts. Apple Silicon has no dedicated VRAM; the GPU draws from the same unified memory pool as macOS. Activity Monitor's memory pressure graph goes red right before the crash.
Lower context (`-c 4096` instead of `-c 32768`). Apple Silicon's unified memory is shared with macOS; budget ~70% for model + KV cache, ~30% for system.
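A minimal invocation with a reduced context, assuming a local GGUF at `./models/model.gguf` (path and prompt are placeholders):

```bash
# -c caps the context (and thus the KV cache); -ngl 99 offloads all layers to Metal
./llama-cli -m ./models/model.gguf -c 4096 -ngl 99 -p "Hello"
```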
GGUF file is an older format the llama.cpp Metal backend rejects
`./llama-cli` errors before any inference: 'unknown GGUF version' or 'tensor not found in metal library'.
Re-download the GGUF from a recent source (HuggingFace user `bartowski` ships current builds). Or convert from safetensors with a current `convert-hf-to-gguf.py`.
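If you'd rather convert than re-download, a sketch of the conversion path from a local safetensors checkout (directory name and quant type are examples):

```bash
# from the llama.cpp repo root: convert to an f16 GGUF, then quantize
python convert-hf-to-gguf.py ./Mistral-7B-Instruct --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```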
llama.cpp built without the Metal flag
`./llama-cli` runs but is slow (CPU fallback). Output mentions 'CPU buffer' or 'no Metal device.'
Rebuild: `make clean && LLAMA_METAL=1 make`. For CMake: `-DGGML_METAL=ON`. Verify with `./llama-cli --help` listing Metal flags.
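For example, a CMake rebuild with Metal explicitly enabled (recent llama.cpp enables it by default on macOS; shown here for clarity):

```bash
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
# startup log should print a ggml_metal_init line instead of falling back to CPU
./build/bin/llama-cli -m ./models/model.gguf -p "test" -n 16
```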
Tensor shape unsupported by Metal kernels (rare model)
Crash at first token. Model loads but inference fails on the first matmul. Often happens with new architecture variants.
Update llama.cpp to HEAD. If still failing, check the model's PR thread on the llama.cpp GitHub — Metal kernel coverage lags CUDA by 1-4 weeks for new architectures.
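Updating and rebuilding, assuming a git checkout using the Makefile path from the previous step:

```bash
# pull the latest master and rebuild with the Metal backend
git pull
make clean && LLAMA_METAL=1 make
```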
macOS GPU policy / external display routing eating memory
Quick check: close all GPU-heavy apps (Final Cut, Photoshop, Chrome with hardware acceleration). If memory pressure drops noticeably and the crash stops, this is the cause.
Run inference with those apps closed. For headless servers, make sure no idle Metal-using process is holding unified memory. `memory_pressure` (no arguments) reports the current pressure level; `sudo memory_pressure -l warn` simulates warning-level pressure for testing.
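Two quick checks before launching inference on a loaded machine (both are standard macOS tools; flag values are illustrative):

```bash
# system-wide memory pressure summary, last few lines include the free percentage
memory_pressure | tail -n 4
# one sample of the top 10 processes by memory -- look for GPU-heavy apps
top -l 1 -o mem -n 10 -stats pid,command,mem
```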
Frequently asked questions
Should I use llama.cpp or MLX on Apple Silicon?
MLX is faster for Apple-native fine-tuning and roughly matches llama.cpp on quantized inference. llama.cpp has wider model coverage (nearly any GGUF on HuggingFace works out of the box). As of 2026: MLX for production fine-tuning, llama.cpp for general inference.
Why does my M4 Max with 64 GB run out of memory on a 70B Q4?
macOS reserves a chunk for system + UI (~8-12 GB on a desktop, more on laptops). 70B Q4 GGUF is ~40 GB on disk and needs ~45 GB at runtime including KV cache. On a 64 GB Mac that leaves ~7 GB headroom — usable for short context, fragile at long context.
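The budget as a back-of-the-envelope calculation (all numbers approximate and configuration-dependent):

```bash
TOTAL_GB=64        # unified memory
RESERVE_GB=12      # macOS + UI reserve (desktop estimate)
MODEL_GB=40        # 70B Q4 GGUF on disk, loaded roughly 1:1
KV_GB=5            # KV cache; grows with context length
echo "headroom: $((TOTAL_GB - RESERVE_GB - MODEL_GB - KV_GB)) GB"   # ~7 GB
```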
Is Metal slower than CUDA at the same VRAM tier?
Roughly comparable for token generation, further apart for prompt processing. M4 Max ~546 GB/s vs RTX 4090 ~1008 GB/s, so the bandwidth gap is real. Prompt processing (compute-bound) is 30-50% slower on the M4 Max; token generation (bandwidth-bound) is closer, about 70-90% of a 4090.
Related troubleshooting
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM errors, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: