llama.cpp Metal crash — when Apple Silicon inference fails
Most Metal crashes in llama.cpp on Apple Silicon trace to an over-aggressive context size, an outdated GGUF format, or a model whose tensor shapes have no Metal kernel support yet. Diagnostic and fix order below.
Diagnostic order — most likely first
Context window exceeds usable unified memory
Crash fires only at long contexts. Apple Silicon has no dedicated VRAM; the GPU draws from the same unified memory pool as macOS. Activity Monitor's memory pressure graph goes red right before the crash.
Lower context (`-c 4096` instead of `-c 32768`). Apple Silicon's unified memory is shared with macOS; budget ~70% for model + KV cache, ~30% for system.
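A minimal invocation with a reduced context, assuming a local GGUF at `./models/model.gguf` (path and prompt are placeholders):

```bash
# -c caps the context (and thus the KV cache); -ngl 99 offloads all layers to Metal
./llama-cli -m ./models/model.gguf -c 4096 -ngl 99 -p "Hello"
```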
GGUF file is an older format the llama.cpp Metal backend rejects
`./llama-cli` errors before any inference: 'unknown GGUF version' or 'tensor not found in metal library'.
Re-download the GGUF from a recent source (HuggingFace user `bartowski` ships current builds). Or convert from safetensors with a current `convert-hf-to-gguf.py`.
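If you'd rather convert than re-download, a sketch of the conversion path from a local safetensors checkout (directory name and quant type are examples):

```bash
# from the llama.cpp repo root: convert to an f16 GGUF, then quantize
python convert-hf-to-gguf.py ./Mistral-7B-Instruct --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```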
llama.cpp built without the Metal flag
`./llama-cli` runs but is slow (CPU fallback). Output mentions 'CPU buffer' or 'no Metal device.'
Rebuild: `make clean && LLAMA_METAL=1 make`. For CMake: `-DGGML_METAL=ON`. Verify with `./llama-cli --help` listing Metal flags.
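For example, a CMake rebuild with Metal explicitly enabled (recent llama.cpp enables it by default on macOS; shown here for clarity):

```bash
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
# startup log should print a ggml_metal_init line instead of falling back to CPU
./build/bin/llama-cli -m ./models/model.gguf -p "test" -n 16
```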
Tensor shape unsupported by Metal kernels (rare model)
Crash at first token. Model loads but inference fails on the first matmul. Often happens with new architecture variants.
Update llama.cpp to HEAD. If still failing, check the model's PR thread on the llama.cpp GitHub — Metal kernel coverage lags CUDA by 1-4 weeks for new architectures.
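Updating and rebuilding, assuming a git checkout using the Makefile path from the previous step:

```bash
# pull the latest master and rebuild with the Metal backend
git pull
make clean && LLAMA_METAL=1 make
```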
macOS GPU policy / external display routing eating memory
Quick check: close all GPU-heavy apps (Final Cut, Photoshop, Chrome with hardware acceleration). If memory pressure drops noticeably and the crash stops, this is the cause.
Run inference with those apps closed. For headless servers, make sure no idle Metal-using process is holding unified memory. `memory_pressure` (no arguments) reports the current pressure level; `sudo memory_pressure -l warn` simulates warning-level pressure for testing.
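Two quick checks before launching inference on a loaded machine (both are standard macOS tools; flag values are illustrative):

```bash
# system-wide memory pressure summary, last few lines include the free percentage
memory_pressure | tail -n 4
# one sample of the top 10 processes by memory -- look for GPU-heavy apps
top -l 1 -o mem -n 10 -stats pid,command,mem
```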
Frequently asked questions
Should I use llama.cpp or MLX on Apple Silicon?
MLX is faster for Apple-native fine-tuning and roughly matches llama.cpp on quantized inference. llama.cpp has wider model coverage (nearly any GGUF on HuggingFace works out of the box). As of 2026: MLX for production fine-tuning, llama.cpp for general inference.
Why does my M4 Max with 64 GB run out of memory on a 70B Q4?
macOS reserves a chunk for system + UI (~8-12 GB on a desktop, more on laptops). 70B Q4 GGUF is ~40 GB on disk and needs ~45 GB at runtime including KV cache. On a 64 GB Mac that leaves ~7 GB headroom — usable for short context, fragile at long context.
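The budget as a back-of-the-envelope calculation (all numbers approximate and configuration-dependent):

```bash
TOTAL_GB=64        # unified memory
RESERVE_GB=12      # macOS + UI reserve (desktop estimate)
MODEL_GB=40        # 70B Q4 GGUF on disk, loaded roughly 1:1
KV_GB=5            # KV cache; grows with context length
echo "headroom: $((TOTAL_GB - RESERVE_GB - MODEL_GB - KV_GB)) GB"   # ~7 GB
```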
Is Metal slower than CUDA at the same VRAM tier?
Roughly comparable for token generation, further apart for prompt processing. M4 Max ~546 GB/s vs RTX 4090 ~1008 GB/s, so the bandwidth gap is real. Prompt processing (compute-bound) is 30-50% slower on the M4 Max; token generation (bandwidth-bound) is closer, about 70-90% of a 4090.
Related troubleshooting
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM errors, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: