14. Troubleshooting macOS AI
This chapter covers the most common failure modes in one place so you can diagnose fast.
Symptom: Model loads but runs at 1–3 tokens per second
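Diagnosis: Confirm that Metal is active (Chapter 4). Check memory pressure: the Memory Pressure graph in Activity Monitor is yellow or red, and the memory_pressure command reports a low system-wide free percentage. Check GPU usage in Activity Monitor (Window > GPU History); it is likely under 5%. Together these signs mean inference has fallen back to the CPU, either because the model does not fit in memory or because Metal was never loaded.
Fix: Reduce the context window size. Switch to a smaller quantization (Q5_K_M instead of Q8_0). Use a smaller model. Close other applications to free memory.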
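To see why shrinking the context window helps, a rough estimate of the key/value cache is useful. The sketch below assumes a standard transformer with grouped-query attention and an unquantized 16-bit cache. kv_cache_bytes is a hypothetical helper, and the example dimensions (32 layers, 8 KV heads, head size 128) are illustrative values for an 8B-class model, not read from any particular file. Real runtimes add overhead on top of this figure.

```shell
# Rough KV cache size: 2 (keys and values) x layers x KV heads x head size
# x context length x bytes per element. Assumes an unquantized cache.
kv_cache_bytes() {
  n_layers="$1"
  n_kv_heads="$2"
  head_dim="$3"
  n_ctx="$4"
  bytes_per_elem="${5:-2}"   # 2 bytes = 16-bit cache (assumption)
  echo $(( 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem ))
}

# Example (illustrative dimensions):
#   kv_cache_bytes 32 8 128 8192   # 1073741824 bytes, i.e. 1 GiB
#   kv_cache_bytes 32 8 128 2048   # 268435456 bytes, i.e. 256 MiB
```

Under these assumptions the cache grows linearly with context length, so cutting the context from 8192 to 2048 tokens frees roughly three quarters of a gibibyte.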
Symptom: "Metal not found" error in llama.cpp
Diagnosis: The binary was compiled without Metal support. This typically happens with a pre-built binary (for example from GitHub releases) that was not built for macOS with Metal enabled.
Fix: Build llama.cpp from source with the CMake option -DGGML_METAL=ON, or use Ollama, which bundles a Metal-enabled runtime. The CMAKE_ARGS="-DGGML_METAL=ON" form applies when installing llama-cpp-python through pip, where the variable is passed on to CMake.
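A source build with Metal enabled might look like the following. This is a minimal sketch that assumes the llama.cpp sources are already cloned and CMake 3.13 or newer is installed. build_llama_metal and the DRY_RUN switch are hypothetical conveniences, not part of llama.cpp.

```shell
# Configure and build llama.cpp with the Metal backend enabled.
# With DRY_RUN set, the commands are printed instead of executed.
build_llama_metal() {
  src_dir="${1:-llama.cpp}"   # path to the cloned sources (assumption)
  run="${DRY_RUN:+echo}"      # expands to "echo" only when DRY_RUN is set
  $run cmake -S "$src_dir" -B "$src_dir/build" -DGGML_METAL=ON
  $run cmake --build "$src_dir/build" --config Release
}

# Preview the commands first, then run them for real:
#   DRY_RUN=1 build_llama_metal ~/src/llama.cpp
#   build_llama_metal ~/src/llama.cpp
```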
Symptom: Ollama runs in CLI but API returns connection refused
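Diagnosis: The Ollama CLI and the API both talk to the same local server process. On macOS the CLI can launch that server for you, so a working CLI session does not guarantee that a server is still listening later, or that it is listening where your client expects. Connection refused means nothing is accepting connections at the address you called: the server has exited, or it is bound to a different host or port (the default is 127.0.0.1:11434, which OLLAMA_HOST overrides).
Fix: Run ollama serve in a dedicated terminal tab and leave it running, then make API calls against the address the server reports.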
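A quick reachability check separates "server not running" from problems in your own client code. The sketch below assumes curl is installed and that the server uses the default address. ollama_up is a hypothetical helper that probes the /api/version endpoint.

```shell
# Succeeds only if an Ollama server answers at the given base URL.
ollama_up() {
  base_url="${1:-http://127.0.0.1:11434}"   # default address (assumption)
  curl --silent --fail --max-time 2 "$base_url/api/version" >/dev/null
}

# Usage:
#   if ollama_up; then echo "server reachable"; else echo "start it with: ollama serve"; fi
```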
Symptom: Model file downloads successfully but fails to load
Diagnosis: The download is corrupted or incomplete, or the quantization is not suited to your hardware.
Fix:
# Verify file size
ls -lh ~/.ollama/models/blobs/*
# Remove and re-download
rm ~/.ollama/models/blobs/<problematic-hash>
ollama pull <model-name>
Symptom: MLX process killed with no error message
Diagnosis: macOS killed the process because the system ran out of memory. No Python exception is raised: the kernel sends SIGKILL, which a process cannot catch, so nothing is printed.
Fix: Use a smaller model or reduce the batch size. To confirm the cause, search the recent system log with log show --predicate 'eventMessage contains "Killed"' --last 5m and look for kernel out-of-memory entries.
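Because a SIGKILL leaves no traceback, the exit status is often the only immediate clue: a shell reports 128 plus the signal number, so SIGKILL (signal 9) appears as 137. The wrapper below is a minimal sketch. run_and_report is a hypothetical helper, and a 137 status can also come from a manual kill -9, so treat it as a hint rather than proof.

```shell
# Run a command and flag the exit status that indicates SIGKILL.
run_and_report() {
  status=0
  "$@" || status=$?
  if [ "$status" -eq 137 ]; then
    echo "exit 137: process received SIGKILL, possibly out of memory" >&2
  fi
  return "$status"
}

# Usage (script name is illustrative):
#   run_and_report python generate.py
```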
Symptom: High CPU usage but model is not responding
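Diagnosis: The model is swapping. Memory pressure has forced part of its memory out to disk, so the CPU stays busy while each token waits on disk reads.
Fix: Reduce the model size or the context window. Check Swap Used in Activity Monitor's Memory tab, or run sysctl vm.swapusage.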
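Swap usage can also be read from the command line. The sketch below assumes that sysctl vm.swapusage prints a line of the form "vm.swapusage: total = 2048.00M  used = 1024.50M  free = 1023.50M  (encrypted)". swap_used_mb is a hypothetical helper that extracts the used figure from that line on standard input.

```shell
# Extract the "used" value in megabytes from `sysctl vm.swapusage` output.
swap_used_mb() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i == "used") {
        v = $(i + 2)       # skip the "=" that follows "used"
        sub(/M$/, "", v)   # drop the trailing unit
        print v
      }
    }
  }'
}

# Usage on macOS:
#   sysctl vm.swapusage | swap_used_mb
```

A value that keeps climbing while the model generates is a strong sign that the working set no longer fits in RAM.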
Practice by triggering each failure mode on purpose: run a large model with a very large context window (it runs out of memory), call the API before starting the Ollama server (connection refused), and check GPU utilization while running a build without Metal support. Once you know what each failure looks like, you can diagnose it in seconds.