MLX vs Ollama vs llama.cpp — Local AI on macOS (Chapter 6)

These three runtimes serve different roles, and understanding the trade-offs prevents wasted time. The comparison below is about architecture, not output quality.

Runtime	Metal Support	Model Format	Install Complexity	Best For
Ollama	Yes (via llama.cpp)	GGUF	Low (single binary)	Quick start, API server, broad model support
llama.cpp	Yes (compile-time flag)	GGUF	Medium (build from source)	Maximum control, research, custom quantizations
MLX	Native	MLX (safetensors weights)	Medium (Python package)	Best Apple Silicon performance, fine-tuning

Ollama wraps llama.cpp and adds model management, an API server, and a CLI. It is the right choice when you want to run a model and move on: it handles downloads, versioning, and serving. The trade-off is that you have less visibility into what is happening at the compute level.
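As a rough illustration of the API server, the sketch below builds a non-streaming request for Ollama's /api/generate endpoint using only the Python standard library. It assumes Ollama is listening on its default local port (11434). The helper names and the model tag in the usage comment are placeholders chosen for this example, not part of Ollama itself.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str,
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout_s: float = 120.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as reply:
        return json.loads(reply.read())["response"]


# Usage (requires a running server and a model you have already pulled;
# "llama3.2" is only a placeholder tag):
#   print(generate("llama3.2", "Explain GGUF in one sentence."))
```

Keeping request construction separate from sending makes the payload easy to inspect without a server running.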

llama.cpp is the foundation. It is written in C/C++, compiles on most platforms, and supports the widest range of quantization formats. If you need to run a model with a custom quantization or experiment with GGUF variants, you are in llama.cpp territory. To build it with Metal support on macOS:

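# Fetch the source and configure an out-of-tree build with Metal enabled
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake -DGGML_METAL=ON -DGGML_NATIVE=ON ..
# macOS has no nproc by default; sysctl reports the logical core count
cmake --build . --config Release -j "$(sysctl -n hw.ncpu)"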

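MLX wins on raw throughput for Apple Silicon workloads. A 7B model quantized to 4 bits in MLX typically produces 30–60 tokens/s on an M3 Pro, where the same model as a Q4_K_M GGUF in Ollama produces 15–30 tokens/s. Q4_K_M is a GGUF quantization type, so the MLX figure refers to MLX's own 4-bit format; the two are similar in size but not identical. The gap widens with larger models.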
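Throughput figures like these are easy to check on your own machine. The sketch below shows one way to compute tokens/s. It relies on the eval_count and eval_duration (nanoseconds) fields that Ollama includes in a non-streaming /api/generate response. The function names are made up for this example, and timed_decode_rate is a generic wall-clock fallback for any runtime that can report how many tokens it generated.

```python
import time
from typing import Callable


def tokens_per_second(token_count: int, duration_ns: int) -> float:
    """Convert a token count and a duration in nanoseconds to tokens/s."""
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    return token_count / (duration_ns / 1e9)


def ollama_decode_rate(response: dict) -> float:
    """Decode rate from an Ollama generate response.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return tokens_per_second(response["eval_count"], response["eval_duration"])


def timed_decode_rate(generate: Callable[[], int]) -> float:
    """Wall-clock rate for any callable that returns its generated token count.

    This includes prompt processing time, so it will read lower than a
    decode-only figure for the same run.
    """
    start = time.perf_counter_ns()
    token_count = generate()
    elapsed_ns = time.perf_counter_ns() - start
    return tokens_per_second(token_count, elapsed_ns)
```

When comparing runtimes, use the same prompt, the same output length, and the same kind of measurement on both sides; mixing decode-only and wall-clock numbers will exaggerate or hide the gap.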

The practical choice: start with Ollama for any quick experiment. Graduate to MLX for any serious throughput requirement on Apple Silicon. Use llama.cpp when you need a quantization format or configuration that neither Ollama nor MLX supports.
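The rule of thumb above can be written down as a tiny function, mostly to make the order of the checks explicit. This is only a sketch: the function name and its two boolean inputs are invented for illustration, and real decisions will involve more factors.

```python
def pick_runtime(needs_unsupported_format: bool,
                 needs_high_throughput: bool) -> str:
    """Encode the chapter's rule of thumb for choosing a runtime."""
    # A format or configuration the other two cannot handle decides first.
    if needs_unsupported_format:
        return "llama.cpp"
    # Otherwise, serious throughput needs on Apple Silicon point to MLX.
    if needs_high_throughput:
        return "MLX"
    # Default: the lowest-friction option for a quick experiment.
    return "Ollama"
```

The format check comes first because throughput is irrelevant if the runtime cannot load the model at all.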