06. MLX vs Ollama vs llama.cpp
These three runtimes serve different roles, and understanding the trade-offs prevents wasted time. This is a comparison of architecture, not quality.
| Runtime | Metal Support | Model Format | Install Complexity | Best For |
|---|---|---|---|---|
| Ollama | Yes (via llama.cpp) | GGUF | Low (single binary) | Quick start, API server, broad model support |
| llama.cpp | Yes (compile-time flag) | GGUF | Medium (build from source) | Maximum control, research, custom quantizations |
| MLX | Native | MLX | Medium (Python package) | Best Apple Silicon performance, fine-tuning |
Ollama wraps llama.cpp and adds model management, an API server, and a CLI. It is the right choice when you want to run a model and move on: it handles downloads, versioning, and serving. The trade-off is that you have less visibility into what is happening at the compute level.
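To make the "API server" part concrete, here is a minimal sketch of calling a local Ollama instance from Python using only the standard library. It assumes Ollama is running on its default port (11434) and that the model has already been pulled. The helper names `build_payload` and `generate` are illustrative and not part of any Ollama client library.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, max_tokens: int = 200) -> dict:
    """Build the JSON body for a non-streaming /api/generate request."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }


def generate(model: str, prompt: str, max_tokens: int = 200, url: str = OLLAMA_URL) -> dict:
    """POST the request to a running Ollama server and return the parsed reply."""
    body = json.dumps(build_payload(model, prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, `generate("llama3.2:3b", "Explain Metal in one sentence.")["response"]` should hold the generated text.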
llama.cpp is the foundation. It is written in C/C++, compiles on nearly every platform, and supports the widest range of quantization formats. If you need to run a model with a custom quantization or experiment with GGUF variants, you are in llama.cpp territory. Compiling for Metal:
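```bash
# Fetch the source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
# Enable the Metal backend and tune the build for the host CPU
cmake -DGGML_METAL=ON -DGGML_NATIVE=ON ..
# macOS does not ship nproc; ask sysctl for the core count instead
make -j"$(sysctl -n hw.ncpu)"
```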
MLX wins on raw throughput for Apple Silicon workloads. A 7B model quantized to 4 bits in MLX typically produces 30–60 tokens/s on an M3 Pro, where the same model as a Q4_K_M GGUF in Ollama produces 15–30 tokens/s. The gap widens with larger models.
The practical choice: start with Ollama for any quick experiment. Graduate to MLX for any serious throughput requirement on Apple Silicon. Use llama.cpp when you need a quantization format or configuration that neither Ollama nor MLX supports.
Run llama3.2:3b in Ollama and time the output for a 200-token generation. If MLX is installed, run the equivalent and compare. The difference will be immediately measurable.
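One way to run this comparison is sketched below. It makes several assumptions:

- Ollama is serving on its default port, and the throughput is read from the `eval_count` and `eval_duration` (nanoseconds) fields of its reply.
- The `mlx-lm` package is installed for the MLX side.
- The Hugging Face repo name `mlx-community/Llama-3.2-3B-Instruct-4bit` and the prompt are illustrative choices, not something this guide prescribes.

```python
import json
import time
import urllib.request

PROMPT = "Write a short story about a lighthouse keeper."  # illustrative prompt


def ollama_tokens_per_second(stats: dict) -> float:
    """Decode throughput from Ollama's counters (eval_duration is in nanoseconds)."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


def bench_ollama(model: str = "llama3.2:3b", prompt: str = PROMPT, max_tokens: int = 200) -> float:
    """Run one non-streaming generation against a local Ollama server."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},
    }).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",  # assumption: default Ollama address
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return ollama_tokens_per_second(json.loads(response.read()))


def bench_mlx(
    repo: str = "mlx-community/Llama-3.2-3B-Instruct-4bit",  # assumed model repo
    prompt: str = PROMPT,
    max_tokens: int = 200,
) -> float:
    """Wall-clock throughput for one MLX generation (requires the mlx-lm package)."""
    from mlx_lm import load, generate  # imported lazily so the Ollama path works without MLX

    model, tokenizer = load(repo)
    start = time.perf_counter()
    text = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    # Approximate: re-tokenizing the output may differ from the true count by a token or two.
    return len(tokenizer.encode(text)) / elapsed
```

The two numbers are not measured identically. The Ollama figure covers decoding only, while the MLX figure is wall-clock time that also includes prompt processing, so it slightly understates MLX's decode speed. Run each function a few times and discard the first result, since the first call also pays the model-load cost.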