06. MLX vs Ollama vs llama.cpp

Chapter 6 of 15 · 15 min

These three runtimes serve different roles, and understanding the trade-offs prevents wasted time. This is a comparison of architecture, not of output quality.

Runtime | Metal Support | Model Format | Install Complexity | Best For
Ollama | Yes (via llama.cpp) | GGUF | Low (single binary) | Quick start, API server, broad model support
llama.cpp | Yes (compile-time flag) | GGUF | Medium (build from source) | Maximum control, research, custom quantizations
MLX | Native | MLX | Medium (Python package) | Best Apple Silicon performance, fine-tuning

Ollama wraps llama.cpp and adds model management, an API server, and a CLI. It is the right choice when you want to run a model and move on: it handles downloads, versioning, and serving. The trade-off is that you have less visibility into what is happening at the compute level.
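To make the "API server" part concrete, here is a minimal sketch of calling a locally running Ollama instance from Python using only the standard library. It assumes Ollama is listening on its default port (11434) and that the model has already been pulled. The helper names `build_generate_request` and `generate` are made up for this illustration; the endpoint and JSON fields follow Ollama's documented `/api/generate` route as I understand it.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream of chunks
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, max_tokens: int = 200) -> dict:
    """Send the request and return the parsed JSON response."""
    req = build_generate_request(model, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (requires a running Ollama server with the model pulled):
#   result = generate("llama3.2:3b", "Explain unified memory in two sentences.")
#   print(result["response"])
```

Splitting request construction from sending keeps the payload easy to inspect without a server, which helps when checking exactly what is being asked of the runtime.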

llama.cpp is the foundation. It is written in C/C++, compiles on nearly every platform, and supports the widest range of quantization formats. If you need to run a model with a custom quantization or experiment with GGUF variants, you are in llama.cpp territory. Compiling for Metal on macOS:

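# Fetch the source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Configure an out-of-tree build with the Metal backend and host-CPU optimizations
mkdir build && cd build
cmake -DGGML_METAL=ON -DGGML_NATIVE=ON ..
# Build in parallel; macOS has no nproc, so ask sysctl for the core count
make -j$(sysctl -n hw.ncpu)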

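MLX wins on raw throughput for Apple Silicon workloads. A 4-bit 7B model in MLX typically produces 30–60 tokens/s on an M3 Pro, where the same model as a Q4_K_M GGUF in Ollama produces 15–30 tokens/s. Q4_K_M is a GGUF quantization type and MLX uses its own 4-bit format, so this compares roughly equivalent quantizations, not identical files. The gap widens with larger models.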
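The tokens/s figures above come down to simple arithmetic: generated tokens divided by generation time. As a hedged sketch, a non-streaming Ollama response reports `eval_count` (generated tokens) and `eval_duration` (in nanoseconds), to the best of my reading of its API documentation. The function names below are hypothetical, and the numbers in the example are illustrative, not measurements.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert a token count and a duration in nanoseconds to tokens/s."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """How many times faster the candidate runtime is than the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps / baseline_tps


# Illustrative numbers only: 200 tokens in 8 s versus 200 tokens in 4 s.
baseline = tokens_per_second(200, 8_000_000_000)   # 25.0 tokens/s
candidate = tokens_per_second(200, 4_000_000_000)  # 50.0 tokens/s
print(f"{baseline:.1f} vs {candidate:.1f} tokens/s, speedup {speedup(baseline, candidate):.1f}x")
```

Using the runtime's own generation counters excludes model load and prompt processing, so compare like with like when setting these numbers against a wall-clock measurement.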

The practical choice: start with Ollama for any quick experiment. Graduate to MLX for any serious throughput requirement on Apple Silicon. Use llama.cpp when you need a quantization format or configuration that neither Ollama nor MLX supports.

EXERCISE

Run llama3.2:3b in Ollama and time the output for a 200-token generation. If MLX is installed, run the equivalent and compare. The difference will be immediately measurable.
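One possible way to set up the MLX side of the exercise is sketched below. It assumes the `mlx-lm` package is installed and uses its `load` and `generate` functions. The model repository name is an assumption (a 4-bit community conversion of Llama 3.2 3B), so substitute whichever MLX model you actually have. `time_generation` and `mlx_run` are hypothetical helper names. The model is loaded outside the timed region so that only generation is measured.

```python
import time
from typing import Callable, Tuple


def time_generation(run: Callable[[], int]) -> Tuple[int, float, float]:
    """Time one generation. `run` must return the number of tokens produced.

    Returns (tokens, elapsed_seconds, tokens_per_second).
    """
    start = time.perf_counter()
    n_tokens = run()
    elapsed = time.perf_counter() - start
    tps = n_tokens / elapsed if elapsed > 0 else float("inf")
    return n_tokens, elapsed, tps


def mlx_run(
    repo: str = "mlx-community/Llama-3.2-3B-Instruct-4bit",  # assumed model name
    prompt: str = "Write a short story about a lighthouse keeper.",
    max_tokens: int = 200,
) -> Callable[[], int]:
    """Load an MLX model once and return a callable that runs one generation."""
    from mlx_lm import load, generate  # imported lazily; needs Apple Silicon

    model, tokenizer = load(repo)

    def run() -> int:
        text = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
        # Approximate token count by re-encoding the generated text.
        return len(tokenizer.encode(text))

    return run


# Usage on a machine with mlx-lm installed:
#   tokens, seconds, tps = time_generation(mlx_run())
#   print(f"{tokens} tokens in {seconds:.2f}s -> {tps:.1f} tokens/s")
```

This measures wall-clock time including prompt processing, and the re-encoded token count is only approximate, so treat the result as a rough comparison and run each side a few times before drawing conclusions.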