Ollama vs llama.cpp — wrapper vs raw runtime
Ollama wraps llama.cpp. Underneath, the inference engine is the same — the throughput gap is small. The decision is about the layer above: do you want Ollama's ergonomics, or do you want llama.cpp's control?
Ollama wins on developer experience by a mile. `ollama pull llama3` and you're running. Model management, OpenAI-compatible API, auto-update — all handled. llama.cpp gives you full control over build flags, kernel selection, server config — at the cost of writing more shell.
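For concreteness, here's what the Ollama path looks like end to end; the model tag and port below are current defaults, so treat them as assumptions to check against your own install.

```bash
# One command downloads, caches, and registers the model.
ollama pull llama3

# Interactive or one-shot generation straight from the CLI.
ollama run llama3 "Explain GGUF in one sentence."

# The daemon also serves an OpenAI-compatible endpoint (default port 11434),
# so existing OpenAI clients work by pointing their base URL at it.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'
```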
Most operators start with Ollama. Some grow out of it as their needs get specific (custom kernel flags, manual KV cache sizing, multi-GPU layer splits).
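What "getting specific" looks like on the llama.cpp side, sketched with commonly used build and `llama-server` options; the model path is hypothetical and flag names occasionally change between versions, so check `--help` on your build.

```bash
# Build with an explicit backend (CUDA shown; Metal, Vulkan, and ROCm are alternatives).
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Launch with manual KV cache sizing and an explicit layer split across two GPUs:
#   --ctx-size      sets the context (and hence KV cache) size instead of a default
#   --n-gpu-layers  controls how many layers are offloaded to GPU
#   --tensor-split  sets the proportion of work per GPU
./build/bin/llama-server \
  -m models/llama-3-8b-instruct.Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --tensor-split 0.6,0.4
```

None of this is difficult, but every line is a decision Ollama would otherwise make for you.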
Quick decision rules
- You want a model answering requests in the next five minutes, with no build step: Ollama.
- You need build flags Ollama doesn't expose, exact reproducibility across machines, or an embedded runtime without a wrapper layer: llama.cpp.
- You're chasing raw throughput: neither choice matters; it's the same engine underneath, so fix the config or move to a serving runtime like vLLM.
Operational matrix
| Dimension | Ollama (local-first wrapper over llama.cpp with ergonomic model management) | llama.cpp (cross-platform CPU+GPU inference; the reference portable runtime) |
|---|---|---|
| Setup time: first-success latency for a new user | Excellent. Single installer; first model running in under 5 min. | Acceptable. Compile + flag selection + GGUF download; ~30 min the first time. |
| Model management: pulling, caching, updating models | Excellent. `ollama pull` + manifest is the design point. | Limited. Manual: download GGUF files and organize them yourself. |
| OpenAI-compatible API: drop-in for existing tools | Excellent. Built-in `/v1/chat/completions`. | Strong. `llama-server` provides an OpenAI-compatible mode. |
| Build / kernel flexibility: custom compile flags, kernel selection | Limited. Hidden behind environment variables; some flags missing. | Excellent. Full Make/CMake control; the design point. |
| Multi-GPU split: layer split across cards | Acceptable. OLLAMA_NUM_PARALLEL + auto-split; less precise control. | Strong. Manual `--n-gpu-layers` + `--tensor-split` for fine control. |
| Reproducibility: same setup six months later | Strong. Manifest + model digest pin; auto-update can drift if you don't pin. | Excellent. Pin commit hash + GGUF; the most reproducible runtime (pinning sketched below the table). |
| Maintenance burden: operator hours per month | Excellent. Effectively zero; auto-update + restart on a schedule. | Strong. Under 1 h/month if you pin; you choose when to upgrade. |
| Concurrent users: how throughput holds up | Limited. OLLAMA_NUM_PARALLEL helps; not a serving runtime. | Limited. Same ceiling; switch to vLLM for serving. |
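The reproducibility row deserves a concrete sketch. Pinning on the llama.cpp side means recording the exact runtime commit and the exact model bytes; the commit hash below is a placeholder and the checksum workflow is one reasonable convention, not the only one.

```bash
# Pin the runtime: build from a recorded commit instead of whatever master is today.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout <commit-hash>   # placeholder: record this hash in your deployment notes
cmake -B build && cmake --build build --config Release -j

# Pin the weights: record and verify the GGUF checksum so "the same model"
# still means the same bytes six months from now.
sha256sum models/llama-3-8b-instruct.Q4_K_M.gguf > model.sha256
sha256sum -c model.sha256
```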
Failure modes — what breaks first
Ollama
- Auto-update can ship a regression that breaks your model
- Hidden config knobs: some llama.cpp flags aren't exposed at all (see the sketch after this list)
- WSL backend flakiness on Windows GPU
- Daemon restart loses concurrent state
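To illustrate the hidden-knobs point: Ollama tuning happens through environment variables read by the daemon, and anything llama.cpp exposes only as a CLI flag may have no equivalent. A sketch of typical daemon-side settings; apart from OLLAMA_NUM_PARALLEL (already mentioned in the table), the variable names and values should be verified against your Ollama version.

```bash
# Set in the service environment (systemd unit, launchd plist, or shell)
# before starting the daemon.
export OLLAMA_HOST=0.0.0.0:11434     # API bind address
export OLLAMA_NUM_PARALLEL=4         # concurrent request slots per loaded model
export OLLAMA_MAX_LOADED_MODELS=2    # how many models may stay resident at once
export OLLAMA_KEEP_ALIVE=30m         # how long a model stays loaded after last use
ollama serve
```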
llama.cpp
- GGUF format drift after major version bumps
- Build flag combinations that compile but produce wrong output (a smoke test for this follows the list)
- Manual model file management → broken symlinks
- Vulkan support varies wildly by GPU + driver
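Because "compiles but produces wrong output" is the hardest failure to notice, a cheap habit is a fixed-prompt smoke test after every rebuild. The flags below are from current `llama-cli`, and the known-good file is your own artifact; adjust names to your setup.

```bash
# Greedy, fixed-seed, short generation: any change in output after a rebuild
# means a flag or kernel change altered behaviour, not just speed.
./build/bin/llama-cli \
  -m models/llama-3-8b-instruct.Q4_K_M.gguf \
  -p "List the first five prime numbers." \
  -n 32 --temp 0 --seed 42 > smoke_test.txt

diff smoke_test.txt smoke_test.known_good.txt
```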
Editorial verdict
Default to Ollama. The DX gap is enormous — model management, auto-update, OpenAI-compatible API, and a sane out-of-the-box config make first-success time five minutes instead of an hour.
Switch to llama.cpp when (a) you need custom build flags Ollama doesn't expose (rare for hobby users; common for advanced multi-GPU setups), (b) you need exact reproducibility across machines, or (c) you're shipping a product that embeds inference and you don't want a wrapper layer.
Don't switch to llama.cpp 'because it's faster' — they're the same engine. Performance differences are usually config differences.