05. Ollama on Linux
Ollama ships an inference server and model management in a single binary with no build step required; model weights are downloaded separately with ollama pull. On Linux, install via the official script or by downloading a release binary manually.
Install via script:
curl -fsSL https://ollama.ai/install.sh | sh
Install manually if you want a specific version:
OLLAMA_VERSION=0.1.47
# Writing to /usr/local/bin normally requires root, hence sudo
sudo curl -fsSL https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64 -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama
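The manual install above hardcodes amd64. As a minimal sketch, the helper below builds the download URL for the current machine instead. The function name ollama_release_url is hypothetical, and the arm64 asset name is an assumption based on the amd64 naming pattern; check the release page for the version you pin.

```shell
# Build the release download URL for a given version and machine type.
# Assumption: assets are named ollama-linux-amd64 and ollama-linux-arm64.
ollama_release_url() {
    version="$1"
    machine="${2:-$(uname -m)}"   # defaults to the current machine
    case "$machine" in
        x86_64|amd64)  arch=amd64 ;;
        aarch64|arm64) arch=arm64 ;;
        *) echo "unsupported architecture: $machine" >&2; return 1 ;;
    esac
    echo "https://github.com/ollama/ollama/releases/download/v${version}/ollama-linux-${arch}"
}

# Usage (not executed here):
#   sudo curl -fsSL "$(ollama_release_url 0.1.47)" -o /usr/local/bin/ollama
```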
Test it:
ollama serve &
ollama pull llama3.2:1b
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:1b",
"prompt": "What is 2+2?",
"stream": false
}'
# {"response":"4","context":...,"total_duration":...}
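The total_duration field reports the wall-clock time for the whole request in nanoseconds, including model load and prompt evaluation, so it does not tell you token generation speed. To measure tokens per second, use a streaming response and divide the token count by the elapsed generation time.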
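As a cross-check on a manual streaming measurement, the final response object from /api/generate also reports eval_count (generated tokens) and eval_duration (nanoseconds spent generating). The sketch below turns those two numbers into tokens per second. The function name tokens_per_second is hypothetical, and you copy the two values out of the JSON yourself.

```shell
# Convert eval_count (tokens) and eval_duration (nanoseconds) to tokens per second.
tokens_per_second() {
    awk -v count="$1" -v ns="$2" 'BEGIN {
        if (ns <= 0) exit 1              # reject a zero or negative duration
        printf "%.2f\n", count / (ns / 1e9)
    }'
}

# Usage: 50 tokens generated in 2 seconds (2000000000 ns)
#   tokens_per_second 50 2000000000
```

Using eval_duration rather than total_duration keeps model loading and prompt evaluation out of the figure, which is why the two numbers can differ a lot on a cold start.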
Failure mode: ollama serve appears to start, but requests to port 11434 fail with connection refused. Check whether anything is listening with netstat -tlnp | grep 11434 (or ss -tlnp | grep 11434 where netstat is not installed). If nothing is listening, the server may have crashed on startup. Check /var/log/syslog, or run journalctl -u ollama if Ollama runs as a systemd service.
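A connection refused error can also appear simply because a request was sent before the server finished starting, for example when ollama pull runs immediately after ollama serve &. A minimal sketch of a readiness check follows. The function name wait_for_ollama and its defaults are hypothetical; it relies on the server answering plain HTTP GET requests on its root URL.

```shell
# Poll the server until it answers or the attempts run out.
# Returns 0 when the server responds, 1 otherwise.
wait_for_ollama() {
    url="${1:-http://localhost:11434/}"
    attempts="${2:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Usage (not executed here):
#   ollama serve &
#   wait_for_ollama || journalctl -u ollama --no-pager | tail -n 20
```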
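Failure mode: Ollama detects the GPU but returns error loading model: no such file or directory. This happens when the model manifest points at weights in a path that no longer exists, typically after a package upgrade moved the model directory. Check cat ~/.ollama/models/manifests/registry.ollama.ai/library/llama3.2/1b (the tag is a file inside the model's directory, not part of the directory name) to see which blobs the model references; the weights themselves are stored under ~/.ollama/models/blobs. If Ollama runs as the systemd service created by the install script, look under /usr/share/ollama/.ollama/models instead.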
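To avoid typing the manifest path by hand, it can be derived from the model reference. This is a sketch under stated assumptions: the function name manifest_path is hypothetical, it only handles models from the default library namespace, and it assumes the layout manifests/registry.ollama.ai/library/NAME/TAG under the models directory, which may differ between Ollama versions. It honours the OLLAMA_MODELS environment variable when set.

```shell
# Print the expected manifest path for a reference such as llama3.2:1b.
manifest_path() {
    ref="$1"
    models_dir="${OLLAMA_MODELS:-$HOME/.ollama/models}"
    name="${ref%%:*}"                 # text before the first colon
    case "$ref" in
        *:*) tag="${ref#*:}" ;;       # text after the first colon
        *)   tag=latest ;;            # no tag given: Ollama defaults to latest
    esac
    echo "${models_dir}/manifests/registry.ollama.ai/library/${name}/${tag}"
}

# Usage (not executed here):
#   cat "$(manifest_path llama3.2:1b)"
```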
Failure mode: Model download is extremely slow. By default, Ollama pulls from the Ollama library over HTTPS, and it has no setting to redirect pulls to a local network cache or mirror. For faster local deployment, download the GGUF file directly, write a Modelfile whose FROM line points at the GGUF path, and run ollama create mymodel -f ./Modelfile.
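The GGUF workaround can be sketched as a small helper that writes the Modelfile for you. The function name write_modelfile and the example paths are hypothetical; the only Modelfile instruction used is FROM, which accepts a path to a local GGUF file.

```shell
# Write a one-line Modelfile that points at a local GGUF file.
write_modelfile() {
    gguf_path="$1"
    out="${2:-./Modelfile}"
    if [ ! -f "$gguf_path" ]; then
        echo "GGUF file not found: $gguf_path" >&2
        return 1
    fi
    printf 'FROM %s\n' "$gguf_path" > "$out"
}

# Usage (not executed here; the path is a placeholder):
#   write_modelfile /data/models/mymodel.gguf ./Modelfile
#   ollama create mymodel -f ./Modelfile
```

Checking that the file exists before writing the Modelfile surfaces a wrong path immediately, rather than at ollama create time.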
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Install Ollama, pull a small model (1B parameters), make a non-streaming API call, and record the total_duration field. Run it again with streaming and calculate tokens per second manually.
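For the streaming half of the exercise, one possible approach is to count the streamed chunks and divide by elapsed time. This is a rough sketch, not a precise benchmark: the function name count_stream_chunks is hypothetical, it assumes the server emits one compact JSON object per line with "done":false on every chunk except the last, and it treats each chunk as roughly one token.

```shell
# Count streamed response chunks read from standard input.
# Assumption: one JSON object per line, "done":false on all but the final line.
count_stream_chunks() {
    grep -c '"done":false'
}

# Usage (not executed here; date +%s.%N requires GNU date):
#   start=$(date +%s.%N)
#   chunks=$(curl -sN http://localhost:11434/api/generate \
#       -d '{"model": "llama3.2:1b", "prompt": "What is 2+2?"}' | count_stream_chunks)
#   end=$(date +%s.%N)
#   awk -v n="$chunks" -v s="$start" -v e="$end" \
#       'BEGIN { printf "%.2f chunks/s\n", n / (e - s) }'
```

The elapsed time measured this way still includes model loading and prompt evaluation, so run the request twice and record the second, warm result alongside the first.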