Ollama on Linux — Local AI on Linux (Chapter 5)

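Ollama packages an inference server and model management into a single binary, with no build step required. Model weights are not part of the binary; they are pulled separately on demand. On Linux, install it with the official script or by downloading a release binary manually.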

Install via script:

curl -fsSL https://ollama.com/install.sh | sh

Install manually if you want a specific version:

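OLLAMA_VERSION=0.1.47
# Asset names differ between releases; confirm this file exists on the release page.
# Writing to /usr/local/bin normally requires root, hence sudo.
sudo curl -fsSL "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64" -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama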

Test it:

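ollama serve &   # skip if the install script already started the systemd service
sleep 2          # give the server a moment to bind port 11434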
ollama pull llama3.2:1b
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "What is 2+2?",
  "stream": false
}'
# {"response":"4","context":...,"total_duration":...}

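The total_duration field reports wall-clock time for the whole request in nanoseconds, including any model load time, so it is not a token generation speed. To measure tokens per second, take eval_count (tokens generated) and eval_duration (nanoseconds spent generating) from the same response, divide the first by the second, and multiply by 10^9. With a streaming response, these fields arrive in the final chunk.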
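The Ollama API response reports eval_count and eval_duration alongside total_duration, so a rough tokens-per-second figure can be computed from a single non-streaming call. The sketch below is one way to do that arithmetic; the function name tokens_per_sec is made up for illustration, and the commented usage assumes jq is installed.

```shell
#!/bin/sh
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# Hypothetical helper: prints generation speed in tokens per second.
tokens_per_sec() {
  awk -v count="$1" -v dur_ns="$2" 'BEGIN {
    # Guard against a zero or missing duration instead of dividing by zero.
    if (dur_ns + 0 <= 0) { print "n/a"; exit }
    printf "%.2f\n", count * 1000000000 / dur_ns
  }'
}

# Usage against a running server (assumes jq is installed):
#   resp=$(curl -s http://localhost:11434/api/generate \
#     -d '{"model":"llama3.2:1b","prompt":"What is 2+2?","stream":false}')
#   tokens_per_sec "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"
```

Short prompts produce few tokens, so the figure is noisy; a longer generation gives a steadier number.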

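Failure mode: ollama serve appears to start, but requests to port 11434 fail with connection refused. Check whether anything is listening with ss -tlnp | grep 11434 (or netstat -tlnp | grep 11434 where net-tools is installed). If nothing is listening, the server may have crashed on startup. If Ollama runs as a systemd service, read its log with journalctl -u ollama; otherwise check the output of ollama serve in its terminal, or /var/log/syslog.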
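A connection refused error can also mean the client ran before the server finished binding the port. A minimal sketch of a readiness check follows; the helper name wait_for_url, the retry count, and the delay are illustrative choices, not part of Ollama.

```shell
#!/bin/sh
# wait_for_url URL [TRIES] [DELAY_SECONDS]
# Hypothetical helper: returns 0 once URL answers, 1 if it never does.
wait_for_url() {
  url=$1
  tries=${2:-10}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f: treat HTTP errors as failure; -s: quiet; -o /dev/null: discard the body
    if curl -fs --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   ollama serve &
#   wait_for_url http://localhost:11434/ 15 1 || echo "server did not come up" >&2
```

If the check still fails after the retries, the listening-port and log checks above are the next step.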

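Failure mode: Ollama detects the GPU but returns error loading model: no such file or directory. This happens when the manifest refers to weights at a path that does not exist, typically after a package upgrade moved the model directory. Inspect the manifest with cat ~/.ollama/models/manifests/registry.ollama.ai/library/llama3.2/1b (the tag is a file inside the model's directory) to see which blobs it references, then confirm those files exist under ~/.ollama/models/blobs. If the OLLAMA_MODELS environment variable is set, or the server runs as a different user than the one that pulled the model, look under that models directory instead.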
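To locate a manifest without typing the path by hand, a small helper can build it from the model reference. This is a sketch that assumes the on-disk layout manifests/registry.ollama.ai/library/<model>/<tag> under the models directory, which may differ between Ollama versions; manifest_path is a hypothetical name.

```shell
#!/bin/sh
# manifest_path NAME:TAG [MODELS_DIR]
# Hypothetical helper: prints where the manifest for a library model is expected.
manifest_path() {
  ref=$1
  # Default to OLLAMA_MODELS if set, else the per-user directory.
  models_dir=${2:-${OLLAMA_MODELS:-$HOME/.ollama/models}}
  name=${ref%%:*}            # text before the first colon
  case $ref in
    *:*) tag=${ref#*:} ;;    # text after the first colon
    *)   tag=latest ;;       # no tag given: assume "latest"
  esac
  printf '%s/manifests/registry.ollama.ai/library/%s/%s\n' "$models_dir" "$name" "$tag"
}

# Usage:
#   cat "$(manifest_path llama3.2:1b)"
```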

Failure mode: model download is extremely slow. By default, Ollama pulls from the Ollama library over HTTPS, and it offers no setting for redirecting pulls to a local network cache or mirror. For faster local deployment, download the GGUF file directly, write a Modelfile whose FROM line points at the GGUF path, and run ollama create mymodel -f ./Modelfile.
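The Modelfile for a local GGUF needs only a FROM line. A minimal sketch, assuming the GGUF has already been downloaded; write_modelfile and the paths in the usage comments are placeholders.

```shell
#!/bin/sh
# write_modelfile GGUF_PATH [OUTPUT_FILE]
# Hypothetical helper: writes a one-line Modelfile that points at a local GGUF file.
write_modelfile() {
  gguf=$1
  out=${2:-./Modelfile}
  # Fail early rather than let ollama create report a missing file later.
  if [ ! -f "$gguf" ]; then
    echo "GGUF file not found: $gguf" >&2
    return 1
  fi
  printf 'FROM %s\n' "$gguf" > "$out"
}

# Usage (the GGUF path is a placeholder):
#   write_modelfile ./models/mymodel.gguf ./Modelfile
#   ollama create mymodel -f ./Modelfile
#   ollama run mymodel
```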

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
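One way to keep such a record is to append the environment details to a notes file after each run. The sketch below is only a suggestion; record_run, the field names, and the notes file are illustrative, and the models directory shown is the default per-user location unless OLLAMA_MODELS is set.

```shell
#!/bin/sh
# record_run OUTPUT_FILE [NOTE]
# Hypothetical helper: appends environment details for a reproducible lab note.
record_run() {
  out=$1
  note=${2:-}
  {
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "kernel: $(uname -sr)"
    # ollama may not be on PATH in every workspace, so fall back to a marker.
    echo "ollama: $(command -v ollama >/dev/null 2>&1 && ollama --version 2>&1 | head -n 1 || echo 'not installed')"
    echo "models_dir: ${OLLAMA_MODELS:-$HOME/.ollama/models}"
    echo "note: $note"
    echo "---"
  } >> "$out"
}

# Usage:
#   record_run ./chapter5-notes.txt "llama3.2:1b, CPU only, 8 GB RAM"
```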