01. What is Ollama?
Ollama is a local model serving runtime that downloads, configures, and runs large language models on your own hardware. It wraps model weights in a standardized format and exposes a REST API and CLI for interacting with them. The tool handles the complexity of model loading, tokenization, and inference without requiring you to write Python code or manage CUDA kernels directly.
Ollama stores models in a local directory (typically ~/.ollama/models on Linux and macOS, %USERPROFILE%\\.ollama\\models on Windows). When you run a model, Ollama loads it into memory, manages the inference pipeline, and exposes endpoints for generating text, chatting, and creating embeddings. Models run as long-running processes that maintain state across requests.
The architecture consists of three layers:
Model registry - The command
ollama pulldownloads model manifests and weights from ollama.com. Each model has a name (likellama3.2:3b) and a size. You can inspect available models withollama list.Inference engine - When you run
ollama run, the engine loads the model weights and starts an interactive session. Behind the scenes, Ollama uses llama.cpp for CPU inference and CUDA/Metal for GPU acceleration.API layer - The HTTP server listens on port 11434 by default. You can send POST requests to
/api/generate,/api/chat, and/api/embeddings. The Python library wraps these endpoints.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Run ollama --version to verify the binary is installed, then run curl http://localhost:11434 to check if the API is running (expect a JSON response with status field).
# Linux/macOS
ollama --version
curl http://localhost:11434
# Windows PowerShell
ollama --version
Invoke-RestMethod -Uri http://localhost:11434