What is Ollama? — Ollama — Installation to Mastery (Chapter 1)

Ollama is a local model-serving runtime that downloads, configures, and runs large language models on your own hardware. It packages model weights in a standardized format and exposes a REST API and a CLI for interacting with them. The tool handles model loading, tokenization, and inference, so you do not need to write Python code or manage CUDA kernels directly.

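Ollama stores models in a local directory (typically ~/.ollama/models on Linux and macOS, %USERPROFILE%\.ollama\models on Windows). When you run a model, Ollama loads it into memory, manages the inference pipeline, and exposes endpoints for generating text, chatting, and creating embeddings. The Ollama server is a long-running process that keeps a loaded model in memory for a short time after the last request (five minutes by default), so follow-up requests avoid reloading it. The API itself does not remember conversations: a chat client sends the message history with each request.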
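To check the storage location on your own machine, a short script can resolve the typical path named above and report how much disk space it uses. This is a minimal sketch, not part of Ollama itself: the helper names default_models_dir and describe are hypothetical, and it assumes that the OLLAMA_MODELS environment variable, when set, overrides the default location. Installs that run Ollama as a system service may keep models under a different home directory.

```python
import os
from pathlib import Path


def default_models_dir() -> Path:
    """Return the directory where Ollama is expected to keep models.

    Assumption: OLLAMA_MODELS, when set, overrides the default location.
    """
    override = os.environ.get("OLLAMA_MODELS")
    if override:
        return Path(override).expanduser()
    # Path.home() resolves to $HOME on Linux/macOS and %USERPROFILE% on Windows.
    return Path.home() / ".ollama" / "models"


def describe(models_dir: Path) -> str:
    """Summarize whether the directory exists and how much disk it uses."""
    if not models_dir.is_dir():
        return f"{models_dir}: not found (no model pulled yet, or a different install layout)"
    total_bytes = sum(p.stat().st_size for p in models_dir.rglob("*") if p.is_file())
    return f"{models_dir}: {total_bytes / 1e9:.2f} GB on disk"


if __name__ == "__main__":
    print(describe(default_models_dir()))
```

Recording this path alongside your exercise results makes it easier to explain differences in disk usage between machines.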

The architecture consists of three layers:

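Model registry - The command ollama pull downloads model manifests and weights from the registry at ollama.com. Each model is identified by a name and a tag (like llama3.2:3b), where the tag usually indicates the parameter count or variant. You can list the models already downloaded to your machine with ollama list.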
Inference engine - When you run ollama run, the engine loads the model weights and starts an interactive session. Behind the scenes, Ollama builds on llama.cpp for inference, which runs on the CPU and can offload work to the GPU through backends such as CUDA (NVIDIA) and Metal (Apple Silicon).
API layer - The HTTP server listens on port 11434 by default. You can send POST requests to /api/generate, /api/chat, and /api/embeddings (newer releases also provide /api/embed for embeddings). The official ollama Python library wraps these endpoints.
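The API layer can be exercised with nothing more than the Python standard library. The sketch below builds a POST request for /api/generate and, when a server is reachable, prints the reply. It assumes a server running at http://localhost:11434 and that the model llama3.2:3b has already been pulled; the function names build_generate_request and generate are hypothetical helpers for illustration, not part of any Ollama library.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # default port mentioned in the text


def build_generate_request(model: str, prompt: str,
                           host: str = OLLAMA_HOST) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    # "stream": False asks for one JSON object instead of a stream of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["response"]


if __name__ == "__main__":
    try:
        print(generate("llama3.2:3b", "In one sentence, what is Ollama?"))
    except OSError as exc:  # covers connection errors and HTTP errors
        print(f"Could not get a reply from {OLLAMA_HOST}: {exc}")
```

Separating request construction from sending keeps the payload easy to inspect without a running server. The same pattern should carry over to /api/chat, where the payload holds a list of messages instead of a single prompt.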

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
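One way to keep these notes consistent is to collect them in a small structured record. The following is only a sketch: the function checkpoint_record and its field names are hypothetical, and the package version must be supplied by you (for example, copied from the output of ollama --version), because the script does not query Ollama.

```python
import json
import platform
from datetime import datetime, timezone


def checkpoint_record(model, backend, data_path, observed_output,
                      package_version="unknown", notes=""):
    """Collect the facts the checkpoint asks for into one dictionary."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "package_version": package_version,  # supplied by hand, not detected
        "runtime": f"Python {platform.python_version()} on "
                   f"{platform.system()} {platform.machine()}",
        "model": model,
        "backend": backend,          # e.g. "cpu", "cuda", or "metal"
        "data_path": data_path,
        "observed_output": observed_output,
        "notes": notes,              # constraints such as model size or memory
    }


if __name__ == "__main__":
    record = checkpoint_record(
        model="llama3.2:3b",
        backend="cpu",
        data_path="~/.ollama/models",
        observed_output="<paste the model's reply here>",
        notes="Output may vary with model size and available memory.",
    )
    print(json.dumps(record, indent=2))
```

Saving the printed JSON next to each exercise gives you a record you can compare across machines.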