06. Running AI Containers in Docker

Chapter 6 of 15 · 20 min

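Docker containers let you run AI tooling without fighting dependency conflicts on the host. The official Ollama image is the simplest starting point. From inside WSL2, start the server with a named volume so downloaded models survive container restarts: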

docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  --gpus all \
  ollama/ollama:latest
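Before pulling models, it can help to confirm that the server inside the container is answering on the published port. The sketch below polls Ollama's /api/version endpoint on localhost:11434, the port mapped above. The function name wait_for_ollama and its 30-second default timeout are illustrative choices, not part of Ollama.

```shell
# Hypothetical helper: poll the Ollama HTTP API until it answers or we give up.
# Assumes the container publishes port 11434 on localhost, as in the docker run above.
wait_for_ollama() {
  local timeout="${1:-30}"   # seconds to wait before giving up
  local url="${2:-http://localhost:11434/api/version}"
  local waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP errors; -sS stays quiet except for errors
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "Ollama is answering at $url"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "Ollama did not answer at $url within ${timeout}s" >&2
  return 1
}
```

Source this in your WSL2 shell and call wait_for_ollama before the pull commands. If it times out, docker logs ollama is the first place to look.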

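The --gpus all flag gives the container access to all NVIDIA GPUs. When Docker Engine is installed directly inside the WSL2 distribution, this requires the NVIDIA Container Toolkit in that same distribution (Docker Desktop's WSL2 backend provides GPU support on its own). If docker run fails with an error such as docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]], install the toolkit: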

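# Add NVIDIA's signed package repository (apt-key and the old nvidia-docker repo are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the toolkit and register the nvidia runtime with Docker
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
# Restart the daemon; if systemd is not enabled in WSL2, use: sudo service docker restart
sudo systemctl restart docker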
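To confirm that nvidia-ctk actually registered the runtime with Docker after the restart, you can inspect docker info. The following is a minimal sketch that assumes the Docker CLI can reach the daemon. check_nvidia_runtime is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if Docker reports an "nvidia" runtime.
# nvidia-ctk runtime configure --runtime=docker adds this entry to
# /etc/docker/daemon.json; it only appears in docker info after a restart.
check_nvidia_runtime() {
  local runtimes
  # {{json .Runtimes}} prints the daemon's configured runtimes as a JSON object
  if ! runtimes=$(docker info --format '{{json .Runtimes}}' 2>/dev/null); then
    echo "Could not query the Docker daemon" >&2
    return 2
  fi
  case "$runtimes" in
    *'"nvidia"'*)
      echo "nvidia runtime is registered with Docker"
      return 0
      ;;
    *)
      echo "nvidia runtime not found; rerun nvidia-ctk and restart Docker" >&2
      return 1
      ;;
  esac
}
```

The three return codes distinguish "all good" (0), "toolkit not wired into Docker" (1), and "daemon unreachable" (2), so the helper can gate later steps in a script.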

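After Docker restarts, the --gpus all flag works. If the earlier docker run failed, it may have left a container named ollama behind; remove it with docker rm ollama and rerun the command. Then pull a model and verify: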

docker exec -it ollama ollama pull llama3.2:1b
docker exec -it ollama ollama run llama3.2:1b "Hello"
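To turn "slow" into a number, ollama run accepts a --verbose flag that prints timing statistics after the response, including an eval rate line in tokens per second. The sketch below extracts that figure. The helper names are hypothetical, and the assumption that the line looks like "eval rate: 123.45 tokens/s" should be checked against your Ollama version's output.

```shell
# Hypothetical helper: pull the generation speed (tokens/s) out of the
# statistics printed by "ollama run --verbose".
# Assumes a line of the form "eval rate: <number> tokens/s".
parse_eval_rate() {
  # Anchored at line start so "prompt eval rate:" is not matched.
  awk '/^eval rate:/ { print $3; found = 1 } END { if (!found) exit 1 }'
}

measure_eval_rate() {
  local model="${1:-llama3.2:1b}"
  local prompt="${2:-Hello}"
  # No -t: without a TTY the stats arrive as plain lines. 2>&1 captures them
  # whether Ollama writes them to stdout or stderr.
  docker exec ollama ollama run --verbose "$model" "$prompt" 2>&1 | parse_eval_rate
}
```

For example, measure_eval_rate llama3.2:1b "Hello" prints a single number you can compare against what you expect from your GPU.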

If the model runs but generation is slow (for example, under 5 tokens/second for this 1B model on an RTX 3060), Ollama may be running on the CPU. Check that the GPU is actually visible inside the container:

docker exec -it ollama nvidia-smi

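If nvidia-smi works on the host but fails inside the container, one of three things is usually wrong. The NVIDIA Container Toolkit may not be installed correctly. The Docker daemon may not have been restarted after nvidia-ctk runtime configure. Or the container may have been created without --gpus all, in which case you need to remove it and recreate it with the flag.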
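Ollama can also report where it placed a loaded model. ollama ps lists running models with a PROCESSOR column that reads, for example, 100% GPU, 100% CPU, or a CPU/GPU split. The following is a minimal sketch that checks it, assuming that column layout. model_on_gpu is a hypothetical helper, and the model must have been used recently enough to still be loaded.

```shell
# Hypothetical helper: report where Ollama placed a loaded model.
# Assumes "ollama ps" prints a header row, then one row per loaded model, and
# that the PROCESSOR column contains "GPU" when any layers are on the GPU.
model_on_gpu() {
  local model="${1:-llama3.2:1b}"
  local row
  # Skip the header and keep the row whose first column is exactly the model name.
  row=$(docker exec ollama ollama ps | awk -v m="$model" 'NR > 1 && $1 == m')
  if [ -z "$row" ]; then
    echo "$model is not loaded; run a prompt first" >&2
    return 2
  fi
  case "$row" in
    *GPU*) echo "$model is using the GPU: $row"; return 0 ;;
    *)     echo "$model is running on CPU only: $row" >&2; return 1 ;;
  esac
}
```

A partial offload (a CPU/GPU split) still counts as GPU here. If you want to catch that case too, tighten the pattern to *100%\ GPU*.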

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
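In this chapter, the package version, runtime, and data path map onto concrete values: the Docker server version, the Ollama version inside the container, the GPU and driver, and the volume backing /root/.ollama. The sketch below appends those values to a notes file. The file name verification-notes.txt and the helper name are hypothetical.

```shell
# Hypothetical helper: append the environment details the checkpoint asks for.
# Each probe falls back to "unavailable" so one missing tool does not abort the log.
record_environment() {
  local out="${1:-verification-notes.txt}"
  {
    echo "== $(date -u +%Y-%m-%dT%H:%M:%SZ) =="
    echo "docker server: $(docker version --format '{{.Server.Version}}' 2>/dev/null || echo unavailable)"
    echo "ollama: $(docker exec ollama ollama --version 2>/dev/null || echo unavailable)"
    echo "gpu/driver: $(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null || echo unavailable)"
    echo "data volume: $(docker volume inspect ollama_data --format '{{.Mountpoint}}' 2>/dev/null || echo unavailable)"
  } >>"$out"
  echo "Appended environment details to $out"
}
```

Run it once after each successful exercise and paste the observed output underneath, so the record captures both the environment and the result.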

EXERCISE

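Run docker run --rm --gpus all nvidia/cuda:12.5.0-base-ubuntu22.04 nvidia-smi inside WSL2 and confirm your GPU appears in the output; the --rm flag removes the container automatically when it exits. Then start the Ollama container, run a prompt against llama3.2:1b, and confirm GPU access with docker exec -it ollama ollama ps, whose PROCESSOR column should show GPU rather than CPU (ollama list only shows downloaded models, not where they run).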
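You can script the first half of the exercise by asking nvidia-smi for its short GPU list (-L), which prints one line per GPU beginning with GPU 0:, GPU 1:, and so on. The following sketch assumes that output format. count_container_gpus is a hypothetical name, and the default image tag is the one from the exercise.

```shell
# Hypothetical helper: count the GPUs visible inside a throwaway CUDA container.
# --rm deletes the container as soon as nvidia-smi exits.
count_container_gpus() {
  local image="${1:-nvidia/cuda:12.5.0-base-ubuntu22.04}"
  local listing
  if ! listing=$(docker run --rm --gpus all "$image" nvidia-smi -L 2>&1); then
    echo "GPU container failed to start: $listing" >&2
    return 1
  fi
  # nvidia-smi -L prints one "GPU <index>: <name> (UUID: ...)" line per device.
  printf '%s\n' "$listing" | grep -c '^GPU [0-9]'
}
```

A count of at least 1 confirms the GPU is visible to containers. A count of 0 or a startup error points back to the toolkit installation.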