Multiple Models — Ollama — Installation to Mastery (Chapter 4)

Ollama can keep several models loaded at the same time, memory permitting. Each loaded model is served by its own runner process, and you can switch between models or query them in parallel via the API.

Listing Installed Models

ollama list

Output shows the model name, ID, size, and last modified date:

NAME                    ID           SIZE      MODIFIED
llama3.2:1b             46536d0c3d4d 1.3GB    2024-11-15 10:23:41
llama3.2:3b             a3fe2398f87b 2.0GB    2024-11-15 11:45:12
codellama:7b            f4e2de43f668 3.8GB    2024-11-14 09:12:33
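
For scripting, the table can be reduced to just the model names. The following is a minimal sketch, assuming the header row and column order shown above; model_names is a hypothetical helper name, not an Ollama command.

```shell
# Hypothetical helper: print only the NAME column from `ollama list` output.
# It reads the table on stdin, so it can be tried on saved output as well.
model_names() {
  # Skip the header row (NR > 1), ignore blank lines, print the first field.
  awk 'NR > 1 && NF > 0 { print $1 }'
}

# Usage (requires Ollama to be installed):
#   ollama list | model_names
```

Reading from stdin rather than calling ollama inside the function keeps the parsing step separate from the command that produces the table.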

Running Multiple Models Interactively

You can run several ollama run sessions in separate terminals. Each loaded model consumes memory independently, so sessions that use different models add up. To free resources for a new model:

# Stop a running model
ollama stop llama3.2:1b

# Check running models
ollama ps

The ollama ps output shows each loaded model, its memory footprint, the processor it runs on, and when it is scheduled to be unloaded:

NAME            ID              SIZE      PROCESSOR    UNTIL
llama3.2:3b     a3fe2398f87b    2.0GB     100% GPU     4 minutes from now
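
To check from a script whether a particular model is currently loaded, the NAME column of ollama ps can be matched exactly. This is a sketch that assumes the output keeps the layout shown above; is_loaded is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 if the given model name appears in the NAME
# column of `ollama ps` output supplied on stdin, exit 1 otherwise.
is_loaded() {
  # -v passes the shell argument into awk; the header row is skipped.
  awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

# Usage (requires a running Ollama server):
#   ollama ps | is_loaded llama3.2:3b && echo "loaded"
```

The comparison is an exact match on the full tag, so llama3.2:3b and llama3.2:1b are treated as different models.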

API Parallel Requests

The REST API handles concurrent requests. With two models running, you can send requests to different models in parallel:

curl http://localhost:11434/api/generate -d '{"model":"llama3.2:1b","prompt":"Hello","stream":false}' &
curl http://localhost:11434/api/generate -d '{"model":"codellama:7b","prompt":"def hello():","stream":false}' &
wait

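Ollama queues requests when there is not enough memory to load another model. If performance degrades with several models loaded, stop the ones you are not using, or cap concurrency through the server's environment variables, such as OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL.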
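
The two curl commands above can be generalized to any number of models. The following is a minimal sketch, not an official client: ask_all and the OLLAMA_URL variable are hypothetical names, and the prompt is interpolated into the JSON unescaped, so it must not contain double quotes or backslashes.

```shell
# Hypothetical helper: send the same prompt to several models in parallel.
# OLLAMA_URL is an assumed variable name; it defaults to the local server
# address used in this chapter.
ask_all() {
  prompt=$1
  shift
  for model in "$@"; do
    # Each curl runs in the background so the requests overlap.
    curl -s "${OLLAMA_URL:-http://localhost:11434}/api/generate" \
      -d "{\"model\":\"$model\",\"prompt\":\"$prompt\",\"stream\":false}" &
  done
  # Block until every background request has finished.
  wait
}

# Usage (requires a running Ollama server with both models pulled):
#   ask_all "Hello" llama3.2:1b codellama:7b
```

Responses arrive in whatever order the models finish, so the output order is not guaranteed to match the argument order.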

Copying and Removing Models

# Create a copy with a new name
ollama cp llama3.2:1b my-custom-llama

# Remove a model
ollama rm codellama:7b

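Removing a model frees its disk space immediately. There is no trash folder; deletion is permanent. A copy made with ollama cp shares the underlying data with the original, so the space is reclaimed only once no remaining model references it.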
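
Because removal cannot be undone, a script may want to ask before deleting. The following is a small sketch of such a guard; confirm_rm is a hypothetical wrapper name, and only the ollama rm call inside it is an Ollama command.

```shell
# Hypothetical helper: ask for confirmation before deleting a model,
# since `ollama rm` has no undo.
confirm_rm() {
  # The question goes to stderr so stdout carries only the result.
  printf 'Permanently remove %s? [y/N] ' "$1" >&2
  read -r answer
  case $answer in
    y|Y) ollama rm "$1" ;;
    *)   echo "Skipped $1" ;;
  esac
}

# Usage (requires Ollama to be installed):
#   confirm_rm codellama:7b
```

Anything other than y or Y, including an empty answer, leaves the model in place.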

Local verification checkpoint

Run the smallest example from this chapter locally and record the Ollama version (ollama --version), the operating system, the model tags used, and the observed output. If the result depends on model size, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.