Concurrent Requests — Ollama — Installation to Mastery (Chapter 10)

Ollama's REST API can handle multiple requests simultaneously. Understanding how it manages concurrency helps you design systems that scale.

Default Behavior

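How many requests Ollama runs at once for a single model is governed by OLLAMA_NUM_PARALLEL, and the default depends on your Ollama version and available memory. When that value is 1 and one model is loaded, each request waits for the current generation to complete. You can check how your server behaves by sending two requests at the same time: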

time curl -s -o /dev/null http://localhost:11434/api/generate -d '{"model":"llama3.2:1b","prompt":"Count to 100","stream":false}' &
time curl -s -o /dev/null http://localhost:11434/api/generate -d '{"model":"llama3.2:1b","prompt":"Count to 100","stream":false}' &
wait

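If the server handles one request per model at a time, the second request waits for the first to finish, so the pair takes roughly twice as long as a single request. If both finish in about the time one request takes alone, your server is already processing them in parallel.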
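The same experiment can be scripted so that per-request and total times are easy to compare. The following is a minimal Python sketch using only the standard library, not an official client. The helper names generate and time_concurrent are made up for illustration; the endpoint and model are the ones from the curl example.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Same endpoint as the curl example; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(prompt, model="llama3.2:1b", url=OLLAMA_URL):
    """Send one non-streaming generate request and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def time_concurrent(prompts, send=generate):
    """Fire all prompts at once.

    Returns (per-request seconds in input order, total wall-clock seconds).
    `send` is injectable so the timing logic can be tried without a server.
    """
    def timed(prompt):
        start = time.perf_counter()
        send(prompt)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    # One worker thread per prompt so every request is in flight together.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        durations = list(pool.map(timed, prompts))
    return durations, time.perf_counter() - wall_start
```

Calling time_concurrent(["Count to 100"] * 2) against a running server returns the two per-request durations and the total wall-clock time. If one duration is roughly double the other, the second request was queued behind the first; if both are similar, they ran in parallel.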

Concurrent Model Loading

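If you have enough memory, load multiple models to serve different types of requests. Passing an empty prompt loads a model into memory and exits without starting an interactive session:

ollama run llama3.2:1b ""
ollama run codellama:7b ""

Now requests to different models process in parallel. Requests to the same model still queue unless OLLAMA_NUM_PARALLEL allows more than one at a time. The number of models that can stay loaded at once is capped by OLLAMA_MAX_LOADED_MODELS.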
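To confirm that both models are resident before sending traffic, you can query the server's list of running models. A hedged sketch follows. It assumes the GET /api/ps endpoint returns a JSON object with a models array whose entries carry a name field; check the API reference for your version. The names fetch_ps, loaded_model_names, and all_loaded are illustrative.

```python
import json
import urllib.request

# Assumed endpoint for listing models currently held in memory.
PS_URL = "http://localhost:11434/api/ps"


def fetch_ps(url=PS_URL):
    """GET the running-models listing and return the parsed JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def loaded_model_names(ps_response):
    """Extract model names from a /api/ps style response (assumed shape)."""
    return [m["name"] for m in ps_response.get("models", [])]


def all_loaded(wanted, ps_response):
    """Return True if every model in `wanted` appears in the listing."""
    return set(wanted) <= set(loaded_model_names(ps_response))
```

With a server running, all_loaded(["llama3.2:1b", "codellama:7b"], fetch_ps()) tells you whether both models from the example are in memory.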

Queue Management

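Ollama queues excess requests when all available model slots are busy. The queue is bounded by OLLAMA_MAX_QUEUE, which defaults to 512 pending requests. Once the queue is full, the server rejects new requests with HTTP 503 and an error body along these lines (the exact wording can differ between versions):

{
  "error": "server busy, please try again. maximum pending requests exceeded"
}

To allow a longer queue:

export OLLAMA_MAX_QUEUE=1024
ollama serve

This setting limits queue length, not waiting time, so set a timeout in your client if callers should not wait indefinitely.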
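Whatever the cause, a client that talks to a busy server should be prepared for a rejected request. The sketch below retries with exponential backoff, assuming that an overloaded server answers with HTTP 503; any other error is passed through. The names post_generate and generate_with_retry and the retry parameters are illustrative choices, not part of Ollama.

```python
import json
import time
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def post_generate(payload, url=OLLAMA_URL):
    """POST one request; urlopen raises urllib.error.HTTPError on 4xx/5xx."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def generate_with_retry(payload, send=post_generate, retries=4,
                        base_delay=1.0, sleep=time.sleep):
    """Retry on HTTP 503 (assumed overload signal) with exponential backoff.

    Any other HTTP error, or a 503 after the last retry, is re-raised.
    `send` and `sleep` are injectable so the logic can be tested offline.
    """
    for attempt in range(retries + 1):
        try:
            return send(payload)
        except urllib.error.HTTPError as err:
            if err.code != 503 or attempt == retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Backing off rather than retrying immediately gives the queue time to drain instead of adding to the load that caused the rejection.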

Parallelism Configuration

The OLLAMA_NUM_PARALLEL environment variable controls how many requests can process simultaneously per model:

export OLLAMA_NUM_PARALLEL=4
ollama serve

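This allows up to 4 concurrent generations for the same model. The tradeoff is memory and per-request speed: the context allocation grows with the number of parallel slots (a 2K context with 4 parallel requests needs room for an 8K context), and concurrent generations share the same compute, so each one usually runs slower than it would alone.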
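The memory cost of parallel slots can be estimated with simple arithmetic. The sketch below assumes, following the Ollama FAQ, that the allocated context grows by a factor of OLLAMA_NUM_PARALLEL, and it uses the standard transformer key/value cache formula. The layer count, KV head count, and head dimension in the usage note are assumptions for a small Llama-style model, not values reported by Ollama.

```python
def effective_context(num_ctx, num_parallel):
    """Total context tokens allocated when each parallel slot gets num_ctx tokens."""
    return num_ctx * num_parallel


def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV-cache size: one key and one value vector per layer, head, and token.

    bytes_per_value=2 corresponds to 16-bit cache entries (an assumption).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens
```

For example, with a 2048-token context and 4 parallel slots, effective_context(2048, 4) gives 8192 tokens. Plugging in assumed values of 16 layers, 8 KV heads, and a head dimension of 64, kv_cache_bytes(8192, 16, 8, 64) comes to 268,435,456 bytes, or 256 MiB, on top of the model weights.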

Load Testing

Use ab (Apache Bench) or wrk to test concurrency:

# Install Apache Bench (Debian/Ubuntu)
sudo apt install apache2-utils

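# Test with 10 concurrent requests (50 total).
# -s raises ab's default 30-second timeout, which queued generations can exceed.
ab -n 50 -c 10 -s 300 -p request.json -T application/json http://localhost:11434/api/generate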

Where request.json contains:

{"model":"llama3.2:1b","prompt":"Hello","stream":false}

Monitor system resources with nvidia-smi or top during the test to identify bottlenecks.
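ab reports its own statistics, but if you collect timings yourself it helps to reduce them to a few numbers. A small sketch, assuming you have a list of per-request durations in seconds and the total wall-clock time of the run. The helpers percentile and summarize are illustrative, and the nearest-rank method used here is only one of several percentile definitions.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    # Multiply before dividing so integer percentages give exact ranks.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations, wall_seconds):
    """Reduce a load-test run to count, mean, median, tail latency, and throughput."""
    return {
        "requests": len(durations),
        "mean_s": statistics.mean(durations),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "requests_per_s": len(durations) / wall_seconds,
    }
```

Comparing p95_s and requests_per_s across runs with different OLLAMA_NUM_PARALLEL values shows whether extra parallel slots improve throughput or mainly lengthen individual responses.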