10. Concurrent Requests
Ollama's REST API accepts multiple requests at once, but whether they run in parallel or wait in a queue depends on how the server is configured. Understanding how it manages concurrency helps you design systems that scale.
Default Behavior
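How Ollama handles simultaneous requests to one model depends on the OLLAMA_NUM_PARALLEL setting. When it is 1, requests to a loaded model are processed one at a time, and each new request waits for the current generation to complete. The default varies by release (some versions auto-select 1 or 4 based on available memory), so check the behavior of your installation by sending two requests at once: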
time curl http://localhost:11434/api/generate -d '{"model":"llama3.2:1b","prompt":"Count to 100","stream":false}' &
time curl http://localhost:11434/api/generate -d '{"model":"llama3.2:1b","prompt":"Count to 100","stream":false}' &
wait
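If the server handles one request at a time, the second request finishes only after the first, so the pair takes roughly twice as long as a single request. If both finish at about the same time, your server is processing them in parallel.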
Concurrent Model Loading
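If you have enough memory, keep multiple models loaded to serve different types of requests. Running a model with an empty prompt loads it without starting an interactive session:
ollama run llama3.2:1b ""
ollama run codellama:7b ""
Requests to different models can now be processed in parallel. The OLLAMA_MAX_LOADED_MODELS environment variable caps how many models stay loaded at once, and an idle model is unloaded after its keep-alive period (5 minutes by default). Requests to the same model still queue unless OLLAMA_NUM_PARALLEL is greater than 1.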
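Models can also be loaded through the REST API, which is handy when the client runs on another machine. The sketch below assumes the server listens on the default http://localhost:11434. A generate request that names a model but carries no prompt loads the model without generating anything, and GET /api/ps lists what is currently loaded. The names OLLAMA_URL, preload_model, and loaded_models are hypothetical helpers chosen for illustration.

```sh
# Base URL of the Ollama server (assumption: default local install).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Load a model into memory by sending a generate request without a prompt.
# $1 = model name, e.g. llama3.2:1b
preload_model() {
  curl -s "$OLLAMA_URL/api/generate" -d "{\"model\":\"$1\"}"
}

# List the models that are currently loaded.
loaded_models() {
  curl -s "$OLLAMA_URL/api/ps"
}

# Usage:
#   preload_model llama3.2:1b
#   preload_model codellama:7b
#   loaded_models
```

If the second model does not appear in the output of loaded_models, the server may have evicted the first one because memory or the loaded-model limit was exhausted.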
Queue Management
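Ollama queues excess requests when all available model slots are busy. The queue length is capped by the OLLAMA_MAX_QUEUE environment variable, which defaults to 512. Once the queue is full, the server rejects new requests with an HTTP 503 response and an error body along the lines of:
{
"error": "server busy, please try again. maximum pending requests exceeded"
}
To allow a longer queue:
export OLLAMA_MAX_QUEUE=1024
ollama serve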
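A client can cope with a busy server by retrying. The sketch below assumes that Ollama answers with HTTP 503 when it cannot accept more work, as its FAQ describes for a full queue, and retries with exponential backoff. The function name generate_with_retry, the attempt limit, and the delays are illustrative choices, not part of Ollama.

```sh
# Base URL of the Ollama server (assumption: default local install).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# POST a JSON body to /api/generate, retrying while the server answers 503.
# $1 = JSON request body, $2 = maximum attempts (default 5).
# Prints the response body; returns 0 only if the final status is 200.
generate_with_retry() {
  body=$1
  max_attempts=${2:-5}
  delay=1
  attempt=1
  tmp=$(mktemp)
  while [ "$attempt" -le "$max_attempts" ]; do
    # -o writes the response body to a file; -w prints only the status code.
    status=$(curl -s -o "$tmp" -w '%{http_code}' \
      "$OLLAMA_URL/api/generate" -d "$body")
    if [ "$status" != "503" ]; then
      cat "$tmp"
      rm -f "$tmp"
      if [ "$status" = "200" ]; then return 0; else return 1; fi
    fi
    sleep "$delay"
    delay=$((delay * 2))        # exponential backoff: 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
  rm -f "$tmp"
  return 1                      # still busy after the last attempt
}

# Usage:
#   generate_with_retry '{"model":"llama3.2:1b","prompt":"Hello","stream":false}'
```

Backing off gives the server time to drain its queue. Retrying immediately would add load at exactly the moment the server is overloaded.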
Parallel Request Configuration
The OLLAMA_NUM_PARALLEL environment variable controls how many requests can process simultaneously per model:
export OLLAMA_NUM_PARALLEL=4
ollama serve
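This allows up to 4 concurrent generations for the same model. The tradeoff is memory and per-request speed: the memory reserved for context grows with the number of parallel slots, and concurrent generations share the same GPU, so each one may run slower than it would alone.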
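Parallel slots have a memory cost. According to Ollama's FAQ, parallel processing multiplies the context allocated for a model by the number of parallel requests. The calculation below assumes that this linear scaling holds for your version; total_context is a hypothetical helper, not an Ollama command.

```sh
# Total context (in tokens) allocated for one loaded model, assuming it
# scales linearly: per-request context size x number of parallel slots.
# $1 = per-request context size, $2 = value of OLLAMA_NUM_PARALLEL
total_context() {
  echo $(( $1 * $2 ))
}

total_context 4096 4   # prints 16384
```

If the larger allocation no longer fits in GPU memory, lowering either the context size or the parallel setting brings it back down.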
Load Testing
Use ab (Apache Bench) or wrk to test concurrency:
# Install Apache Bench (Debian/Ubuntu)
sudo apt install apache2-utils
# Test with 10 concurrent requests (-s raises ab's 30-second default timeout, which slow generations can exceed)
ab -n 50 -c 10 -s 300 -p request.json -T application/json http://localhost:11434/api/generate
Where request.json contains:
{"model":"llama3.2:1b","prompt":"Hello","stream":false}
Monitor system resources with nvidia-smi or top during the test to identify bottlenecks.
Load a model, then use curl in a loop to send 10 requests in the background with "stream": false. Measure the total time and compare it with sending the same 10 requests one after another to see how queuing affects throughput.
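One way to run this comparison is sketched below. It assumes the default server address and the llama3.2:1b model used earlier. The helper names (send_one, run_sequential, run_concurrent) are made up for this example, and timing is in whole seconds, so use a prompt that takes at least a second or two to answer.

```sh
# Assumptions: default local server and the model used in this section.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
BODY='{"model":"llama3.2:1b","prompt":"Hello","stream":false}'

# Send one non-streaming request and discard the response body.
send_one() {
  curl -s -o /dev/null "$OLLAMA_URL/api/generate" -d "$BODY"
}

# Send $1 requests one after another; print elapsed whole seconds.
run_sequential() {
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$1" ]; do
    send_one
    i=$((i + 1))
  done
  echo $(( $(date +%s) - start ))
}

# Send $1 requests at once in the background; print elapsed whole seconds.
run_concurrent() {
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$1" ]; do
    send_one &
    i=$((i + 1))
  done
  wait   # block until every background request has finished
  echo $(( $(date +%s) - start ))
}

# Usage:
#   echo "sequential: $(run_sequential 10)s"
#   echo "concurrent: $(run_concurrent 10)s"
```

If the two totals are about the same, the server is queuing requests and handling them one at a time. A noticeably shorter concurrent total suggests that several requests are being processed in parallel.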