RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Ollama — Installation to Mastery

10. Concurrent Requests

Chapter 10 of 20 · 20 min
KEY INSIGHT

Ollama queues requests by default. Increase parallelism only if you have sufficient GPU memory; otherwise, per-request latency rises with no real gain in throughput.

Ollama's REST API accepts multiple requests at once, but how many it processes at the same time depends on configuration. Understanding how it manages concurrency helps you design systems that scale.

Default Behavior

By default, Ollama processes requests to a given model one at a time: each request waits for the current generation to complete. You can see this by sending two requests at once:

time curl http://localhost:11434/api/generate -d '{"model":"llama3.2:1b","prompt":"Count to 100","stream":false}' &
time curl http://localhost:11434/api/generate -d '{"model":"llama3.2:1b","prompt":"Count to 100","stream":false}' &
wait

The second request waits for the first to finish, so the total wall-clock time is roughly double that of a single request.
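The comparison can be made repeatable with a small script. The sketch below is one way to do it, assuming a POSIX shell and the endpoint and model from the example above. The function names (send_request, time_sequential, time_concurrent) are hypothetical helpers, not part of Ollama.

```shell
# Hypothetical helper: send one non-streaming request and discard the body.
send_request() {
  curl -s -o /dev/null http://localhost:11434/api/generate \
    -d '{"model":"llama3.2:1b","prompt":"Count to 100","stream":false}'
}

# Run the command named in $2 a total of $1 times, one after another,
# and print the elapsed wall-clock time in whole seconds.
time_sequential() {
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$1" ]; do
    "$2"
    i=$(( $i + 1 ))
  done
  echo $(( $(date +%s) - $start ))
}

# Same measurement, but launch all $1 runs in the background and wait for them.
time_concurrent() {
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$1" ]; do
    "$2" &
    i=$(( $i + 1 ))
  done
  wait
  echo $(( $(date +%s) - $start ))
}

# Usage (requires a running server):
#   time_sequential 2 send_request
#   time_concurrent 2 send_request
```

If both calls print about the same number, the server is handling the requests one at a time. A clearly smaller concurrent figure suggests they ran in parallel.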

Concurrent Model Loading

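If you have enough memory, keep multiple models loaded to serve different types of requests. Running a model with an empty prompt loads it without starting an interactive session:

ollama run llama3.2:1b ""
ollama run codellama:7b ""

Now requests to different models process in parallel. Requests to the same model still queue. The OLLAMA_MAX_LOADED_MODELS environment variable caps how many models stay loaded at once (by default 3 per GPU, or 3 for CPU inference), provided they fit in memory.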
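To confirm which models are actually resident, Ollama provides the ollama ps command and the /api/ps endpoint. The sketch below pulls the model names out of an /api/ps response using only grep and cut. It assumes the compact JSON the server emits (no spaces after colons); list_loaded and is_loaded are hypothetical helper names, and a real JSON parser such as jq would be sturdier.

```shell
# Read an /api/ps JSON response on stdin and print one loaded model name per line.
# Assumes compact JSON, i.e. "name":"value" with no whitespace around the colon.
list_loaded() {
  grep -o '"name":"[^"]*"' | cut -d'"' -f4
}

# Succeed if the model named in $1 appears in the /api/ps response on stdin.
is_loaded() {
  list_loaded | grep -qxF "$1"
}

# Usage (requires a running server):
#   curl -s http://localhost:11434/api/ps | list_loaded
#   curl -s http://localhost:11434/api/ps | is_loaded codellama:7b && echo "ready"
```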

Queue Management

Ollama queues excess requests when all available model slots are busy. The queue is bounded by length: the OLLAMA_MAX_QUEUE environment variable sets how many requests may wait (512 by default). Once the queue is full, the server rejects further requests with an HTTP 503 response indicating that it is overloaded.

To allow a longer queue:

export OLLAMA_MAX_QUEUE=1024
ollama serve
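On the client side, an overloaded server is usually handled by retrying. Ollama's FAQ documents an HTTP 503 response when too many requests are pending, so one option is to retry on that status with a growing pause. The sketch below is a minimal version under that assumption; post_generate, retry_on_busy and the BASE_DELAY variable are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: POST one request, save the body to the file in $1,
# and print only the HTTP status code.
post_generate() {
  curl -s -o "$1" -w '%{http_code}' http://localhost:11434/api/generate \
    -d '{"model":"llama3.2:1b","prompt":"Hello","stream":false}'
}

# retry_on_busy MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND, which must print an HTTP status code. While the status is 503
# and attempts remain, sleeps and tries again, doubling the pause each time.
# Prints the final status. BASE_DELAY (seconds, default 1) sets the first pause.
retry_on_busy() {
  max=$1
  shift
  attempt=1
  delay=${BASE_DELAY:-1}
  while :; do
    status=$("$@")
    if [ "$status" != "503" ] || [ "$attempt" -ge "$max" ]; then
      echo "$status"
      return
    fi
    sleep "$delay"
    delay=$(( $delay * 2 ))
    attempt=$(( $attempt + 1 ))
  done
}

# Usage (requires a running server):
#   retry_on_busy 5 post_generate reply.json
```

Doubling the pause keeps a burst of clients from hammering a server that is already saturated; a fixed pause would also work for light loads.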

Parallel Request Configuration

The OLLAMA_NUM_PARALLEL environment variable controls how many requests can process simultaneously per model:

export OLLAMA_NUM_PARALLEL=4
ollama serve

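This allows up to 4 concurrent generations for the same model. Each parallel slot needs its own context, so memory use scales with OLLAMA_NUM_PARALLEL times the context length. The tradeoff is higher latency per request, because the concurrent generations share the same GPU compute.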
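A quick calculation helps decide whether a given setting will fit. The sketch below multiplies context length by the number of parallel slots, then estimates the key/value cache size from the standard transformer formula (2 tensors × layers × KV heads × head dimension × tokens × bytes per element). The function names are hypothetical, and the model dimensions in the usage lines are illustrative assumptions, not figures taken from any particular model card.

```shell
# total_context CONTEXT_LENGTH NUM_PARALLEL
# Tokens of context the server must allocate for one loaded model.
total_context() {
  echo $(( $1 * $2 ))
}

# kv_cache_mib LAYERS KV_HEADS HEAD_DIM TOKENS BYTES_PER_ELEMENT
# Rough key/value cache size in MiB (keys and values, hence the factor 2).
# Ignores model weights and runtime overhead.
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# Usage:
#   total_context 2048 4          # prints 8192
#   kv_cache_mib 16 8 64 8192 2   # prints 256 (assumed dimensions, 16-bit cache)
```

The result is only the cache; the model weights still have to fit alongside it.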

Load Testing

Use ab (Apache Bench) or wrk to test concurrency:

# Install Apache Bench (Debian/Ubuntu)
sudo apt install apache2-utils

# Test with 10 concurrent requests
ab -n 50 -c 10 -p request.json -T application/json http://localhost:11434/api/generate

Where request.json contains:

{"model":"llama3.2:1b","prompt":"Hello","stream":false}

Monitor system resources with nvidia-smi or top during the test to identify bottlenecks.
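A single ab run says little by itself; running it at several concurrency levels shows where throughput stops improving. The sketch below is one way to do that, assuming ab's usual report lines ("Requests per second:" and "Time per request:") and the request.json file from above. The names ab_rps, ab_mean_ms and sweep are hypothetical helpers.

```shell
# Read an ab report on stdin and print the requests-per-second figure.
ab_rps() {
  awk '/^Requests per second:/ { print $4 }'
}

# Read an ab report on stdin and print the mean time per request in ms
# (the first "Time per request" line, not the across-all-requests one).
ab_mean_ms() {
  awk '/^Time per request:/ { print $4; exit }'
}

# sweep LEVEL [LEVEL...]
# Run 50 requests at each concurrency level and print one summary line per level.
sweep() {
  for c in "$@"; do
    report=$(ab -n 50 -c "$c" -p request.json -T application/json \
      http://localhost:11434/api/generate 2>/dev/null)
    echo "c=$c rps=$(echo "$report" | ab_rps) mean_ms=$(echo "$report" | ab_mean_ms)"
  done
}

# Usage (requires a running server and ab installed):
#   sweep 1 2 4 8
# In a second terminal, log GPU load once per second:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader -l 1
```

If requests per second stays flat while mean time per request climbs with the concurrency level, the extra requests are waiting in the queue rather than running in parallel.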

EXERCISE

Load a model, then use curl in a loop to send 10 requests with stream:false, once one after another and once all at the same time (backgrounded with &). Measure the total time of each run and compare the two to understand queue behavior.
