What this does

Measures the time between sending an API request and receiving the first token of a model response. First-token latency is critical for interactive applications such as chatbots and coding assistants where perceived responsiveness drives user experience.

Steps

Confirm the API endpoint is responding. A quick health check prevents wasted debugging time.
```
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"Hi","stream":false}' | head -c 100
```
Expected output: JSON response with "response" field populated.

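Measure time to first byte with curl -w. The write-out variable %{time_starttransfer} reports the seconds until the first response byte arrives, which with "stream":true should closely track first-token latency. %{time_total} covers the entire generation and is printed here only for comparison.

curl -s -w "\nTTFB: %{time_starttransfer}s\nTimeTotal: %{time_total}s\n" -X POST http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{"model":"llama3.2:3b","prompt":"Explain quantum computing.","stream":true}' -o /dev/null

Expected output: two lines, TTFB: 0.XXs and TimeTotal: X.XXs. TTFB is the figure to record (both vary by model and hardware).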
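As a cross-check on curl's own timers, the first streamed line can be timestamped directly. The following is a minimal sketch, not a definitive method: time_to_first_line is a hypothetical helper name, and it assumes GNU date (for the %N nanosecond format), awk, and that the server streams newline-delimited output so the first line corresponds to the first token.

```shell
# Sketch only: time_to_first_line is a hypothetical helper, not a curl or Ollama feature.
# Assumes GNU date (the %N nanosecond format) and awk are available.
# Example call (needs a running server):
#   time_to_first_line curl -sN -X POST http://localhost:11434/api/generate \
#     -d '{"model":"llama3.2:3b","prompt":"Hi","stream":true}'
time_to_first_line() (
  start=$(date +%s.%N)
  # Run the given command and stamp the clock as soon as its first output line arrives.
  end=$("$@" 2>/dev/null | { IFS= read -r _first; date +%s.%N; })
  # Print the elapsed seconds with millisecond precision.
  awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
)
```

The function body runs in a subshell so its variables do not leak into the caller. The figure includes process start-up overhead, so it is best treated as an upper bound next to the curl write-out value.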

Run multiple trials and calculate the average. Single runs are noisy; averaging 5 runs gives a steadier estimate. Keep "stream":true and %{time_starttransfer} so each trial measures first-token latency rather than full generation time.

for i in {1..5}; do curl -s -w "%{time_starttransfer}\n" -X POST http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{"model":"llama3.2:3b","prompt":"Hello","stream":true}' -o /dev/null; done | awk '{sum+=$1; count++} END {print "Average:", sum/count "s"}'

Expected output: Average: 0.XXs.
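A mean alone can hide a slow first request. The sketch below drops the first timing as a warm-up run and reports the spread as well as the mean. summarize_trials is a hypothetical helper name, and it assumes one latency in seconds per input line, such as the per-trial output of a curl loop.

```shell
# Sketch only: summarize_trials is a hypothetical helper.
# Reads one latency (seconds) per line on stdin, skips the first line as a
# warm-up run, then prints the count, mean, minimum, and maximum.
summarize_trials() {
  awk '
    NR == 1 { next }                      # skip the warm-up run
    {
      sum += $1; n++
      if (n == 1 || $1 < min) min = $1    # track the fastest trial
      if (n == 1 || $1 > max) max = $1    # track the slowest trial
    }
    END {
      if (n == 0) { print "no trials after warm-up"; exit 1 }
      printf "n=%d mean=%.3fs min=%.3fs max=%.3fs\n", n, sum / n, min, max
    }'
}
```

For example, piping the four values 5.0, 0.2, 0.4, 0.3 (one per line) into summarize_trials prints n=3 mean=0.300s min=0.200s max=0.400s, because the 5.0 warm-up value is discarded.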

Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
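One possible way to keep that evidence is a small append-only log. This is a sketch under stated assumptions: record_run and its field names are hypothetical, and the caller supplies the version string (for example from ollama --version) and the observed output.

```shell
# Sketch only: record_run is a hypothetical helper that appends one record to a log file.
# Usage: record_run LOGFILE MODEL VERSION COMMAND OUTPUT
record_run() {
  {
    printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"   # UTC timestamp of the run
    printf 'model: %s\n' "$2"
    printf 'version: %s\n' "$3"
    printf 'command: %s\n' "$4"
    printf 'output: %s\n' "$5"
    printf '%s\n' '---'                                     # record separator
  } >> "$1"
}
```

A call might look like record_run latency.log "llama3.2:3b" "$(ollama --version)" "$cmd" "$result", where cmd and result are shell variables holding the exact command and what it printed.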

Verification

curl -s -w "TTFB: %{time_starttransfer}s\n" -X POST http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{"model":"llama3.2:3b","prompt":"Hi","stream":true}' -o /dev/null
# Expected: TTFB less than 2 seconds for a 3B model on modern hardware
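
The threshold comparison can be scripted so the verification step yields a pass or fail exit status. This is a minimal sketch: check_latency is a hypothetical helper name, and the 2-second default simply mirrors the expectation stated above.

```shell
# Sketch only: check_latency is a hypothetical helper.
# Usage: check_latency SECONDS [LIMIT]   (LIMIT defaults to 2)
check_latency() {
  limit="${2:-2}"
  if [ -z "$1" ]; then
    echo "FAIL: no measurement given"
    return 2
  fi
  # awk does the floating-point comparison; it exits 0 when SECONDS < LIMIT.
  if awk -v t="$1" -v l="$limit" 'BEGIN { exit !(t + 0 < l + 0) }'; then
    echo "PASS: ${1}s < ${limit}s"
  else
    echo "FAIL: ${1}s >= ${limit}s"
    return 1
  fi
}
```

For instance, check_latency 1.2 prints PASS: 1.2s < 2s and returns 0, while check_latency 3.5 2 prints FAIL: 3.5s >= 2s and returns 1.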

Common failures

connection refused - API server is not running; start with ollama serve and confirm port with ss -tlnp | grep 11434.
extremely high latency (10s+) - The model is being loaded into memory for the first time; run a warm-up request before benchmarking.
variable results across trials - Cold-start effects are normal; discard the first run and average the remaining trials.

How to benchmark first-token latency for interactive applications
