How to benchmark first-token latency for interactive applications
Prerequisite: Ollama, or another compatible API endpoint, running on localhost (default port 11434).
What this does
Measures the time between sending an API request and receiving the first token of a model response. First-token latency is critical for interactive applications such as chatbots and coding assistants where perceived responsiveness drives user experience.
Steps
Confirm the API endpoint is responding. A quick health check prevents wasted debugging time.
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"Hi","stream":false}' | head -c 100
Expected output: the start of a JSON response with the "response" field populated.
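Measure time to first byte with curl -w. The write-out variable %{time_starttransfer} reports TTFB; %{time_total} covers the entire response and is printed here only for comparison. With "stream":true, TTFB approximates first-token latency, assuming the server sends nothing before the first token is ready.
curl -s -w "\nTTFB: %{time_starttransfer}s\nTimeTotal: %{time_total}s\n" -X POST http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{"model":"llama3.2:3b","prompt":"Explain quantum computing.","stream":true}' -o /dev/null
Expected output: a TTFB line followed by a TimeTotal line, for example TimeTotal: ~1.2s (varies by model and hardware). TTFB is never larger than TimeTotal.
Run multiple trials and calculate the average. Single runs are noisy; averaging 5 runs gives a steadier estimate.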
for i in {1..5}; do curl -s -w "%{time_starttransfer}\n" -X POST http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{"model":"llama3.2:3b","prompt":"Hello","stream":true}' -o /dev/null; done | awk '{sum+=$1; count++} END {print "Average:", sum/count "s"}'
Expected output: Average: 0.XXs.
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
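The averaging and record-keeping steps above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the function names summarize_timings and record_run, the argument order, and the tab-separated log layout are invented for illustration and are not part of curl or Ollama.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and log format are assumptions, not an Ollama or curl API.

# Reads one timing in seconds per line on stdin.
# Prints sample count, mean, median, min, and max on a single line.
summarize_timings() {
  LC_ALL=C sort -n | LC_ALL=C awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) { print "no samples"; exit 1 }
      # Input is sorted, so the median is the middle value (or the mean of the two middle values).
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "n=%d mean=%.3f median=%.3f min=%.3f max=%.3f\n", NR, sum / NR, median, v[1], v[NR]
    }'
}

# Appends one tab-separated record to a log file:
# UTC timestamp, model name, runtime version, result summary, exact command.
record_run() {
  local log_file=$1 model=$2 version=$3 summary=$4 command=$5
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$model" "$version" "$summary" "$command" >> "$log_file"
}

# Example use (not executed here; timings.txt and ttfb-runs.tsv are placeholder file names):
#   summary=$(summarize_timings < timings.txt)
#   record_run ttfb-runs.tsv "llama3.2:3b" "$(ollama --version)" "$summary" "curl -s -w ..."
```

Reporting the median and the min/max next to the mean makes a single slow outlier visible instead of letting it silently skew the average.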
Verification
curl -s -w "TTFB: %{time_starttransfer}s\n" -X POST http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{"model":"llama3.2:3b","prompt":"Hi","stream":true}' -o /dev/null
# Expected: TTFB less than 2 seconds for a 3B model on modern hardware
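The verification above can also be turned into a pass/fail check, which may be handy in a script. The sketch below assumes two hypothetical helpers, measure_ttfb and ttfb_within_limit; the 2-second limit comes from the expectation stated above, and the JSON body is built by plain string interpolation, so it assumes the model name and prompt contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative assumptions.

# Prints curl's time_starttransfer (seconds) for one streaming request.
# Assumes model and prompt need no JSON escaping.
measure_ttfb() {
  local url=$1 model=$2 prompt=$3
  curl -s -o /dev/null -w '%{time_starttransfer}' -X POST "$url" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$model\",\"prompt\":\"$prompt\",\"stream\":true}"
}

# Returns 0 when the measured value is strictly below the limit, 1 otherwise.
# awk performs the comparison because the shell's test builtin is integer-only.
ttfb_within_limit() {
  local measured=$1 limit=$2
  awk -v m="$measured" -v l="$limit" 'BEGIN { exit !(m + 0 < l + 0) }'
}

# Example use (not executed here):
#   t=$(measure_ttfb http://localhost:11434/api/generate "llama3.2:3b" "Hi")
#   if ttfb_within_limit "$t" 2; then echo "PASS: ${t}s"; else echo "FAIL: ${t}s"; fi
```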
Common failures
- connection refused - API server is not running; start it with ollama serve and confirm the port with ss -tlnp | grep 11434.
- extremely high latency (10s+) - Model is loading into memory for the first time; run a warm-up request before benchmarking.
- variable results across trials - Cold-start effects are normal; discard the first run and average the remaining trials.
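The last two remedies, warming up the model and discarding the first trial, might look like the sketch below. The helper names warm_up and mean_excluding_first are assumptions for illustration, and the warm-up body assumes the model name needs no JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative assumptions.

# Sends one throwaway request so the model is loaded before timing starts.
warm_up() {
  local url=$1 model=$2
  curl -s -o /dev/null -X POST "$url" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$model\",\"prompt\":\"Hi\",\"stream\":false}"
}

# Reads one timing in seconds per line, drops the first (cold-start) sample,
# and prints the mean of the remaining samples. Fails if nothing remains.
mean_excluding_first() {
  tail -n +2 | LC_ALL=C awk '
    { sum += $1; n++ }
    END {
      if (n == 0) { print "need at least 2 samples" > "/dev/stderr"; exit 1 }
      printf "%.3f\n", sum / n
    }'
}

# Example use (not executed here; timings.txt is a placeholder file name):
#   warm_up http://localhost:11434/api/generate "llama3.2:3b"
#   mean_excluding_first < timings.txt
```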