How to benchmark model response time using the Ollama API
Prerequisites: Ollama running on localhost or a reachable host; curl and jq, or Python, installed.
What this does
Measures end-to-end latency, from sending the HTTP request to receiving the complete response, using command-line timing tools and the timing metadata in the API response. The result is a reproducible wall-clock benchmark and a tokens-per-second metric for any model on the current hardware.
Steps
- Send a request and measure total wall-clock time. This captures the end-to-end request duration using the time command.

    time curl -s http://localhost:11434/api/generate -d '{ "model": "llama3:q4_K_M", "prompt": "Explain quantum entanglement in one sentence.", "stream": false }' | jq .

  Expected output: the JSON response body, followed by the real line from time showing total elapsed seconds.
- Parse timing fields from the API response. The Ollama API returns eval_count (tokens generated) and eval_duration (nanoseconds spent generating).

    curl -s http://localhost:11434/api/generate -d '{ "model": "llama3:q4_K_M", "prompt": "Explain quantum entanglement in one sentence.", "stream": false }' | jq '{eval_count, eval_duration}'

  Expected output: a JSON object with numeric values for tokens and duration.
- Calculate tokens per second from these fields. This divides eval_count by eval_duration after converting nanoseconds to seconds.

    curl -s http://localhost:11434/api/generate -d '{"model":"llama3:q4_K_M","prompt":"Count from one to ten.","stream":false}' | jq '.eval_count / (.eval_duration / 1e9)'

  Expected output: a floating-point number representing tokens per second.
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
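For setups without jq, the same measurement can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names `tokens_per_second` and `timed_generate` are hypothetical, and the URL assumes the default local endpoint used in the curl commands above. The returned dictionary doubles as a run record that can be saved alongside the command and version details.

```python
import json
import time
import urllib.request

# Assumption: default local Ollama endpoint, as in the curl commands above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(body: dict) -> float:
    """Generation throughput from eval_count and eval_duration (nanoseconds)."""
    seconds = body["eval_duration"] / 1e9  # nanoseconds -> seconds
    if seconds <= 0:
        raise ValueError("eval_duration must be positive")
    return body["eval_count"] / seconds


def timed_generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming request and return a small run record.

    Hypothetical helper: wall_seconds is measured on the client side and
    therefore includes HTTP overhead, unlike eval_duration.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as resp:
        body = json.loads(resp.read())
    wall_seconds = time.perf_counter() - start
    return {
        "model": model,
        "prompt": prompt,
        "wall_seconds": wall_seconds,
        "eval_count": body["eval_count"],
        "eval_duration": body["eval_duration"],
        "tokens_per_second": tokens_per_second(body),
    }


# Example call (requires a running Ollama server):
# record = timed_generate("llama3:q4_K_M", "Count from one to ten.")
# print(json.dumps(record, indent=2))
```

Client-side wall time and the server-reported eval_duration measure different things, so it is worth keeping both in the record rather than deriving one from the other.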
Verification
time curl -s http://localhost:11434/api/generate -d '{"model":"llama3:q4_K_M","prompt":"Count from one to five.","stream":false}' | jq .
# Expected: JSON response with "response" field and wall-clock time displayed
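The check above can also be automated. The following sketch takes the raw response text (for example, captured from curl) and reports what is missing. The function name `check_response` is hypothetical, and treating a top-level "error" key as a failure is an assumption about how the server reports problems.

```python
import json
from typing import List

# Fields this guide relies on for the benchmark.
REQUIRED_FIELDS = ("response", "eval_count", "eval_duration")


def check_response(raw: str) -> List[str]:
    """Return a list of problems found in a raw response body (empty if none)."""
    try:
        body = json.loads(raw)
    except json.JSONDecodeError as exc:
        # An empty body also lands here.
        return [f"not valid JSON: {exc}"]
    if not isinstance(body, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    # Assumption: failures are reported as {"error": "..."}.
    if "error" in body:
        problems.append(f"server reported an error: {body['error']}")
    for field in REQUIRED_FIELDS:
        if field not in body:
            problems.append(f"missing field: {field}")
    return problems
```

An empty list means the response contains everything the benchmark steps need.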
Common failures
- connection refused - Ollama service is not running or URL is wrong; start with ollama serve.
- empty response body - Model name is incorrect or request format is invalid; check JSON payload keys.
- jq command not found - Install jq via package manager or parse JSON with Python instead.
- stream mode missing timing fields - Set "stream": false for benchmark runs to get eval_count and eval_duration.
- high variance across runs - Cold-start effects and system load contribute to outliers; run three iterations and discard the first.
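Two of these failures lend themselves to a small Python sketch. The first helper discards warm-up runs before summarizing throughput, following the "run three iterations and discard the first" advice. The second assumes that a streamed reply arrives as newline-delimited JSON whose last chunk is marked "done": true and carries the timing fields, so it can recover them if streaming was left on. Both function names are hypothetical.

```python
import json
import statistics
from typing import Dict, List


def summarize(rates: List[float], warmup: int = 1) -> Dict[str, float]:
    """Summarize tokens-per-second values after dropping the first `warmup` runs."""
    kept = rates[warmup:]
    if not kept:
        raise ValueError("no runs left after discarding warm-up runs")
    return {
        "runs_kept": len(kept),
        "mean": statistics.mean(kept),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(kept) if len(kept) > 1 else 0.0,
        "min": min(kept),
        "max": max(kept),
    }


def final_stream_chunk(ndjson_text: str) -> dict:
    """Return the last chunk flagged done in a streamed (NDJSON) reply.

    Assumption: each non-empty line is one JSON object and the final one
    has "done": true.
    """
    chunks = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    for chunk in reversed(chunks):
        if chunk.get("done"):
            return chunk
    raise ValueError("no chunk with done=true found")
```

Reporting the spread (stdev, min, max) next to the mean makes it easier to tell whether a difference between two models is larger than run-to-run noise.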