How to benchmark token generation speed in tokens per second
Prerequisites: Ollama or vLLM running with a model loaded, plus curl and jq. The commands below target the Ollama API.
What this does
Extracts token-generation telemetry from the Ollama API response fields and computes a tokens-per-second figure for sustained throughput. The result is a repeatable benchmark metric for comparing quantization levels, hardware setups, or runtime parameters.
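As a quick check on the arithmetic, the conversion from nanoseconds can be sketched as a small shell function. The function name tps and the sample values (92 tokens over 5,000,000,000 ns) are illustrative assumptions, not measurements from a real run.

```shell
# Hypothetical helper: tokens per second from eval_count and eval_duration.
# $1 = eval_count (output tokens), $2 = eval_duration in nanoseconds.
tps() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    # Guard against a missing or zero duration before dividing.
    if (ns <= 0) { print "eval_duration must be positive" > "/dev/stderr"; exit 1 }
    # Convert nanoseconds to seconds, then divide tokens by seconds.
    printf "%.2f\n", count / (ns / 1e9)
  }'
}

# Illustrative values only: 92 tokens generated in 5 seconds.
tps 92 5000000000   # prints 18.40
```

Doing the division in awk keeps the example free of extra dependencies; the jq expression used in the steps performs the same calculation directly on the JSON response.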
Steps
1. Retrieve evaluation metrics from a non-streaming API call. The endpoint returns eval_count (output tokens) and eval_duration (time in nanoseconds) when streaming is disabled.
curl -s http://localhost:11434/api/generate -d '{ "model": "llama3:q4_K_M", "prompt": "Describe the water cycle in three sentences.", "stream": false }' > /tmp/response.json
Expected output: No terminal output; JSON saved to file.
2. Extract fields and compute tokens per second. This divides the token count by the duration, converted from nanoseconds to seconds.
jq '.eval_count / (.eval_duration / 1e9)' /tmp/response.json
Expected output: A decimal number representing tokens per second, for example 18.4.
3. Display raw fields for manual verification. This shows eval_count and eval_duration so the calculation can be checked by hand before it is trusted.
jq '{tokens: .eval_count, duration_sec: .eval_duration / 1e9}' /tmp/response.json
Expected output: JSON with token count and duration in seconds.
4. Run a batch to get stable aggregate figures. Averaging multiple runs smooths cold-start variance.
for run in 1 2 3; do curl -s http://localhost:11434/api/generate -d '{"model":"llama3:q4_K_M","prompt":"Name three colors.","stream":false}' | jq -r '.eval_count / (.eval_duration / 1e9)'; done
Expected output: Three numbers; average the last two for a stable result.
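Averaging the runs after the first one can be automated rather than done by hand. The following is a minimal sketch, assuming a hypothetical helper named mean_after_warmup that reads one tokens-per-second value per line on standard input, treats the first value as the warm-up run, and prints the mean of the rest.

```shell
# Hypothetical helper: mean of all values except the first (warm-up) one.
# Reads one number per line from stdin; blank lines are ignored.
mean_after_warmup() {
  awk 'NF {
         seen++
         # Skip the first non-blank value: it is treated as the warm-up run.
         if (seen > 1) { sum += $1; n++ }
       }
       END {
         if (n == 0) { print "need at least two runs" > "/dev/stderr"; exit 1 }
         printf "%.2f\n", sum / n
       }'
}

# Usage sketch: pipe the batch loop from the steps into the helper, e.g.
#   for run in 1 2 3; do curl -s ... | jq -r '...'; done | mean_after_warmup
```

With three runs this reproduces the "average the last two" rule; with more runs it still discards only the first value, which keeps the cold-start measurement out of the aggregate.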
Verification
curl -s http://localhost:11434/api/generate -d '{"model":"llama3:q4_K_M","prompt":"Answer with exactly the word hello.","stream":false}' | jq '{tps: .eval_count / (.eval_duration / 1e9)}'
# Expected: JSON with "tps" key and a positive floating-point value
Common failures
- eval_count or eval_duration missing - Streaming mode was used; set "stream": false.
- zero tokens reported - Prompt may have been rejected by safety filters; check the response field for error messages.
- extremely low TPS value - System may be under memory pressure or running on limited CPU cores; monitor resources during the run.
- inconsistent results across runs - First run after model load is typically slower; always warm up and discard the first result.
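The first two failure modes can be checked mechanically before any number is computed. This is a rough sketch, assuming a hypothetical helper named diagnose_response that takes the path of a saved response (such as /tmp/response.json) and requires jq. The check for a top-level error key is an assumption about how the server reports failures, so confirm it against the API documentation for your version.

```shell
# Hypothetical helper: classify a saved /api/generate response file.
# $1 = path to the JSON file. Prints "ok" or a short diagnosis.
diagnose_response() {
  # Assumption: failed requests return an object with an "error" key.
  if jq -e 'has("error")' "$1" > /dev/null 2>&1; then
    echo "server returned an error: $(jq -r '.error' "$1")"
    return 1
  fi
  # Missing telemetry usually means the call was made in streaming mode.
  if ! jq -e 'has("eval_count") and has("eval_duration")' "$1" > /dev/null 2>&1; then
    echo "missing eval fields: check that \"stream\": false was set"
    return 1
  fi
  # Zero output tokens: nothing was generated, so there is no rate to report.
  if jq -e '.eval_count == 0' "$1" > /dev/null 2>&1; then
    echo "zero tokens reported: inspect the response field"
    return 1
  fi
  echo "ok"
}

# Usage sketch: diagnose_response /tmp/response.json
```

The helper relies on jq -e, which sets the exit status from the last value the filter produces, so each check can be used directly in an if statement.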