What this does

Extracts token-generation telemetry from the Ollama API response fields and computes a sustained-throughput figure in tokens per second. The result is a repeatable benchmark metric for comparing quantization levels, hardware setups, or runtime parameters.

Steps

Retrieve evaluation metrics from a non-streaming API call. With streaming disabled, the endpoint returns a single JSON object that includes eval_count (number of output tokens) and eval_duration (generation time in nanoseconds).
```
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3:q4_K_M",
  "prompt": "Describe the water cycle in three sentences.",
  "stream": false
}' > /tmp/response.json
```
Expected output: No terminal output; JSON saved to file.
Extract fields and compute tokens per second. The jq expression divides the token count by the duration, converted from nanoseconds to seconds.
```
jq '.eval_count / (.eval_duration / 1e9)' /tmp/response.json
```
Expected output: A decimal number representing tokens per second, for example 18.4.
Display raw fields for manual verification. Inspect eval_count and eval_duration directly before trusting the calculation.
```
jq '{tokens: .eval_count, duration_sec: .eval_duration / 1e9}' /tmp/response.json
```
Expected output: JSON with token count and duration in seconds.
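The arithmetic behind the jq expression can be sketched as a small shell helper, which is handy for checking a result by hand. The function name tps and the sample values are illustrative assumptions, not part of the Ollama API.

```shell
# Hypothetical helper: tokens per second from eval_count and eval_duration.
# $1 = eval_count (tokens), $2 = eval_duration (nanoseconds)
tps() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.2f\n", n / (ns / 1e9) }'
}

# Made-up example: 92 tokens generated in 5 seconds (5000000000 ns)
tps 92 5000000000   # prints 18.40
```

Feeding it the two raw values shown by the previous step should reproduce the jq result, rounded to two decimals.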

Run a batch to get stable aggregate figures. The loop prints one tokens-per-second figure per run; averaging across runs smooths out cold-start variance.

```
for run in 1 2 3; do
  curl -s http://localhost:11434/api/generate -d '{"model":"llama3:q4_K_M","prompt":"Name three colors.","stream":false}' \
    | jq -r '.eval_count / (.eval_duration / 1e9)'
done
```

Expected output: Three numbers; average the last two for a stable result.
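Averaging by hand gets tedious with more runs. A minimal sketch of one way to automate it, assuming one figure per line on standard input; the function name avg_skip_first and the sample figures are illustrative.

```shell
# Hypothetical helper: average every number on stdin except the first,
# which is treated as the warm-up run and ignored.
avg_skip_first() {
  awk 'NR > 1 { sum += $1; n++ }
       END { if (n > 0) printf "%.2f\n", sum / n; else exit 1 }'
}

# Made-up figures: the first (cold) run is skipped
printf '12.1\n18.2\n18.6\n' | avg_skip_first   # prints 18.40
```

Piping the output of the batch loop into avg_skip_first should give the same figure as averaging the last two numbers manually. With fewer than two lines of input it prints nothing and returns a non-zero status.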

Verification

```
curl -s http://localhost:11434/api/generate -d '{"model":"llama3:q4_K_M","prompt":"Answer with exactly the word hello.","stream":false}' | jq '{tps: .eval_count / (.eval_duration / 1e9)}'
# Expected: JSON with "tps" key and a positive floating-point value
```

Common failures

eval_count or eval_duration missing - Streaming mode was used. In a streamed response these fields appear only in the final object (the one with "done": true); set "stream": false to get them in a single response.
zero tokens reported - The prompt may have been rejected or the request may have failed; check the response field for the generated text and the error field for error messages.
extremely low TPS value - The system may be under memory pressure or running on a limited number of CPU cores; monitor resources during the run.
inconsistent results across runs - The first run after a model load is typically slower; always warm up and discard the first result.
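The first two failures can be caught before dividing. The sketch below assumes the two values have already been extracted as strings; the function name safe_tps is hypothetical, and the check for the literal null reflects what jq -r prints for a missing field.

```shell
# Hypothetical guard: refuse to compute when a metric is missing or zero.
# $1 = eval_count, $2 = eval_duration in nanoseconds
safe_tps() {
  count=$1
  duration_ns=$2
  case "$count" in
    ''|null|0) echo "eval_count missing or zero" >&2; return 1 ;;
  esac
  case "$duration_ns" in
    ''|null|0) echo "eval_duration missing or zero" >&2; return 1 ;;
  esac
  awk -v n="$count" -v ns="$duration_ns" 'BEGIN { printf "%.2f\n", n / (ns / 1e9) }'
}

# Possible usage against the saved response (requires jq and the file):
# safe_tps "$(jq -r '.eval_count' /tmp/response.json)" \
#          "$(jq -r '.eval_duration' /tmp/response.json)"
```

On a bad response the function prints a message to standard error and returns a non-zero status, so a batch script can stop instead of recording a meaningless number.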

How to benchmark token generation speed in tokens per second
