HOW-TO · INF

How to benchmark token generation speed in tokens per second

Intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

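Ollama running with a model loaded, plus curl and jq installed. The commands below target the Ollama API; vLLM exposes a different API, so they do not apply to it unchanged.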

What this does

Reads the token-generation telemetry that the Ollama API returns with each response and computes a tokens-per-second figure: eval_count / (eval_duration / 1e9). The result is a repeatable benchmark metric for comparing quantization levels, hardware setups, or runtime parameters.
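The arithmetic can be checked by hand before any server is involved. As a minimal sketch, the helper below (the name tokens_per_second is hypothetical, not part of Ollama) divides a token count by a duration converted from nanoseconds to seconds, using awk for the floating-point math:

```shell
# Hypothetical helper: convert a token count and a duration in nanoseconds
# into tokens per second, rounded to two decimals.
tokens_per_second() {
  # $1 = eval_count (tokens), $2 = eval_duration (nanoseconds)
  awk -v count="$1" -v ns="$2" \
    'BEGIN { printf "%.2f\n", count / (ns / 1000000000) }'
}
```

For example, tokens_per_second 92 5000000000 prints 18.40, because 92 tokens generated in 5 seconds is 18.4 tokens per second.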

Steps

  1. Retrieve evaluation metrics from a non-streaming API call. With streaming disabled, the endpoint returns a single JSON object that includes eval_count (output tokens) and eval_duration (generation time in nanoseconds).

    curl -s http://localhost:11434/api/generate -d '{
      "model": "llama3:q4_K_M",
      "prompt": "Describe the water cycle in three sentences.",
      "stream": false
    }' > /tmp/response.json
    

    Expected output: No terminal output; JSON saved to file.

  2. Extract fields and compute tokens per second. Divides tokens by duration converted from nanoseconds to seconds.

    jq '.eval_count / (.eval_duration / 1e9)' /tmp/response.json
    

    Expected output: A decimal number representing tokens per second, for example 18.4.

  3. Display raw fields for manual verification. Shows eval_count and eval_duration before trusting the calculation.

    jq '{tokens: .eval_count, duration_sec: .eval_duration / 1e9}' /tmp/response.json
    

    Expected output: JSON with token count and duration in seconds.

  4. Run a batch to get stable aggregate figures. The loop prints one tokens-per-second figure per run; averaging them smooths cold-start variance.

    for run in 1 2 3; do curl -s http://localhost:11434/api/generate -d '{"model":"llama3:q4_K_M","prompt":"Name three colors.","stream":false}' | jq -r '.eval_count / (.eval_duration / 1e9)'; done
    

    Expected output: Three numbers; average the last two for a stable result.
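The averaging in step 4 can be automated rather than done by hand. The following is a sketch under the assumption that each run prints exactly one number per line, as the loop above does; the function name average_after_warmup is hypothetical. It discards the first line as the warm-up run and prints the mean of the rest:

```shell
# Hypothetical helper: read one tokens-per-second figure per line on stdin,
# skip the first (warm-up) line, and print the mean of the remaining lines.
average_after_warmup() {
  awk 'NR > 1 { sum += $1; n++ }
       END {
         if (n == 0) {
           print "need at least two runs" > "/dev/stderr"
           exit 1
         }
         printf "%.2f\n", sum / n
       }'
}
```

Piping the loop from step 4 into average_after_warmup yields a single figure instead of three. With the inputs 10, 20, and 30 it prints 25.00, the mean of the last two values.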

Verification

curl -s http://localhost:11434/api/generate -d '{"model":"llama3:q4_K_M","prompt":"Answer with exactly the word hello.","stream":false}' | jq '{tps: .eval_count / (.eval_duration / 1e9)}'
# Expected: JSON with "tps" key and a positive floating-point value

Common failures

  • eval_count or eval_duration missing - Streaming mode was used, so the metrics appear only in the final chunk of the stream; set "stream": false to receive them in a single object.
  • zero tokens reported - The model may have refused the prompt or returned an empty completion; inspect the response field for the generated text, and look for an error field if the request itself failed.
  • extremely low TPS value - System may be under memory pressure or running on limited CPU cores; monitor resources during the run.
  • inconsistent results across runs - First run after model load is typically slower; always warm up and discard the first result.
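The first two failures can be caught before the division runs. As a rough sketch, the hypothetical helper safe_tps below takes the two values as printed by jq -r '.eval_count, .eval_duration' (jq -r prints the literal string null for a missing field) and refuses to compute a figure when either value is missing or zero:

```shell
# Hypothetical helper: validate eval_count and eval_duration before dividing.
# $1 = eval_count, $2 = eval_duration in nanoseconds.
safe_tps() {
  count="$1"
  ns="$2"
  # Missing fields arrive as "null" (from jq -r) or as empty strings.
  if [ -z "$count" ] || [ -z "$ns" ] || [ "$count" = "null" ] || [ "$ns" = "null" ]; then
    echo "eval_count or eval_duration missing" >&2
    return 1
  fi
  # A zero duration would make the division fail.
  if [ "$ns" = "0" ]; then
    echo "eval_duration is zero" >&2
    return 1
  fi
  # Zero tokens means there is no throughput to report.
  if [ "$count" = "0" ]; then
    echo "zero tokens reported" >&2
    return 1
  fi
  awk -v count="$count" -v ns="$ns" \
    'BEGIN { printf "%.2f\n", count / (ns / 1000000000) }'
}
```

One way to call it against the file saved in step 1 is safe_tps $(jq -r '.eval_count, .eval_duration' /tmp/response.json), leaving the command substitution unquoted so the two values become separate arguments.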
