HOW-TO · OPS

How to create Grafana panels for vLLM throughput monitoring

intermediate · 25 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Grafana with a Prometheus data source; vLLM running with its /metrics endpoint scraped by Prometheus

What this does

This guide builds Grafana visualization panels for monitoring vLLM inference server throughput: tokens per second, requests per second, time-to-first-token (TTFT), and generation throughput per model. vLLM exposes a rich set of Prometheus metrics at its /metrics endpoint, including iteration-level scheduling statistics and KV-cache usage. The resulting dashboard helps operators tune batch sizes, identify underutilized GPUs, and detect performance regressions after model updates.

Steps

  1. Confirm vLLM metrics are scraped. Query Prometheus for the vLLM-specific metric prefix:

    curl -s "http://localhost:9090/api/v1/label/__name__/values" | jq '.data[]' | grep vllm
    

    Expected output: lines like vllm:num_requests_running, vllm:time_to_first_token_seconds_sum.

  2. In Grafana, create a new dashboard and add a time-series panel for token throughput. Set the query to:

    rate(vllm:generation_tokens_total[1m])
    

    Label the panel "Generation Tokens/sec" and type tokens/sec into the unit field as a custom unit.

  3. Add a second time-series panel for input (prefill) throughput:

    rate(vllm:prompt_tokens_total[1m])
    
  4. Add a stat panel for current running requests:

    vllm:num_requests_running
    

    Title it "Active Requests" and set thresholds relative to N, the value the server was started with for --max-num-seqs: green below N/2, yellow from N/2 up to N, red at N and above.

  5. Add a gauge panel for GPU KV-cache usage:

    vllm:gpu_cache_usage_perc
    

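    Set the unit to Percent (0.0-1.0) with min 0 and max 1. vLLM reports this metric as a fraction, where 1 means 100 percent usage. The gauge shows memory pressure before requests start to queue.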

  6. Add a time-series panel for time-to-first-token (TTFT):

    rate(vllm:time_to_first_token_seconds_sum[1m]) / rate(vllm:time_to_first_token_seconds_count[1m])
    

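    Title it "Avg TTFT (seconds)" and set the unit to seconds. TTFT is the latency a user perceives before streaming begins. This query yields the mean over the window, so it can hide tail latency.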

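  7. Add a panel for request throughput by finish reason:

    sum by (finished_reason) (rate(vllm:request_success_total[1m]))
    

    Display as a stacked bar chart. vllm:request_success_total counts only requests that finished, labeled by finished_reason (for example stop, length, or abort), so the panel shows how requests ended rather than successes versus failures. If your vLLM version uses a different label name, check the labels on the metric in Prometheus and adjust the query.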

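  8. Set the dashboard refresh interval to 5 seconds for near-real-time monitoring of inference load. Panels cannot show data fresher than the Prometheus scrape interval for the vLLM target, so shorten that interval as well if you need updates this often.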
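
The panels from steps 2 to 7 can also be generated as dashboard JSON instead of being clicked together. The following is a minimal sketch, not a definitive implementation: the helper names (panel, build_dashboard), the titles "Prompt Tokens/sec", "KV-Cache Usage", and "Requests by Finish Reason", and the two-column layout are hypothetical choices. It also assumes the UID vllm-throughput used in the verification command, a default Prometheus data source, and that vllm:request_success_total carries a finished_reason label in your vLLM version.

```python
import json


def panel(title, panel_type, expr, x, y, unit=None):
    """Return one panel with a single Prometheus query, half the grid wide."""
    defaults = {}
    if unit is not None:
        defaults["unit"] = unit
    return {
        "title": title,
        "type": panel_type,
        "gridPos": {"x": x, "y": y, "w": 12, "h": 8},  # Grafana grid has 24 columns
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {"defaults": defaults, "overrides": []},
    }


def build_dashboard(max_num_seqs):
    """Assemble the panels from steps 2 to 7 into a dashboard API payload."""
    # Step 4: green below N/2, yellow from N/2, red from N (N = max_num_seqs).
    active = panel("Active Requests", "stat", "vllm:num_requests_running", 0, 8)
    active["fieldConfig"]["defaults"]["thresholds"] = {
        "mode": "absolute",
        "steps": [
            {"color": "green", "value": None},  # base step, no lower bound
            {"color": "yellow", "value": max_num_seqs / 2},
            {"color": "red", "value": max_num_seqs},
        ],
    }
    # Step 5: the metric is a 0-1 fraction, so use the 0.0-1.0 percent unit.
    cache = panel("KV-Cache Usage", "gauge", "vllm:gpu_cache_usage_perc", 12, 8,
                  unit="percentunit")
    cache["fieldConfig"]["defaults"].update({"min": 0, "max": 1})

    ttft_expr = (
        "rate(vllm:time_to_first_token_seconds_sum[1m]) / "
        "rate(vllm:time_to_first_token_seconds_count[1m])"
    )
    panels = [
        panel("Generation Tokens/sec", "timeseries",
              "rate(vllm:generation_tokens_total[1m])", 0, 0),
        panel("Prompt Tokens/sec", "timeseries",
              "rate(vllm:prompt_tokens_total[1m])", 12, 0),
        active,
        cache,
        panel("Avg TTFT (seconds)", "timeseries", ttft_expr, 0, 16, unit="s"),
        # Assumption: the label is named finished_reason in your vLLM version.
        panel("Requests by Finish Reason", "timeseries",
              "sum by (finished_reason) (rate(vllm:request_success_total[1m]))",
              12, 16),
    ]
    dashboard = {
        "id": None,
        "uid": "vllm-throughput",
        "title": "vLLM Throughput",
        "refresh": "5s",  # step 8
        "panels": panels,
    }
    return {"dashboard": dashboard, "overwrite": True}


if __name__ == "__main__":
    # 256 is only an example; pass the value your server uses for --max-num-seqs.
    print(json.dumps(build_dashboard(max_num_seqs=256), indent=2))
```

The printed JSON can be sent to Grafana's POST /api/dashboards/db endpoint with the same bearer token used in the verification step. Unit and stacking options for the throughput panels are left at their defaults here and can be adjusted in the panel editor.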

Verification

curl -s http://localhost:3000/api/dashboards/uid/vllm-throughput -H "Authorization: Bearer <token>" | jq '.dashboard.panels | length'

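Expected output: an integer >= 6, one panel for each of steps 2 to 7. The command assumes the dashboard was saved with the UID vllm-throughput; substitute your own UID and API token.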
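
A panel count alone does not show which panel is missing. As a rough sketch, the response from the same API call can be checked against the panel titles this guide names. The function name check_dashboard is hypothetical, and the check assumes panels sit at the top level of the dashboard rather than inside collapsed rows.

```python
# Titles given explicitly in steps 2, 4, and 6 of this guide.
NAMED_TITLES = ("Generation Tokens/sec", "Active Requests", "Avg TTFT (seconds)")


def check_dashboard(response, min_panels, required_titles=NAMED_TITLES):
    """Return a list of problems found in a parsed dashboard API response."""
    panels = response.get("dashboard", {}).get("panels", [])
    titles = {p.get("title") for p in panels}
    problems = []
    if len(panels) < min_panels:
        problems.append(
            f"expected at least {min_panels} panels, found {len(panels)}"
        )
    for title in required_titles:
        if title not in titles:
            problems.append(f"missing panel: {title}")
    return problems


if __name__ == "__main__":
    # Stand-in for the JSON returned by /api/dashboards/uid/vllm-throughput.
    sample = {"dashboard": {"panels": [{"title": "Active Requests"}]}}
    # Steps 2 to 7 add six panels in total.
    for problem in check_dashboard(sample, min_panels=6):
        print(problem)
```

An empty list means the dashboard has enough panels and all named titles are present.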

Common failures

  • No vLLM metrics in Prometheus: vLLM serves /metrics on the same port as its API (8000 by default), not on a separate port. Make sure the Prometheus scrape target points at that port, then verify with curl http://vllm-server:8000/metrics | grep vllm.
  • TTFT panel returns NaN: the query divides by the rate of _count, which is zero when no request produced a first token within the 1m window, so the result is 0/0. Send a few requests and confirm that _count increases, or widen the rate window.
  • KV-cache gauge shows 0 or no data: a reading of 0 is normal while the server is idle. If the panel shows no data at all, the metric name may differ in your vLLM version (some newer releases use vllm:kv_cache_usage_perc). Check available metrics with the Prometheus label values API and adjust the metric name accordingly.
  • Dashboard panels overlap: use the Grafana panel editor to set a fixed panel width, 12 for half-width or 24 for full-width on the standard 24-column grid.
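
For the first and third failures above, it helps to see exactly which metric names a given vLLM build exposes. The sketch below parses Prometheus text exposition output (the body returned by /metrics) and lists the vLLM sample names. The function names are hypothetical, and the parsing is deliberately simple: it only reads the name at the start of each sample line.

```python
def vllm_metric_names(exposition_text, prefix="vllm:"):
    """Return sorted, unique sample names that start with the given prefix."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


def cache_usage_candidates(names):
    """Return the names that look like a KV-cache usage gauge."""
    return [n for n in names if "cache_usage" in n]


if __name__ == "__main__":
    # Illustrative sample; real output comes from the server's /metrics body.
    sample = "\n".join([
        "# TYPE vllm:num_requests_running gauge",
        'vllm:num_requests_running{model_name="m"} 2.0',
        'vllm:gpu_cache_usage_perc{model_name="m"} 0.25',
        'python_gc_objects_collected_total{generation="0"} 100.0',
    ])
    names = vllm_metric_names(sample)
    print(names)
    print(cache_usage_candidates(names))
```

To run it against a live server, fetch the text first, for example with urllib.request.urlopen on the /metrics URL, and pass the decoded body to vllm_metric_names.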
