How to create Grafana panels for vLLM throughput monitoring
Prerequisites: Grafana with a Prometheus data source, and a vLLM server whose metrics are scraped by Prometheus.
What this does
This guide builds Grafana visualization panels for monitoring vLLM inference server throughput: tokens per second, requests per second, time-to-first-token (TTFT), and generation throughput per model. vLLM exposes a rich set of Prometheus metrics at its /metrics endpoint, including iteration-level scheduling statistics and KV-cache usage. The resulting dashboard helps operators tune batch sizes, identify underutilized GPUs, and detect performance regressions after model updates.
Steps
Confirm vLLM metrics are scraped. Query Prometheus for the vLLM-specific metric prefix:
curl -s "http://localhost:9090/api/v1/label/__name__/values" | jq -r '.data[]' | grep vllm
Expected output: lines like vllm:num_requests_running and vllm:time_to_first_token_seconds_sum.
In Grafana, create a new dashboard and add a time-series panel for token throughput. Set the query to:
rate(vllm:generation_tokens_total[1m])
Label the panel "Generation Tokens/sec" and set the unit to tokens/sec.
Add a second time-series panel for input (prefill) throughput:
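rate(vllm:prompt_tokens_total[1m])
Add a stat panel for current running requests:
vllm:num_requests_running
Title it "Active Requests" with thresholds: green < model_max_num_seqs / 2, yellow < model_max_num_seqs, red >= model_max_num_seqs. Here model_max_num_seqs stands for the server's limit on concurrent sequences, for example the value passed to vLLM's --max-num-seqs option.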
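The threshold rule for the Active Requests panel maps onto Grafana threshold steps, where each step's value is the lower bound of its color band. The following is a minimal sketch, assuming the absolute-threshold JSON shape that Grafana stores under fieldConfig.defaults.thresholds. The function name active_request_thresholds is hypothetical, and max_num_seqs plays the role of model_max_num_seqs from the step above.

```python
def active_request_thresholds(max_num_seqs: int) -> dict:
    """Build Grafana-style threshold steps for the Active Requests panel.

    Bands: green below max/2, yellow from max/2 up to max, red at max and above.
    """
    if max_num_seqs <= 0:
        raise ValueError("max_num_seqs must be positive")
    return {
        "mode": "absolute",
        "steps": [
            # The base step has no lower bound, so its value is null in JSON.
            {"color": "green", "value": None},
            {"color": "yellow", "value": max_num_seqs / 2},
            {"color": "red", "value": max_num_seqs},
        ],
    }
```

For a server limited to 256 concurrent sequences, this puts the start of the yellow band at 128 and the start of the red band at 256.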
Add a gauge panel for GPU KV-cache usage:
vllm:gpu_cache_usage_perc
Set the unit to Percent (0.0-1.0), because vLLM reports this metric as a fraction where 1 means 100% usage. This identifies memory pressure before requests queue.
Add a time-series panel for time-to-first-token (TTFT):
rate(vllm:time_to_first_token_seconds_sum[1m]) / rate(vllm:time_to_first_token_seconds_count[1m])
Title it "Avg TTFT (seconds)". This is the user-perceived latency before streaming begins.
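The TTFT query divides two rates taken over the same window, so the window length cancels and the result is the increase in _sum divided by the increase in _count. The sketch below works through that arithmetic with made-up sample values. The function name avg_ttft is hypothetical, and counter resets are ignored for simplicity.

```python
import math


def avg_ttft(sum_start: float, sum_end: float,
             count_start: float, count_end: float) -> float:
    """Average TTFT over a window, from the histogram's _sum and _count samples.

    Mirrors rate(_sum) / rate(_count): both rates share the same window,
    so only the two deltas matter.
    """
    delta_count = count_end - count_start
    if delta_count == 0:
        # No request reached its first token in the window: PromQL evaluates
        # 0 / 0 as NaN, and this sketch returns the same value.
        return math.nan
    return (sum_end - sum_start) / delta_count


# Illustrative numbers only: 12 requests added 3.0 seconds of TTFT in total.
print(avg_ttft(10.0, 13.0, 100, 112))  # 0.25
```

An idle window yields NaN rather than 0, which is why the panel can look empty when no traffic is flowing.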
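Add a panel for request throughput by finish reason. vLLM labels this counter with finished_reason rather than status, and it counts only requests that finished:
sum by (finished_reason) (rate(vllm:request_success_total[1m]))
Display as a stacked bar chart comparing finished requests by reason, such as stop or length.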
Set the dashboard refresh interval to 5 seconds for near-real-time monitoring of inference load.
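The same panels can also be described as dashboard JSON instead of being clicked together in the UI. The sketch below is one possible way to do that, assuming a minimal subset of Grafana's dashboard model (title, type, targets, gridPos) and the payload shape accepted by POST /api/dashboards/db. The names PANELS and build_dashboard, the dashboard title, and the titles marked as hypothetical are illustrative. It covers the first five panels, and the request-throughput panel can be appended the same way. Units and thresholds are left to the panel editor.

```python
import json

# (title, Grafana panel type, PromQL expression), taken from the steps above.
PANELS = [
    ("Generation Tokens/sec", "timeseries",
     "rate(vllm:generation_tokens_total[1m])"),
    # Hypothetical title: the guide does not name this panel.
    ("Prompt Tokens/sec", "timeseries",
     "rate(vllm:prompt_tokens_total[1m])"),
    ("Active Requests", "stat", "vllm:num_requests_running"),
    # Hypothetical title: the guide does not name this panel.
    ("GPU KV-Cache Usage", "gauge", "vllm:gpu_cache_usage_perc"),
    ("Avg TTFT (seconds)", "timeseries",
     "rate(vllm:time_to_first_token_seconds_sum[1m])"
     " / rate(vllm:time_to_first_token_seconds_count[1m])"),
]


def build_dashboard(panels, uid="vllm-throughput", refresh="5s"):
    """Return a payload for Grafana's dashboard-save API (not sent here)."""
    dashboard_panels = []
    for index, (title, panel_type, expr) in enumerate(panels):
        dashboard_panels.append({
            "id": index + 1,
            "title": title,
            "type": panel_type,
            "targets": [{"refId": "A", "expr": expr}],
            # Two half-width panels per row on the 24-column grid.
            "gridPos": {"h": 8, "w": 12,
                        "x": (index % 2) * 12, "y": (index // 2) * 8},
        })
    return {
        "dashboard": {
            "id": None,  # None lets Grafana assign an internal id
            "uid": uid,
            "title": "vLLM Throughput",  # hypothetical dashboard title
            "refresh": refresh,
            "panels": dashboard_panels,
        },
        "overwrite": True,
    }


if __name__ == "__main__":
    print(json.dumps(build_dashboard(PANELS), indent=2))
```

Laying panels out with explicit gridPos values keeps each one at half width, which also avoids overlapping panels.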
Verification
curl -s http://localhost:3000/api/dashboards/uid/vllm-throughput -H "Authorization: Bearer <token>" | jq '.dashboard.panels | length'
Expected output: 6, one for each panel created in the steps above. This assumes the dashboard was saved with the UID vllm-throughput.
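A panel count alone does not show which panel is missing. As a rough sketch, the same API response can be checked by title. The helper names fetch_dashboard and missing_panels are hypothetical, the base URL and token are placeholders, and the expected titles are the three that the steps name explicitly.

```python
import json
import urllib.request

# Titles that the steps above assign explicitly.
EXPECTED_TITLES = ["Generation Tokens/sec", "Active Requests", "Avg TTFT (seconds)"]


def fetch_dashboard(base_url: str, uid: str, token: str) -> dict:
    """GET /api/dashboards/uid/<uid> and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{base_url}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def missing_panels(response: dict, expected_titles) -> list:
    """Return the expected titles that are absent from the dashboard."""
    panels = response.get("dashboard", {}).get("panels", [])
    present = {panel.get("title") for panel in panels}
    return [title for title in expected_titles if title not in present]


if __name__ == "__main__":
    # Placeholder URL and token: replace with real values before running.
    body = fetch_dashboard("http://localhost:3000", "vllm-throughput", "<token>")
    print(missing_panels(body, EXPECTED_TITLES))
```

An empty list means every expected title was found. Any title in the list points at a panel that was not saved or was renamed.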
Common failures
- No vLLM metrics in Prometheus: vLLM's OpenAI-compatible server exposes /metrics on the same port as the API (8000 by default), not on a separate port. Verify with curl http://vllm-server:8000/metrics | grep vllm, then check that the Prometheus scrape target uses that port.
- TTFT panel returns NaN: the query divides by the rate of _count, which is zero when no request produced a first token inside the 1m window. Send a few requests (uncached ones if prefix caching is enabled) and confirm that _count is increasing.
- KV-cache gauge shows 0: the metric name varies by vLLM version. Check available metrics with the Prometheus label values API and adjust the metric name accordingly.
- Dashboard panels overlap: set a fixed panel width in the Grafana panel editor, 12 for half-width or 24 for full-width on the standard 24-column grid.
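When a metric name differs between vLLM versions, it can help to list what the server actually exports. The sketch below parses the Prometheus text exposition format returned by /metrics and filters by name. The function names metric_names and find_candidates are hypothetical, and the parsing is deliberately simple: it skips comment lines and takes the name before any label block or value.

```python
def metric_names(exposition_text: str, prefix: str = "vllm:") -> set:
    """Collect metric names with the given prefix from /metrics output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the label block "{" or at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names


def find_candidates(names, fragment: str = "cache_usage") -> list:
    """Return the names containing a fragment, for locating renamed metrics."""
    return sorted(name for name in names if fragment in name)
```

Feeding it the output of curl http://vllm-server:8000/metrics and searching for cache_usage shows which KV-cache metric name the running version uses, so the gauge query can be adjusted to match.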