How to monitor LLM response latency percentiles with Prometheus histograms
Prerequisites: LLM API endpoint, Prometheus client library
What this does
This guide instruments LLM API calls with a Prometheus histogram to capture the distribution of response latency. An average hides outliers; a histogram lets you estimate the p50, p95, and p99 percentiles, which reveals the tail latency that degrades user experience. The histogram buckets are tuned for LLM workloads, where response times range from hundreds of milliseconds to tens of seconds.
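As a rough illustration of why the mean alone can mislead, the sketch below uses made-up latency samples (the numbers are illustrative, not measurements) in which most requests are fast and a minority are very slow. It uses only the standard library and is independent of Prometheus.

```python
import statistics

# Hypothetical latencies in seconds: 90 fast requests and 10 slow ones.
latencies = [1.0] * 90 + [30.0] * 10

mean = statistics.mean(latencies)

# quantiles(n=100) returns the 99 cut points p1..p99.
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean:.2f}s p50={p50:.2f}s p95={p95:.2f}s p99={p99:.2f}s")
```

The mean lands between the two groups and describes neither, while p50 reflects the typical request and p95/p99 reflect the slow tail.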
Steps
1. Define a histogram in a metrics.py module with LLM-appropriate buckets:

       from prometheus_client import Histogram

       llm_latency = Histogram(
           "llm_request_duration_seconds",
           "LLM API request duration in seconds",
           ["model", "endpoint"],
           buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 40.0, 60.0, 120.0],
       )

   The bucket boundaries capture fast cached responses (sub-second) through slow completions (60+ seconds).

2. Wrap each LLM call with timing code (a wrapper function, decorator, or context manager):

       import time

       from metrics import llm_latency

       def timed_llm_call(model, endpoint):
           start = time.monotonic()
           try:
               # call_llm_api stands for your own client function
               return call_llm_api(model, endpoint)
           finally:
               # Runs on success and on failure, so failed calls are timed too
               duration = time.monotonic() - start
               llm_latency.labels(model=model, endpoint=endpoint).observe(duration)

   For async callers, use the same time.monotonic() pattern:

       async def async_llm_call(model, endpoint):
           start = time.monotonic()
           try:
               return await async_call_llm_api(model, endpoint)
           finally:
               duration = time.monotonic() - start
               llm_latency.labels(model=model, endpoint=endpoint).observe(duration)

3. Expose the /metrics endpoint. With FastAPI, use prometheus-fastapi-instrumentator or a manual route returning generate_latest().

4. Confirm the histogram is scraped by Prometheus:

       curl 'http://localhost:9090/api/v1/query?query=llm_request_duration_seconds_count'

   Expected output: JSON with "status": "success" and a count greater than 0.

5. Query the p95 latency over the last 5 minutes in Grafana:

       histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))

   This returns one series per model and endpoint combination. To aggregate across them, wrap the rate in sum by (le) (...).

6. Create a Grafana panel with three queries for p50, p95, and p99:

       histogram_quantile(0.50, rate(llm_request_duration_seconds_bucket[5m]))
       histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))
       histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))

7. Set an alert for p99 exceeding a threshold:

       - alert: LLMP99LatencyHigh
         expr: histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m])) > 30
         for: 5m
         annotations:
           summary: "LLM p99 latency exceeds 30 seconds"
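The timing step above mentions a context manager as an alternative to a hand-written wrapper. A minimal sketch of one follows. The names observe_duration and ListObserver are hypothetical, and the observer argument is assumed to be any object with an observe(float) method, such as the labelled child returned by llm_latency.labels(...). A list-backed stand-in is used here so the example runs without Prometheus installed.

```python
import time
from contextlib import contextmanager


@contextmanager
def observe_duration(observer):
    """Time the enclosed block and pass the elapsed seconds to observer.observe()."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Runs on success and on exceptions, so failed calls are timed too.
        observer.observe(time.monotonic() - start)


class ListObserver:
    """Stand-in for a labelled histogram child; used only for this demo."""

    def __init__(self):
        self.samples = []

    def observe(self, value):
        self.samples.append(value)


recorder = ListObserver()
with observe_duration(recorder):
    time.sleep(0.01)  # placeholder for call_llm_api(model, endpoint)

print(f"recorded {len(recorder.samples)} sample(s)")
```

With the real metric, the call site would look like `with observe_duration(llm_latency.labels(model=model, endpoint=endpoint)):`. The Python client also provides a built-in equivalent, `Histogram.time()`, which can be used as a context manager or a decorator.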
Verification
curl -s http://localhost:8000/metrics | grep "llm_request_duration_seconds_bucket"
Expected output: one line per bucket and label combination, each with a cumulative count, e.g., llm_request_duration_seconds_bucket{endpoint="/v1/chat",le="0.5",model="gpt-4o"} 42.0 (the Python client prints label names in alphabetical order).
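Because bucket counts are cumulative, each count within one label combination should be greater than or equal to the count of the next smaller le bucket. The sketch below checks that property on captured /metrics text. The sample text is made up, the function name buckets_are_cumulative is hypothetical, and the regular expressions assume simple label values without commas or escaped quotes.

```python
import re
from collections import defaultdict

# Made-up sample of the text exposition format for one label combination.
SAMPLE = """\
llm_request_duration_seconds_bucket{endpoint="/v1/chat",le="0.5",model="gpt-4o"} 42.0
llm_request_duration_seconds_bucket{endpoint="/v1/chat",le="1.0",model="gpt-4o"} 57.0
llm_request_duration_seconds_bucket{endpoint="/v1/chat",le="+Inf",model="gpt-4o"} 60.0
"""

LINE = re.compile(r"^llm_request_duration_seconds_bucket\{(.*)\}\s+(\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def buckets_are_cumulative(text):
    """Return True if every series has non-decreasing counts as le grows."""
    series = defaultdict(list)
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip comments and other metrics
        labels = dict(LABEL.findall(match.group(1)))
        upper_bound = float(labels.pop("le"))  # "+Inf" parses to infinity
        key = tuple(sorted(labels.items()))
        series[key].append((upper_bound, float(match.group(2))))
    if not series:
        return False  # no bucket lines found at all
    for points in series.values():
        counts = [count for _, count in sorted(points)]
        if counts != sorted(counts):
            return False
    return True


print(buckets_are_cumulative(SAMPLE))
```

To run it against a live service, the output of the curl command above could be saved to a file and passed to the function in place of SAMPLE.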
Common failures
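- Histogram shows no data: confirm observe() is called after every LLM request. Add a log line above the observe() call during debugging.
- p99 query returns NaN: histogram_quantile() returns NaN when the buckets recorded no observations in the query window, for example for a model or endpoint that received no traffic in the last 5 minutes. Send some requests or widen the range. Separately, high percentiles computed from few samples are noisy, so wait for at least 100 observations before trusting p99.
- Bucket boundaries produce misleading percentiles: if p95 and p99 both fall in the top bucket, add larger buckets (e.g., 180.0, 300.0) to the list.
- Label cardinality explosion: avoid high-cardinality labels (like request ID or conversation ID) on histogram metrics. Every distinct label combination creates a full set of bucket series.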
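To see why the top bucket produces misleading percentiles, it helps to look at how a quantile is estimated from cumulative bucket counts. The following is a simplified sketch of the linear interpolation described in the Prometheus documentation for histogram_quantile, not the actual implementation; it ignores native histograms and other edge cases. The function name estimate_quantile and the bucket counts are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so the quantile is undefined
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: only the largest
                # finite boundary can be reported.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical counts: 8 of 100 requests took longer than 120 seconds.
coarse = [(60.0, 90), (120.0, 92), (math.inf, 100)]
# Same traffic with an extra 300-second bucket.
finer = [(60.0, 90), (120.0, 92), (300.0, 99), (math.inf, 100)]

print(estimate_quantile(0.95, coarse), estimate_quantile(0.99, coarse))
print(estimate_quantile(0.95, finer))
```

With the coarse buckets, both p95 and p99 come out as 120.0 because every slow request lands in the +Inf bucket. With the extra 300-second bucket, p95 becomes an interpolated value between 120 and 300 seconds.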