What this does

This guide instruments LLM API calls with a Prometheus histogram so you can track the distribution of response latency rather than only the average. From the histogram's buckets, Prometheus can estimate the p50, p95, and p99 latencies, which reveals the tail latency that degrades user experience. The buckets are tuned for LLM workloads, where response times range from hundreds of milliseconds to tens of seconds.

Steps

Define a histogram in a metrics.py module with LLM-appropriate buckets:

from prometheus_client import Histogram
llm_latency = Histogram(
    "llm_request_duration_seconds",
    "LLM API request duration in seconds",
    ["model", "endpoint"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 40.0, 60.0, 120.0]
)

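The bucket boundaries span fast cached responses (sub-second) through slow completions (60 seconds and up). The client appends a final +Inf bucket automatically, so requests slower than 120 seconds are still counted.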
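
To illustrate why the exported bucket counts are cumulative, here is a minimal stdlib-only sketch of how observations map onto the boundaries above. The function name cumulative_bucket_counts and the sample durations are illustrative assumptions; prometheus_client does this bookkeeping internally.

```python
import math

# Same finite boundaries as the histogram defined above.
BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 40.0, 60.0, 120.0]

def cumulative_bucket_counts(durations, bounds=BUCKETS):
    """Return {upper_bound: number of durations <= upper_bound}, including +Inf."""
    all_bounds = list(bounds) + [math.inf]
    return {le: sum(1 for d in durations if d <= le) for le in all_bounds}

# Hypothetical latencies in seconds: one cache hit, typical calls, one very slow call.
samples = [0.08, 0.9, 1.7, 3.2, 14.0, 150.0]
counts = cumulative_bucket_counts(samples)
print(counts[0.1], counts[2.5], counts[120.0], counts[math.inf])  # prints: 1 3 5 6
```

The 150-second call appears only in the +Inf bucket, which is why a quantile that lands there cannot be resolved more precisely than "above 120 seconds".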

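Wrap each LLM call in a timing function (the same pattern works as a decorator or context manager):

import time

from metrics import llm_latency

def timed_llm_call(model, endpoint):
    # time.monotonic() is unaffected by system clock adjustments
    start = time.monotonic()
    try:
        # call_llm_api stands in for your own client call
        return call_llm_api(model, endpoint)
    finally:
        # Runs on success and on exception, so failed calls are recorded too
        duration = time.monotonic() - start
        llm_latency.labels(model=model, endpoint=endpoint).observe(duration)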
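
The context-manager form mentioned above could look like the following minimal sketch. The name timed and the list used as a recorder are assumptions for illustration, chosen so the example runs without Prometheus installed.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(observe):
    """Time the enclosed block and pass the elapsed seconds to `observe`."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Runs even if the block raises, so failures are measured too.
        observe(time.monotonic() - start)

# Stand-in recorder; in the guide's setup this would be a histogram's observe method.
recorded = []
with timed(recorded.append):
    time.sleep(0.01)  # placeholder for the real LLM call
```

In the guide's setup you would pass llm_latency.labels(model=model, endpoint=endpoint).observe as the callback. prometheus_client histograms also provide a built-in time() helper that can be used as a context manager or decorator, which may make a hand-written wrapper unnecessary.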

For async callers, use the same pattern. time.monotonic() measures the elapsed time across the await, and try/finally ensures that failed requests are recorded as well:

async def async_llm_call(model, endpoint):
    start = time.monotonic()
    try:
        return await async_call_llm_api(model, endpoint)
    finally:
        llm_latency.labels(model=model, endpoint=endpoint).observe(time.monotonic() - start)
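
If several async functions need the same timing, a decorator is one option. This is a sketch under assumptions: timed_async and fake_llm_call are hypothetical names, and a plain list stands in for the histogram's observe method.

```python
import asyncio
import functools
import time

def timed_async(observe):
    """Decorator factory: time an async function and report seconds to `observe`."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return await func(*args, **kwargs)
            finally:
                # Recorded whether the awaited call succeeds or raises.
                observe(time.monotonic() - start)
        return wrapper
    return decorator

durations = []  # stand-in for histogram.observe

@timed_async(durations.append)
async def fake_llm_call(prompt):
    await asyncio.sleep(0.01)  # simulates network latency
    return f"echo: {prompt}"

result = asyncio.run(fake_llm_call("hi"))
```

Because the callback is fixed at decoration time, this form suits functions whose labels do not change per call. When model or endpoint varies per request, the inline timing in async_llm_call is simpler.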

Expose the /metrics endpoint. With FastAPI, use prometheus-fastapi-instrumentator or a manual route returning generate_latest().

Confirm the histogram is scraped by Prometheus (quote the URL so the shell does not interpret the ? character):
```
curl 'http://localhost:9090/api/v1/query?query=llm_request_duration_seconds_count'
```
Expected output: JSON with "status": "success" and at least one result whose value is greater than 0.

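Query the p95 latency over the last 5 minutes in Grafana. To aggregate across label combinations (model, endpoint, and scrape target), wrap the rate() expression in sum by (le) (...); as written, the query returns one series per combination: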

histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))

Create a Grafana panel with three queries for p50, p95, and p99:

histogram_quantile(0.50, rate(llm_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))
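
The histogram_quantile() function used in these queries estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below is a simplified illustration of that estimate (linear interpolation inside the matching bucket), not Prometheus's actual implementation; estimate_quantile and the example counts are assumptions, and it expects 0 < q <= 1 with at least one finite bucket.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    bound, ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if len(buckets) < 2 or total == 0:
        return math.nan  # nothing observed, so no estimate is possible
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies beyond the last finite boundary.
                return buckets[-2][0]
            # Linear interpolation inside the matching bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan  # only reached if q > 1

# Hypothetical cumulative counts for 100 requests.
example = [(0.5, 50), (1.0, 80), (2.5, 95), (5.0, 99), (math.inf, 100)]
print(estimate_quantile(0.90, example))   # about 2.0
print(estimate_quantile(0.995, example))  # 5.0, the highest finite boundary
```

The estimate can never be finer than the bucket layout: the p90 here is placed two thirds of the way through the 1.0 to 2.5 bucket regardless of where the real samples fell inside it.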

Set an alert for p99 exceeding a threshold:

- alert: LLMP99LatencyHigh
  expr: histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m])) > 30
  for: 5m
  annotations:
    summary: "LLM p99 latency exceeds 30 seconds"

Verification

curl -s http://localhost:8000/metrics | grep "llm_request_duration_seconds_bucket"

Expected output: multiple lines showing bucket labels with cumulative counts, e.g., llm_request_duration_seconds_bucket{endpoint="/v1/chat",le="0.5",model="gpt-4o"} 42.0. The order of the labels in your output may differ; counts should never decrease as le increases.

Common failures

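Histogram shows no data: confirm observe() is called after every LLM request. Add a log line above the observe() call during debugging.
p99 query returns NaN: the histogram recorded no observations inside the query window, so every bucket rate is zero. Confirm that requests are flowing, or widen the [5m] range. Separately, high percentiles are noisy at low volume; treat a p99 computed from fewer than about 100 observations as a rough estimate.
Bucket boundaries produce misleading percentiles: histogram_quantile() interpolates linearly within a bucket, and when a quantile falls in the +Inf bucket it returns the highest finite boundary (120 here). If p95 and p99 both sit at that value, add larger buckets (e.g., 180.0, 300.0) to the list.
Label cardinality explosion: every label combination creates a separate time series for each bucket, plus _sum and _count. Avoid high-cardinality labels (like request ID or conversation ID) on histogram metrics.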
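
To make the cardinality cost concrete, the following back-of-the-envelope sketch counts the series one histogram produces. The helper histogram_series_count and the deployment numbers are hypothetical; some client versions also export a _created series per combination, so treat the result as a lower bound.

```python
from math import prod

def histogram_series_count(finite_buckets, label_value_counts):
    """Estimate the time series produced by one histogram metric.

    Each label combination yields one series per bucket (finite buckets
    plus +Inf), plus one _sum and one _count series.
    """
    per_combination = (finite_buckets + 1) + 2
    combinations = prod(label_value_counts)
    return per_combination * combinations

# Hypothetical deployment: 5 models x 3 endpoints, 11 finite buckets.
print(histogram_series_count(11, [5, 3]))          # 210
# Same metric with a request_id label taking 10,000 distinct values.
print(histogram_series_count(11, [5, 3, 10_000]))  # 2100000
```

The count is per scrape target, so running multiple application instances multiplies it again.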

How to monitor LLM response latency percentiles with Prometheus histograms
