HOW-TO · OPS

How to monitor LLM response latency percentiles with Prometheus histograms

Intermediate · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

LLM API endpoint, Prometheus client library

What this does

This guide instruments LLM API calls with Prometheus histogram metrics to capture response latency distributions. Average latency alone hides slow outliers; a histogram records per-bucket counts from which Prometheus estimates the p50, p95, and p99 percentiles, revealing tail latency that degrades user experience. The histogram buckets are tuned for LLM workloads, where response times range from hundreds of milliseconds to tens of seconds.
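
To make the gap between the average and the tail concrete, here is a minimal standard-library sketch. The latency values are hypothetical (90 fast responses and 10 slow completions), and the percentiles are computed exactly from raw samples rather than estimated from buckets as Prometheus does.

```python
import statistics

# Hypothetical latencies in seconds: 90 fast responses and 10 slow completions.
latencies = [0.8] * 90 + [45.0] * 10

mean = statistics.fmean(latencies)

# quantiles(n=100) returns the 99 cut points p1..p99 of the sample.
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean:.2f}s p50={p50:.2f}s p95={p95:.2f}s p99={p99:.2f}s")
```

The mean lands between the typical request and the slow ones and describes neither, which is why the rest of this guide tracks percentiles.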

Steps

  1. Define a histogram in a metrics.py module with LLM-appropriate buckets:

    from prometheus_client import Histogram
    llm_latency = Histogram(
        "llm_request_duration_seconds",
        "LLM API request duration in seconds",
        ["model", "endpoint"],
        buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 40.0, 60.0, 120.0]
    )
    

    The bucket boundaries span fast cached responses (sub-second) through slow completions (up to 120 seconds); anything slower is counted only in the implicit +Inf bucket.

  2. Wrap each LLM call with a timing decorator or context manager:

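    import time

    from metrics import llm_latency  # histogram defined in step 1

    def timed_llm_call(model, endpoint):
        start = time.monotonic()
        try:
            # call_llm_api is a placeholder for your own client function
            return call_llm_api(model, endpoint)
        finally:
            # Runs on success and on failure, so errors and timeouts are timed too
            duration = time.monotonic() - start
            llm_latency.labels(model=model, endpoint=endpoint).observe(duration)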
    
  3. For async callers, use time.monotonic() consistently:

    async def async_llm_call(model, endpoint):
        start = time.monotonic()
        try:
            return await async_call_llm_api(model, endpoint)
        finally:
            # Record the duration even if the request raises or is cancelled
            llm_latency.labels(model=model, endpoint=endpoint).observe(time.monotonic() - start)
    
  4. Expose the /metrics endpoint. With FastAPI, use prometheus-fastapi-instrumentator or a manual route returning generate_latest().

  5. Confirm the histogram is scraped by Prometheus:

    curl 'http://localhost:9090/api/v1/query?query=llm_request_duration_seconds_count'
    

    Expected output: JSON with "status": "success" and a non-empty result whose value is greater than 0.

  6. Query the p95 latency over the last 5 minutes in Grafana:

    histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))
    
  7. Create a Grafana panel with three queries for p50, p95, and p99:

    histogram_quantile(0.50, rate(llm_request_duration_seconds_bucket[5m]))
    histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))
    histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))
    
  8. Set an alert for p99 exceeding a threshold:

    - alert: LLMP99LatencyHigh
      expr: histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m])) > 30
      for: 5m
      annotations:
        summary: "LLM p99 latency exceeds 30 seconds"
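
Steps 2 and 4 can also be sketched with the client library's built-in timer. This is a minimal sketch, not the only way to do it: it assumes the prometheus_client package is installed, uses a private CollectorRegistry so it does not touch the process-wide default, and relies on a hypothetical fake_llm_call() and a hypothetical model name in place of a real request.

```python
import time

from prometheus_client import CollectorRegistry, Histogram, generate_latest

# A private registry keeps this sketch isolated from the default registry.
registry = CollectorRegistry()
llm_latency = Histogram(
    "llm_request_duration_seconds",
    "LLM API request duration in seconds",
    ["model", "endpoint"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 40.0, 60.0, 120.0],
    registry=registry,
)


def fake_llm_call():
    """Hypothetical stand-in for a real LLM request."""
    time.sleep(0.01)
    return "ok"


# .time() works as a context manager (or decorator) and records the elapsed
# time when the block exits, including when it exits with an exception.
with llm_latency.labels(model="llama3", endpoint="/api/generate").time():
    fake_llm_call()

# generate_latest() returns the exposition-format bytes that a manual
# /metrics route would send back to Prometheus.
payload = generate_latest(registry).decode()
print(payload)
```

In a real service you would normally keep the default registry and return the same bytes from the /metrics route with the content type given by prometheus_client.CONTENT_TYPE_LATEST.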
    

Verification

curl -s http://localhost:8000/metrics | grep "llm_request_duration_seconds_bucket"

Expected output: multiple lines showing bucket labels with cumulative counts, e.g., llm_request_duration_seconds_bucket{endpoint="/v1/chat",le="0.5",model="gpt-4o"} 42.0. The Python client prints labels in alphabetical order.
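
If you want to check the output in a script rather than by eye, the sketch below parses bucket lines and confirms that the counts are cumulative. It is an illustration only: the SAMPLE text is hypothetical, the helper names are made up for this example, and the parser assumes a single model/endpoint combination and label values that contain no closing brace.

```python
import re

# Hypothetical /metrics excerpt for one model/endpoint combination.
SAMPLE = """\
llm_request_duration_seconds_bucket{endpoint="/v1/chat",le="0.5",model="gpt-4o"} 42.0
llm_request_duration_seconds_bucket{endpoint="/v1/chat",le="1.0",model="gpt-4o"} 57.0
llm_request_duration_seconds_bucket{endpoint="/v1/chat",le="+Inf",model="gpt-4o"} 60.0
llm_request_duration_seconds_count{endpoint="/v1/chat",model="gpt-4o"} 60.0
"""

BUCKET_LINE = re.compile(
    r"^llm_request_duration_seconds_bucket\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$"
)
# Matches the le label wherever it appears in the label list.
LE_LABEL = re.compile(r'(?:^|,)le="([^"]+)"')


def bucket_counts(metrics_text):
    """Return (upper_bound, cumulative_count) pairs sorted by upper bound."""
    pairs = []
    for line in metrics_text.splitlines():
        match = BUCKET_LINE.match(line.strip())
        if match is None:
            continue  # not a bucket sample (comment, _count, _sum, ...)
        le = LE_LABEL.search(match.group("labels"))
        if le is None:
            continue
        # float() accepts "+Inf", so the overflow bucket sorts last.
        pairs.append((float(le.group(1)), float(match.group("value"))))
    return sorted(pairs)


def is_cumulative(pairs):
    """True when counts never decrease as the upper bound grows."""
    counts = [count for _, count in pairs]
    return all(a <= b for a, b in zip(counts, counts[1:]))


print(is_cumulative(bucket_counts(SAMPLE)))
```

To run it against a live service, replace SAMPLE with the body returned by the curl command above.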

Common failures

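  • Histogram shows no data: confirm observe() is called after every LLM request. Add a log line above the observe() call during debugging.
  • p99 query returns NaN: no observations were recorded inside the query's range window, so every bucket rate is zero. Send traffic or widen the window (e.g., [15m]). Separately, high percentiles are noisy at low volume; aim for at least 100 observations in the window before trusting p99.
  • Bucket boundaries produce misleading percentiles: when a quantile falls in the +Inf bucket, histogram_quantile() returns the highest finite boundary, so p95 and p99 both read exactly 120. If that happens, add larger buckets (e.g., 180.0, 300.0) to the list.
  • Label cardinality explosion: avoid high-cardinality labels (like request ID or conversation ID) on histogram metrics; each label combination creates a full set of bucket series.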
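
The NaN and top-bucket failures are easier to reason about with a simplified model of the estimate. The sketch below approximates the linear interpolation that histogram_quantile() performs; it works on hypothetical raw cumulative counts rather than rate() output, assumes at least one finite bucket plus +Inf, and is not Prometheus's actual implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations: nothing to estimate
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the last finite boundary, so the best
        # available answer is that boundary itself.
        return buckets[-2][0]
    lower, prev = buckets[index - 1] if index > 0 else (0.0, 0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (cum - prev)


# Hypothetical counts: 5 of 100 requests took longer than 120 seconds.
buckets = [(1.0, 50), (5.0, 90), (120.0, 95), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(q, estimate_quantile(q, buckets))
```

With these counts the p99 estimate is pinned to the 120-second boundary no matter how slow the slowest requests were, which is the signal to add larger buckets.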
