HOW-TO · OPS

How to instrument a Python FastAPI AI service with Prometheus metrics

Intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

FastAPI application, prometheus-fastapi-instrumentator

What this does

This guide adds automatic Prometheus metrics to a FastAPI AI inference service using prometheus-fastapi-instrumentator. The instrumentator generates HTTP request metrics (latency histograms, request counts, status codes) and exposes a /metrics endpoint. Custom business metrics, such as inference latency by model, token counts, and error counters, are layered on top with the same Prometheus client library, so HTTP and inference metrics are served from a single endpoint.

Steps

  1. Install the packages:

    pip install prometheus-fastapi-instrumentator prometheus-client
    

    Expected output: a final line such as Successfully installed prometheus-fastapi-instrumentator-7.0.0 prometheus-client-0.19.0 (version numbers will vary, and pip may list additional dependencies).

  2. Add the instrumentator to the FastAPI app. In main.py:

    from fastapi import FastAPI
    from prometheus_fastapi_instrumentator import Instrumentator
    
    app = FastAPI(title="AI Inference Service")
    instrumentator = Instrumentator()
    instrumentator.instrument(app).expose(app)
    
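  3. Customize which HTTP metrics are collected. To add request body size tracking, register the metric before instrument(app) runs, i.e. place this block above the instrument(app).expose(app) line from step 2. Adding any metric explicitly replaces the default set, so metrics.default() is added as well to keep http_requests_total and the latency histograms:

    from prometheus_fastapi_instrumentator import metrics

    instrumentator.add(metrics.default())
    instrumentator.add(
        metrics.request_size(
            should_include_handler=True,
            should_include_method=True,
        )
    )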
    
  4. Add custom inference metrics using the Prometheus client:

    from prometheus_client import Histogram, Counter
    
    inference_latency = Histogram(
        "ai_inference_duration_seconds", "Inference latency",
        ["model"], buckets=[0.1, 0.5, 1, 2.5, 5, 10, 30]
    )
    inference_total = Counter("ai_inference_requests_total", "Total inference requests", ["model", "status"])
    
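  5. Record metrics inside the inference endpoint. InferRequest (a Pydantic model with model and prompt fields) and model (the inference backend) are assumed to be defined elsewhere in the application:

    import time

    @app.post("/v1/infer")
    async def infer(request: InferRequest):
        start = time.monotonic()
        try:
            result = await model.generate(request.prompt)
            inference_total.labels(model=request.model, status="success").inc()
            return result
        except Exception:
            inference_total.labels(model=request.model, status="error").inc()
            raise
        finally:
            # Runs on success and on failure, so failed calls are timed as well.
            inference_latency.labels(model=request.model).observe(time.monotonic() - start)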
    
  6. Start the server and confirm the metrics endpoint:

    uvicorn main:app --host 0.0.0.0 --port 8000 &
    curl -s http://localhost:8000/metrics | head -20
    

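    Expected output: metrics in the Prometheus text format. The first 20 lines are usually the Python runtime collectors (python_gc_* and python_info); http_requests_total, ai_inference_duration_seconds, and ai_inference_requests_total appear further down, so pipe the output through grep rather than head to find them. Labelled series such as ai_inference_duration_seconds_bucket only appear after the first request to /v1/infer.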

  7. Configure Prometheus to scrape the service by adding to prometheus.yml:

    - job_name: "ai-inference"
      scrape_interval: 15s
      static_configs:
        - targets: ["ai-service:8000"]
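
Steps 4 and 5 can be folded into a small helper so that every endpoint records the counter and the histogram the same way. The following is a minimal sketch rather than part of the original service: track_inference is a hypothetical helper name, and a private CollectorRegistry is used only so the snippet runs standalone without touching the default registry that /metrics serves.

```python
import time
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# Private registry so this sketch is self-contained; in the service you would
# omit registry= and let the metrics land in the default registry.
registry = CollectorRegistry()

inference_latency = Histogram(
    "ai_inference_duration_seconds", "Inference latency",
    ["model"], buckets=[0.1, 0.5, 1, 2.5, 5, 10, 30], registry=registry,
)
inference_total = Counter(
    "ai_inference_requests_total", "Total inference requests",
    ["model", "status"], registry=registry,
)


@contextmanager
def track_inference(model_name):
    """Hypothetical helper: count the outcome and time the wrapped block."""
    start = time.monotonic()
    try:
        yield
        inference_total.labels(model=model_name, status="success").inc()
    except Exception:
        inference_total.labels(model=model_name, status="error").inc()
        raise
    finally:
        # Runs on both paths, so failed calls are timed too.
        inference_latency.labels(model=model_name).observe(time.monotonic() - start)


# One successful call and one failing call against a placeholder model name.
with track_inference("test"):
    pass

try:
    with track_inference("test"):
        raise RuntimeError("simulated inference failure")
except RuntimeError:
    pass

# Render the registry in the same text format that /metrics returns.
metrics_text = generate_latest(registry).decode()
print(metrics_text)
```

Under these assumptions, the endpoint body from step 5 would reduce to a with track_inference(request.model): block around the await model.generate(request.prompt) call; a plain context manager works inside an async function.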
    

Verification

curl -s http://localhost:8000/metrics | grep -c "ai_inference"

Expected output: 3 or more, confirming that the custom inference metrics are registered. The count rises after the first call to /v1/infer, when the labelled series start to appear.
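
The grep count above only shows that some matching lines exist. As a rough sketch of a stricter check, the snippet below parses the # TYPE lines of the exposition text and reports which expected metric families are missing. The function name metric_families and the embedded sample text are illustrative assumptions; in practice the text would be the body returned by the /metrics endpoint.

```python
def metric_families(exposition_text):
    """Return {family name: type} for every '# TYPE' line in the text."""
    families = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # A type declaration looks like: # TYPE <name> <type>
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            families[parts[2]] = parts[3]
    return families


# Sample text in the Prometheus exposition format (illustrative only).
sample = """\
# HELP ai_inference_duration_seconds Inference latency
# TYPE ai_inference_duration_seconds histogram
# HELP ai_inference_requests_total Total inference requests
# TYPE ai_inference_requests_total counter
ai_inference_requests_total{model="test",status="success"} 1.0
"""

found = metric_families(sample)
expected = {"ai_inference_duration_seconds", "ai_inference_requests_total"}
missing = expected - set(found)
print("missing:", sorted(missing))
```

An empty missing set means both custom families are registered, regardless of whether any request has been served yet.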

Common failures

  • /metrics returns 404: expose(app) was not called, or was called on a different app object than the one uvicorn serves. Keep the chained order instrument(app).expose(app) and restart.
  • Custom metrics show no samples: labelled metrics emit only their # HELP and # TYPE lines until the inference endpoint has been called. Send a test request: curl -X POST http://localhost:8000/v1/infer -H 'Content-Type: application/json' -d '{"model":"test","prompt":"hello"}'.
  • Prometheus scrapes fail with "connection refused": verify the service binds to 0.0.0.0, not 127.0.0.1, if Prometheus runs on a different host or container.
  • Histogram buckets are flat: check that observe() is called inside the finally block so that timing is recorded even on errors, and that the observed value is in seconds.
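
When diagnosing flat histogram buckets, it helps to recall that Prometheus buckets are cumulative: each le bucket counts every observation less than or equal to its bound. The pure-Python sketch below illustrates that rule using the bucket bounds from step 4; it does not use the Prometheus client, and the latency values are made up for illustration.

```python
import math

# Bucket upper bounds from step 4, plus the implicit +Inf bucket.
BOUNDS = [0.1, 0.5, 1, 2.5, 5, 10, 30, math.inf]


def cumulative_buckets(observations, bounds=BOUNDS):
    """Count, for each bound, how many observations are <= that bound."""
    return {
        bound: sum(1 for value in observations if value <= bound)
        for bound in bounds
    }


# Hypothetical latencies in seconds: two fast calls, one slow, one very slow.
latencies = [0.08, 0.3, 4.2, 45.0]
counts = cumulative_buckets(latencies)

for bound, count in counts.items():
    print(f'le="{bound}" {count}')
```

If every request takes longer than the largest finite bound (30 seconds here), only the +Inf bucket grows and the rest stay flat, which suggests widening the buckets rather than a missing observe() call.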
