What this does

This guide adds automatic Prometheus metrics to a FastAPI AI inference service using prometheus-fastapi-instrumentator. The Instrumentator auto-generates HTTP request metrics (latency histograms, request counts, status codes) and exposes a /metrics endpoint. Custom business metrics, such as inference latency by model, token counts, and error counters, are layered on top using the same Prometheus client library, so HTTP and inference metrics are served from a single endpoint.

Steps

Install the packages:
```
pip install prometheus-fastapi-instrumentator prometheus-client
```
Expected output: a line starting with "Successfully installed" that lists both packages, for example prometheus-fastapi-instrumentator-7.0.0 and prometheus-client-0.19.0 (version numbers will vary).

Add the Instrumentator to the FastAPI app. In main.py:

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI(title="AI Inference Service")
instrumentator = Instrumentator()
instrumentator.instrument(app).expose(app)

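Customize which HTTP metrics are collected. Register extra metrics with add() before the instrument(app) call. To add request body size tracking while keeping the default HTTP metrics:

from prometheus_fastapi_instrumentator import metrics

# Once custom instrumentations are added, the default HTTP metrics are no longer
# registered automatically, so metrics.default() is added back explicitly.
instrumentator.add(metrics.default())
instrumentator.add(
    metrics.request_size(
        should_include_handler=True,
        should_include_method=True,
    )
)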

Add custom inference metrics using the Prometheus client:

from prometheus_client import Histogram, Counter

inference_latency = Histogram(
    "ai_inference_duration_seconds", "Inference latency",
    ["model"], buckets=[0.1, 0.5, 1, 2.5, 5, 10, 30]
)
inference_total = Counter("ai_inference_requests_total", "Total inference requests", ["model", "status"])
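
To see what the buckets list means in practice, the following standalone sketch mimics how a Prometheus histogram counts observations: each bucket is cumulative, counting every observation less than or equal to its upper bound, with an implicit +Inf bucket at the end. The function name cumulative_bucket_counts and the sample latencies are illustrative assumptions, not part of prometheus_client.

```python
import math

# Same upper bounds (in seconds) as the ai_inference_duration_seconds histogram.
BUCKETS = [0.1, 0.5, 1, 2.5, 5, 10, 30]


def cumulative_bucket_counts(observations, buckets=BUCKETS):
    """Count observations per bucket the way Prometheus does (cumulatively).

    Hypothetical helper for illustration; prometheus_client does this internally.
    """
    bounds = list(buckets) + [math.inf]  # Prometheus always adds a +Inf bucket
    return {le: sum(1 for value in observations if value <= le) for le in bounds}


# Made-up latencies in seconds: three fast calls, one slow, one beyond the last bound.
sample_latencies = [0.05, 0.3, 0.8, 7.0, 45.0]
counts = cumulative_bucket_counts(sample_latencies)
for le, count in counts.items():
    print(f"le={le}: {count}")
```

In this sample the 45-second call lands only in the +Inf bucket, which would be a hint that a top bound of 30 is too low if long generations are common.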

Record metrics inside the inference endpoint:

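import time
from pydantic import BaseModel

class InferRequest(BaseModel):
    model: str
    prompt: str

# `model` is the loaded inference backend; it is assumed to expose an async generate().
@app.post("/v1/infer")
async def infer(request: InferRequest):
    start = time.monotonic()
    try:
        result = await model.generate(request.prompt)
        inference_total.labels(model=request.model, status="success").inc()
        return result
    except Exception:
        inference_total.labels(model=request.model, status="error").inc()
        raise
    finally:
        # Runs on success and failure, so failed calls are timed too.
        inference_latency.labels(model=request.model).observe(time.monotonic() - start)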
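
The try/except/finally pattern above can be factored into a reusable context manager so that every endpoint records metrics the same way. The sketch below uses only the standard library: track_inference and the record callback are hypothetical names, and in the service record would wrap inference_total.labels(...).inc() and inference_latency.labels(...).observe(...).

```python
import time
from contextlib import contextmanager


@contextmanager
def track_inference(model_name, record):
    """Time the wrapped block and report (model, status, seconds) exactly once.

    `record` is a hypothetical callback; in the service it would update the
    Prometheus counter and histogram defined earlier.
    """
    start = time.monotonic()
    status = "success"
    try:
        yield
    except Exception:
        status = "error"
        raise  # re-raise so the caller still sees the failure
    finally:
        # Runs on both paths, so failed calls are timed too.
        record(model_name, status, time.monotonic() - start)


# Usage with a stand-in recorder that just collects the calls.
recorded = []


def collect(model_name, status, seconds):
    recorded.append((model_name, status, seconds))


with track_inference("test", collect):
    pass  # stands in for a successful inference call

try:
    with track_inference("test", collect):
        raise RuntimeError("model failed")
except RuntimeError:
    pass

print([(name, status) for name, status, _ in recorded])
```

A plain with block of this kind can also wrap an await inside an async def handler, since the timing code itself never awaits.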

Start the server and confirm the metrics endpoint:
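```
uvicorn main:app --host 0.0.0.0 --port 8000 &
sleep 2  # give the server a moment to start
curl -s http://localhost:8000/metrics | grep -E "^# TYPE (http_|ai_inference)"
```
Expected output: # TYPE lines for http_requests_total, ai_inference_duration_seconds, and ai_inference_requests_total. Sample lines such as ai_inference_duration_seconds_bucket appear only after the inference endpoint has handled at least one request.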

Configure Prometheus to scrape the service by adding this job under scrape_configs in prometheus.yml (ai-service is the hostname Prometheus uses to reach the service):

- job_name: "ai-inference"
  scrape_interval: 15s
  static_configs:
    - targets: ["ai-service:8000"]

Verification

curl -s http://localhost:8000/metrics | grep -c "ai_inference"

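Expected output: a count of 3 or more. grep -c counts matching lines, so the # HELP and # TYPE lines of the custom metrics are included even before any inference request has been made; this confirms the custom inference metrics are registered alongside the HTTP metrics.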
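
For automated checks (for example in CI), the same verification can be done by parsing the exposition text instead of counting grep matches. This is a minimal sketch using only the standard library; metric_families and the embedded sample text are illustrative assumptions, and the parser reads only # TYPE lines rather than implementing the full exposition format.

```python
def metric_families(exposition_text):
    """Return {metric_name: metric_type} from '# TYPE <name> <type>' lines."""
    families = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            families[parts[2]] = parts[3]
    return families


# Abbreviated, made-up sample of what /metrics might return.
SAMPLE = """\
# HELP http_requests_total Total number of requests by method, status and handler.
# TYPE http_requests_total counter
# HELP ai_inference_duration_seconds Inference latency
# TYPE ai_inference_duration_seconds histogram
# HELP ai_inference_requests_total Total inference requests
# TYPE ai_inference_requests_total counter
"""

# Metric names and types defined earlier in this guide.
EXPECTED = {
    "ai_inference_duration_seconds": "histogram",
    "ai_inference_requests_total": "counter",
}

families = metric_families(SAMPLE)
missing = [name for name in EXPECTED if families.get(name) != EXPECTED[name]]
print("missing:", missing)
```

To run the check against the live service, SAMPLE could be replaced with the response body, for instance urllib.request.urlopen("http://localhost:8000/metrics").read().decode().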

Common failures

/metrics returns 404: the endpoint is only registered by expose(app). Check that instrument(app).expose(app) is called, in that order, on the same app object that uvicorn serves, then restart.
Custom metrics show no samples: labeled metrics export only their # HELP and # TYPE lines until a label combination has been used, so they stay empty until the inference endpoint is called. Send a test request: curl -X POST http://localhost:8000/v1/infer -H 'Content-Type: application/json' -d '{"model":"test","prompt":"hello"}'.
Prometheus scrapes fail with "connection refused": verify the service binds to 0.0.0.0, not 127.0.0.1, if Prometheus runs on a different host or container.
Histogram buckets are flat: check that observe() is called inside the finally block so that timing is recorded even on errors.

How to instrument a Python FastAPI AI service with Prometheus metrics
