How to instrument a Python FastAPI AI service with Prometheus metrics
FastAPI application, prometheus-fastapi-instrumentator
What this does
This guide adds automatic Prometheus metrics to a FastAPI AI inference service using prometheus-fastapi-instrumentator. The instrumentor auto-generates HTTP request metrics (latency histograms, request counts, status codes) and exposes a /metrics endpoint. Custom business metrics — inference latency by model, token counts, error counters — are layered on top using the same Prometheus client library for a unified monitoring surface.
Steps
1. Install the packages:

       pip install prometheus-fastapi-instrumentator prometheus-client

   Expected output (version numbers may differ):

       Successfully installed prometheus-fastapi-instrumentator-7.0.0 prometheus-client-0.19.0

2. Add the instrumentor to the FastAPI app. In main.py:

       from fastapi import FastAPI
       from prometheus_fastapi_instrumentator import Instrumentator

       app = FastAPI(title="AI Inference Service")

       instrumentator = Instrumentator()
       instrumentator.instrument(app).expose(app)

3. Customize which HTTP metrics are collected. To add request body size tracking, import the metrics module and place the add() calls before instrument(app):

       from prometheus_fastapi_instrumentator import metrics

       # Adding any metric explicitly is understood to replace the default set,
       # so re-add the defaults to keep http_requests_total and the latency histograms.
       instrumentator.add(metrics.default())
       instrumentator.add(
           metrics.request_size(
               should_include_handler=True,
               should_include_method=True,
           )
       )

4. Add custom inference metrics using the Prometheus client:

       from prometheus_client import Counter, Histogram

       inference_latency = Histogram(
           "ai_inference_duration_seconds",
           "Inference latency",
           ["model"],
           buckets=[0.1, 0.5, 1, 2.5, 5, 10, 30],
       )
       inference_total = Counter(
           "ai_inference_requests_total",
           "Total inference requests",
           ["model", "status"],
       )

5. Record metrics inside the inference endpoint. InferRequest is assumed to be a Pydantic model with model and prompt fields, and model is the inference backend your service already loads:

       import time

       @app.post("/v1/infer")
       async def infer(request: InferRequest):
           start = time.monotonic()
           try:
               result = await model.generate(request.prompt)
               inference_total.labels(model=request.model, status="success").inc()
               return result
           except Exception:
               inference_total.labels(model=request.model, status="error").inc()
               raise
           finally:
               # Runs on success and on error, so every request is timed.
               inference_latency.labels(model=request.model).observe(time.monotonic() - start)

6. Start the server and confirm the metrics endpoint. Filtering with grep is more reliable than reading the first lines of the output, because the Python runtime and process metrics are listed first:

       uvicorn main:app --host 0.0.0.0 --port 8000 &
       sleep 2  # give the server a moment to start
       curl -s http://localhost:8000/metrics | grep -E "http_requests_total|ai_inference"

   Expected output: lines mentioning http_requests_total and ai_inference_requests_total, plus ai_inference_duration_seconds_bucket samples once /v1/infer has handled at least one request.

7. Configure Prometheus to scrape the service by adding a job under scrape_configs in prometheus.yml:

       scrape_configs:
         - job_name: "ai-inference"
           scrape_interval: 15s
           static_configs:
             - targets: ["ai-service:8000"]
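The try/except/finally pattern used in the inference endpoint can be factored into a small context manager, which keeps timing and counting consistent if more endpoints are added later. The following is a minimal sketch rather than part of either library: track_inference is a hypothetical helper name, and it assumes the histogram and counter passed in use the same label names as the metrics defined in the steps above (model for the histogram, model and status for the counter).

```python
import time
from contextlib import contextmanager


@contextmanager
def track_inference(latency_hist, requests_counter, model_name):
    """Time the wrapped block and count its outcome.

    latency_hist and requests_counter are assumed to behave like the
    prometheus_client Histogram and Counter defined in the steps above:
    labels(...) returns a child exposing observe() or inc().
    """
    start = time.monotonic()
    try:
        yield
    except Exception:
        # The wrapped block raised: count an error and re-raise unchanged.
        requests_counter.labels(model=model_name, status="error").inc()
        raise
    else:
        requests_counter.labels(model=model_name, status="success").inc()
    finally:
        # Runs on both paths, so failed requests are timed as well.
        latency_hist.labels(model=model_name).observe(time.monotonic() - start)
```

Inside the endpoint, the body would then reduce to something like `with track_inference(inference_latency, inference_total, request.model): result = await model.generate(request.prompt)`. A synchronous context manager is sufficient here because it only records timestamps and does not await anything itself.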
Verification
curl -s http://localhost:8000/metrics | grep -c "ai_inference"
Expected output: 3 or more (confirming the custom inference metrics are present alongside HTTP metrics).
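The same check can be scripted, for example in a smoke test. The sketch below uses only the Python standard library; the function names are illustrative, and the default URL assumes the service is running locally on port 8000 as in this guide. count_matching_lines mirrors grep -c, while sample_names ignores the # HELP and # TYPE comment lines and reports only metrics that actually have samples.

```python
import urllib.request


def count_matching_lines(metrics_text: str, needle: str = "ai_inference") -> int:
    """Count lines containing `needle`, like `grep -c`."""
    return sum(1 for line in metrics_text.splitlines() if needle in line)


def sample_names(metrics_text: str, prefix: str = "ai_inference") -> set:
    """Return names of sample lines (not comments) that start with `prefix`."""
    names = set()
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # A sample line looks like `name{label="v"} 1.0` or `name 1.0`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the exposition text from the running service."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Calling `sample_names(fetch_metrics())` after sending one request to /v1/infer should return a non-empty set if the custom metrics are being recorded.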
Common failures
- /metrics returns 404: the instrumentor's expose(app) call must come after instrument(app), and on the same app object. Fix the order and restart.
- Custom metrics appear empty or as zero: the inference endpoint hasn't been called yet, so no label combination has been recorded and only the # HELP and # TYPE lines are present. Send a test request:

      curl -X POST http://localhost:8000/v1/infer -H 'Content-Type: application/json' -d '{"model":"test","prompt":"hello"}'

- Prometheus scrapes fail with "connection refused": verify the service binds to 0.0.0.0, not 127.0.0.1, if Prometheus runs on a different host or container.
- Histogram buckets are flat: check that observe() is called inside the finally block so that timing is recorded even on errors.
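For the case where custom metrics show nothing until the first request, one option is to create the label combinations at startup so that each series is exported with a value of zero. The sketch below illustrates this with prometheus_client; KNOWN_MODELS is a hypothetical list standing in for whatever models your service serves, and a private CollectorRegistry is used only so the example can run in isolation (the service itself would use the default registry).

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Private registry so this sketch does not clash with metrics defined elsewhere.
registry = CollectorRegistry()

inference_total = Counter(
    "ai_inference_requests_total",
    "Total inference requests",
    ["model", "status"],
    registry=registry,
)

KNOWN_MODELS = ["test"]  # hypothetical: replace with the models you serve


def has_samples(reg: CollectorRegistry) -> bool:
    """True if the counter has at least one labelled sample in the output."""
    text = generate_latest(reg).decode("utf-8")
    return any(
        line.startswith("ai_inference_requests_total{")
        for line in text.splitlines()
    )


before = has_samples(registry)  # no label combination created yet

# Calling labels() creates the child series, which starts at zero.
for model_name in KNOWN_MODELS:
    for status in ("success", "error"):
        inference_total.labels(model=model_name, status=status)

after = has_samples(registry)
```

This only makes sense when the set of model names is small and known in advance; label values taken from arbitrary request input would create a new series per distinct value.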