14. Prometheus Metrics
Observability infrastructure must capture metrics at a granularity fine enough for production debugging and capacity planning. An inference service exposes four distinct categories of metrics: request latency, resource utilization (CPU, GPU, memory), model performance characteristics, and queue dynamics.
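As a rough illustration of how these categories map onto Prometheus metric types, the sketch below uses the `prometheus_client` Python library: a Counter for request counts, a Histogram for latency, and Gauges for GPU utilization and in-flight requests (a simple proxy for queue dynamics). The metric names, the bucket boundaries, and the helpers `read_gpu_utilization` and `run_model` are assumptions made for this example, not part of any existing service.

```python
import random
import time

from prometheus_client import (
    CollectorRegistry,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

# A private registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()

# Request count, labelled by outcome (exposed as inference_requests_total).
REQUESTS = Counter(
    "inference_requests", "Total inference requests", ["status"], registry=registry
)
# Latency histogram; the 2.0 bucket lines up with a 2-second alert threshold.
LATENCY = Histogram(
    "inference_request_duration_seconds",
    "End-to-end request latency in seconds",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0),
    registry=registry,
)
# Resource utilization and a simple proxy for queue dynamics.
GPU_UTIL = Gauge(
    "inference_gpu_utilization_ratio", "GPU utilization (0 to 1)", registry=registry
)
IN_FLIGHT = Gauge(
    "inference_requests_in_flight", "Requests currently in flight", registry=registry
)


def read_gpu_utilization() -> float:
    """Hypothetical placeholder; a real service might query NVML here."""
    return random.uniform(0.3, 0.9)


def run_model(prompt: str) -> str:
    """Hypothetical stand-in for the actual model call."""
    time.sleep(0.01)
    return prompt.upper()


def handle_request(prompt: str) -> str:
    IN_FLIGHT.inc()
    start = time.perf_counter()
    try:
        result = run_model(prompt)
        REQUESTS.labels(status="ok").inc()
        return result
    except Exception:
        REQUESTS.labels(status="error").inc()
        raise
    finally:
        # Record latency and refresh gauges whether or not the call succeeded.
        LATENCY.observe(time.perf_counter() - start)
        IN_FLIGHT.dec()
        GPU_UTIL.set(read_gpu_utilization())


for text in ("hello", "world", "metrics"):
    handle_request(text)

# Text exposition format, as Prometheus would scrape it from /metrics.
exposition = generate_latest(registry).decode("utf-8")
print(exposition)
```

In a running service, the same metrics would typically be registered on the default registry and served with `prometheus_client.start_http_server(port)`, so that Prometheus can scrape them; the port and scrape configuration are left to the deployment.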
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Instrument a Python inference service with Prometheus metrics covering request count, a latency histogram, and GPU utilization. Deploy Prometheus and Grafana locally. Configure an alerting rule that triggers when P95 latency exceeds 2 seconds. Verify that the alert fires by generating load with artificial delays longer than 2 seconds.
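When reasoning about the P95 threshold, it helps to remember that Prometheus estimates quantiles from cumulative histogram buckets rather than from raw samples. The sketch below is a simplified, standalone approximation of that idea (linear interpolation inside the bucket that contains the target rank); it is not Prometheus's own code, and the function name `estimate_quantile` and the bucket counts are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, with the last upper bound equal to +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite boundary, so the
                # best available answer is that boundary itself.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for 100 requests under artificial delay.
observed = [(0.5, 60), (1.0, 85), (2.0, 94), (5.0, 99), (math.inf, 100)]

p95 = estimate_quantile(0.95, observed)
alert_should_fire = p95 > 2.0
print(f"estimated P95 = {p95:.2f}s, alert_should_fire = {alert_should_fire}")
```

Because the estimate is interpolated within a bucket, its precision depends on the bucket boundaries. Placing a boundary at the alert threshold (2 seconds here) makes it unambiguous whether 95% of requests finished under the limit.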