14. Prometheus Metrics

Chapter 14 of 24 · 20 min

KEY INSIGHT

Metrics collection overhead must remain below 1% of inference compute capacity; aggressive sampling and aggregated histograms prevent instrumentation from becoming a bottleneck.

### Prometheus Integration

The Triton Inference Server exposes Prometheus metrics on port 8002 by default:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'inference-servers'
    static_configs:
      - targets:
          - 'model-server-1:8002'
          - 'model-server-2:8002'
          - 'model-server-3:8002'
    metrics_path: /metrics
    relabel_configs:
      # Strip the port so the instance label holds only the hostname
      - source_labels: [__address__]
        target_label: instance
        regex: '(.*):\d+'
        replacement: '${1}'
```

### Custom Metrics with Prometheus Client

Expose application-specific metrics for model inference:

```python
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)
from starlette.applications import Starlette
from starlette.responses import Response
from starlette.routing import Route

# Request metrics
inference_requests = Counter(
    'inference_requests_total',
    'Total inference requests',
    ['model_name', 'status']
)

inference_latency = Histogram(
    'inference_latency_seconds',
    'Inference request latency',
    ['model_name'],
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)
)

# Resource metrics
gpu_memory_used = Gauge(
    'gpu_memory_used_bytes',
    'GPU memory currently in use',
    ['device_id']
)

request_queue_depth = Gauge(
    'inference_queue_depth',
    'Number of requests waiting for processing',
    ['model_name']
)


async def metrics_endpoint(request):
    # Serialize every metric in the default registry in the Prometheus text format
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )


routes = [
    Route('/metrics', metrics_endpoint),
]

app = Starlette(routes=routes)
```

### Alerting Rules

Define alerting thresholds for inference infrastructure:

```yaml
groups:
  - name: inference_alerts
    rules:
      - alert: HighInferenceLatency
        # histogram_quantile operates on the per-bucket rate of the _bucket series
        expr: histogram_quantile(0.95, sum by (le, model_name) (rate(inference_latency_seconds_bucket[5m]))) > 5.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency exceeds 5 seconds"

      - alert: GPUMemoryExhausted
        # Requires a gpu_memory_total_bytes gauge with the same device_id label;
        # the Python snippet above does not define one
        expr: (gpu_memory_used_bytes / gpu_memory_total_bytes) > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage above 95%"

      - alert: RequestQueueBacklog
        expr: inference_queue_depth > 100
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Request queue depth above 100"
```
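
A latency alert built on `histogram_quantile` is only as precise as the bucket layout, because Prometheus interpolates linearly inside the bucket that contains the target rank. The sketch below is a simplified, standard-library illustration of that interpolation, not Prometheus's actual implementation; the function name `estimate_quantile` and the sample counts are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket, mirroring the `le` series
    that a Prometheus histogram exposes.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if upper == math.inf:
        # The rank falls in the overflow bucket: the best available answer
        # is the highest finite boundary.
        return buckets[-2][0]

    # The lowest bucket is assumed to start at zero.
    lower = buckets[index - 1][0] if index > 0 else 0.0
    previous = buckets[index - 1][1] if index > 0 else 0
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# Hypothetical cumulative counts for the chapter's bucket boundaries.
sample = [
    (0.1, 10), (0.25, 30), (0.5, 60), (1.0, 80),
    (2.5, 90), (5.0, 96), (10.0, 100), (math.inf, 100),
]
p95 = estimate_quantile(0.95, sample)
```

With these sample counts the 95th-percentile rank lands in the 2.5 to 5.0 second bucket, so the estimate is interpolated across a 2.5-second range. If an alert threshold sits far from any bucket boundary, adding a boundary at or near the threshold tightens the estimate.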

Observability infrastructure must capture metrics at granularity sufficient for production debugging and capacity planning. Inference serving exposes distinct metric categories: request latencies, resource utilization, model performance characteristics, and queue dynamics.
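
As a rough illustration of how these categories map onto instrumentation, the sketch below wraps one inference call so that it records request count, latency, and an in-flight gauge. It assumes the `prometheus_client` package is installed. The `run_inference` wrapper, the `predict` callable, and the `inference_in_flight` metric are hypothetical names introduced for this example.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

requests_total = Counter(
    'inference_requests_total', 'Total inference requests',
    ['model_name', 'status'], registry=registry,
)
latency_seconds = Histogram(
    'inference_latency_seconds', 'Inference request latency',
    ['model_name'], buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
    registry=registry,
)
# Hypothetical gauge: counts calls currently executing. A real queue-depth
# gauge would be incremented on enqueue and decremented when a worker
# picks the request up.
in_flight = Gauge(
    'inference_in_flight', 'Inference calls currently executing',
    ['model_name'], registry=registry,
)


def run_inference(model_name, payload, predict):
    """Call `predict(payload)` and record count, latency, and in-flight state."""
    in_flight.labels(model_name=model_name).inc()
    start = time.perf_counter()
    status = 'error'
    try:
        result = predict(payload)
        status = 'success'
        return result
    finally:
        # Runs on both success and failure, so failed calls are measured too.
        latency_seconds.labels(model_name=model_name).observe(
            time.perf_counter() - start
        )
        requests_total.labels(model_name=model_name, status=status).inc()
        in_flight.labels(model_name=model_name).dec()
```

Calling `run_inference('demo', 1, lambda x: x + 1)` returns `2` and leaves one `success` sample in the counter. Recording in the `finally` block is a deliberate choice: latency and error counts stay accurate even when the model call raises.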

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Instrument a Python inference service with Prometheus metrics covering request count, latency histogram, and GPU utilization. Deploy Prometheus and Grafana locally. Configure alerting rules that trigger when P95 latency exceeds 2 seconds. Verify alerts fire correctly by generating load with artificial delays.