12. Monitoring Setup
Monitoring provides visibility into production behavior. Metrics reveal performance degradation. Logs help diagnose failures. Traces correlate requests across services. Alerting notifies on-call engineers when issues occur.
Prometheus collects metrics. The backend exposes a /metrics endpoint:
# backend/metrics.py
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from fastapi import FastAPI, Response
# Request metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
'http_request_duration_seconds',
'HTTP request latency',
['method', 'endpoint'],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)
# Model inference metrics
INFERENCE_COUNT = Counter(
'inference_requests_total',
'Total model inference requests',
['status']
)
INFERENCE_LATENCY = Histogram(
'inference_duration_seconds',
'Model inference duration',
buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 120.0]
)
TOKEN_COUNT = Histogram(
'tokens_generated_total',
'Tokens generated per request',
buckets=[10, 50, 100, 250, 500, 1000]
)
# Queue metrics
QUEUE_DEPTH = Gauge(
'processing_queue_depth',
'Number of documents waiting for processing'
)
ACTIVE_REQUESTS = Gauge(
'active_inference_requests',
'Number of currently running inference requests'
)
Prometheus configuration scrapes metrics from each service:
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'backend'
static_configs:
- targets: ['backend:8000']
metrics_path: '/metrics'
- job_name: 'model_server'
static_configs:
- targets: ['model_server:8080']
metrics_path: '/metrics'
- job_name: 'nginx'
static_configs:
- targets: ['nginx:9113']
Grafana dashboards visualize key metrics. The inference dashboard shows latency percentiles, throughput, error rate, and queue depth. Alerting rules trigger on anomalies:
# alerting rules
groups:
- name: ai_app_alerts
rules:
- alert: HighInferenceLatency
expr: histogram_quantile(0.95, rate(inference_duration_seconds_bucket[5m])) > 60
for: 5m
labels:
severity: warning
annotations:
summary: "High inference latency detected"
description: "95th percentile latency is {{ $value }}s"
- alert: ModelServerDown
expr: up{job="model_server"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Model server is down"
- alert: QueueBacklog
expr: processing_queue_depth > 100
for: 10m
labels:
severity: warning
annotations:
summary: "Processing queue is backing up"
Logging uses structured JSON for easier parsing. Include trace IDs in every log line for request correlation:
import logging
import json
from contextvars import ContextVar
trace_id: ContextVar[str] = ContextVar('trace_id', default='no-trace')
class StructuredFormatter(logging.Formatter):
def format(self, record):
log_data = {
'timestamp': self.formatTime(record),
'level': record.levelname,
'message': record.getMessage(),
'trace_id': trace_id.get(),
'service': 'backend'
}
if hasattr(record, 'extra'):
log_data.update(record.extra)
return json.dumps(log_data)
Set up Prometheus, Grafana, and a dashboard for the AI application. Add alerts for model server downtime and high latency.