RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Capstone: Full-Stack AI App

12. Monitoring Setup

Chapter 12 of 18 · 20 min
KEY INSIGHT

Monitor what you would debug in an incident: latency, errors, and throughput for every critical path.

Monitoring provides visibility into production behavior. Metrics reveal performance degradation. Logs help diagnose failures. Traces correlate requests across services. Alerting notifies on-call engineers when issues occur.

Prometheus collects metrics by pulling them over HTTP. The backend exposes a /metrics endpoint for it to scrape:

# backend/metrics.py
from fastapi import FastAPI, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

# Request metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# Model inference metrics
INFERENCE_COUNT = Counter(
    'inference_requests_total',
    'Total model inference requests',
    ['status']
)

INFERENCE_LATENCY = Histogram(
    'inference_duration_seconds',
    'Model inference duration',
    buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 120.0]
)

# By convention the _total suffix marks counters, so the histogram omits it.
TOKEN_COUNT = Histogram(
    'tokens_generated',
    'Tokens generated per request',
    buckets=[10, 50, 100, 250, 500, 1000]
)

# Queue metrics
QUEUE_DEPTH = Gauge(
    'processing_queue_depth',
    'Number of documents waiting for processing'
)

ACTIVE_REQUESTS = Gauge(
    'active_inference_requests',
    'Number of currently running inference requests'
)

# Assumption: in the real backend, register this route on the existing
# FastAPI app instead of creating a new one here.
app = FastAPI()

@app.get('/metrics')
def metrics() -> Response:
    # Serialize every registered metric in the Prometheus text format.
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
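
Declaring the metrics does not record anything by itself; something has to update them around each model call. The sketch below shows one way to do that with a context manager. The helper `track_inference` and the stand-in `fake_generate` are hypothetical names, not part of the chapter's backend, and the private `CollectorRegistry` is only there so the block runs in isolation.

```python
import time
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A private registry keeps this sketch isolated; the chapter's metrics.py
# registers on the default registry instead.
registry = CollectorRegistry()

INFERENCE_COUNT = Counter(
    'inference_requests_total', 'Total model inference requests',
    ['status'], registry=registry,
)
INFERENCE_LATENCY = Histogram(
    'inference_duration_seconds', 'Model inference duration',
    buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 120.0], registry=registry,
)
ACTIVE_REQUESTS = Gauge(
    'active_inference_requests',
    'Number of currently running inference requests', registry=registry,
)


@contextmanager
def track_inference():
    """Hypothetical helper: record count, duration, and concurrency for one call."""
    ACTIVE_REQUESTS.inc()
    start = time.perf_counter()
    status = 'success'
    try:
        yield
    except Exception:
        status = 'error'
        raise
    finally:
        # Runs on success and failure, so the gauge always returns to its
        # previous value and failed calls still get a latency sample.
        INFERENCE_LATENCY.observe(time.perf_counter() - start)
        INFERENCE_COUNT.labels(status=status).inc()
        ACTIVE_REQUESTS.dec()


def fake_generate(prompt: str) -> str:
    """Stand-in for the real model call, which the chapter does not show."""
    if not prompt:
        raise ValueError('empty prompt')
    return prompt.upper()


with track_inference():
    fake_generate('hello')

try:
    with track_inference():
        fake_generate('')
except ValueError:
    pass  # the failure was still counted under status="error"
```

Keeping the `status` label to a small fixed set of values (here `success` and `error`) avoids creating a new time series for every distinct failure message.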

The Prometheus configuration lists each service as a scrape target:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'backend'
    static_configs:
      - targets: ['backend:8000']
    metrics_path: '/metrics'
    
  - job_name: 'model_server'
    static_configs:
      - targets: ['model_server:8080']
    metrics_path: '/metrics'
    
  - job_name: 'nginx'
    # Assumes nginx-prometheus-exporter (default port 9113) is reachable at this host.
    static_configs:
      - targets: ['nginx:9113']

Grafana dashboards visualize key metrics. The inference dashboard shows latency percentiles, throughput, error rate, and queue depth. Alerting rules fire when a condition stays true for the duration given in `for`:

# alerting rules
groups:
  - name: ai_app_alerts
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(inference_duration_seconds_bucket[5m])) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High inference latency detected"
          description: "95th percentile latency is {{ $value }}s"
          
      - alert: ModelServerDown
        expr: up{job="model_server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Model server is down"
          
      - alert: QueueBacklog
        expr: processing_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Processing queue is backing up"
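
The HighInferenceLatency rule relies on `histogram_quantile`, which estimates a percentile from bucket counts and never sees individual request durations. The sketch below is a simplified Python rendering of the linear interpolation described in the Prometheus documentation, not the server's actual implementation; the bucket counts are made-up data for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with +Inf, and q is
    expected to satisfy 0 < q <= 1.
    """
    if len(buckets) < 2:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket, so the best available answer
        # is the highest finite upper bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


INF = math.inf

# Cumulative counts per bucket of inference_duration_seconds (made-up data).
observed = [(1.0, 10), (5.0, 40), (10.0, 70), (30.0, 90),
            (60.0, 98), (120.0, 100), (INF, 100)]
p95 = histogram_quantile(0.95, observed)  # lands in the (30, 60] bucket

# Every request slower than the largest finite bucket.
all_slow = [(1.0, 0), (5.0, 0), (10.0, 0), (30.0, 0),
            (60.0, 0), (120.0, 0), (INF, 10)]
capped = histogram_quantile(0.95, all_slow)
```

With the first data set the estimate is about 48.75 seconds, below the rule's 60 second threshold. With the second it is capped at 120, the largest finite bucket, however slow the requests really were. This is why the bucket boundaries should bracket the alert threshold.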

Logging uses structured JSON for easier parsing. Include trace IDs in every log line for request correlation:

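import json
import logging
from contextvars import ContextVar

trace_id: ContextVar[str] = ContextVar('trace_id', default='no-trace')

# Attributes every LogRecord carries; anything else was passed via `extra=`.
_STANDARD_ATTRS = set(vars(logging.LogRecord('', 0, '', 0, '', (), None)))
_STANDARD_ATTRS |= {'message', 'asctime'}

class StructuredFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'message': record.getMessage(),
            'trace_id': trace_id.get(),
            'service': 'backend'
        }
        # logging copies the keys of `extra={...}` onto the record itself,
        # so collect every attribute that a plain LogRecord does not have.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                log_data[key] = value
        if record.exc_info:
            log_data['exception'] = self.formatException(record.exc_info)
        # default=str keeps non-JSON-serializable values from raising.
        return json.dumps(log_data, default=str)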
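
A `ContextVar` is what keeps trace IDs from leaking between concurrent requests: each asyncio task works on its own copy of the context. The sketch below demonstrates this with two interleaved tasks. `TraceJsonFormatter` is a cut-down stand-in for the formatter above, and `handle_request` and the `req-a`/`req-b` IDs are hypothetical; a real backend would more likely set the ID in middleware, for example from a request header.

```python
import asyncio
import io
import json
import logging
from contextvars import ContextVar

trace_id: ContextVar[str] = ContextVar('trace_id', default='no-trace')


class TraceJsonFormatter(logging.Formatter):
    """Cut-down stand-in for the chapter's StructuredFormatter."""

    def format(self, record):
        return json.dumps({
            'level': record.levelname,
            'message': record.getMessage(),
            'trace_id': trace_id.get(),
        })


stream = io.StringIO()  # captured in memory so the result is easy to inspect
handler = logging.StreamHandler(stream)
handler.setFormatter(TraceJsonFormatter())
logger = logging.getLogger('trace_demo')
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


async def handle_request(request_id: str) -> None:
    # Each asyncio task runs in its own copy of the context, so this set()
    # is invisible to the other concurrent request.
    trace_id.set(request_id)
    logger.info('request started')
    await asyncio.sleep(0)  # yield so the two tasks interleave
    logger.info('request finished')


async def main() -> None:
    await asyncio.gather(handle_request('req-a'), handle_request('req-b'))


asyncio.run(main())

# One parsed JSON object per log line, in the order they were written.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Each of the four records carries the ID of the request that wrote it, even though the tasks interleave, and the value outside the tasks remains the default `no-trace`.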
EXERCISE

Set up Prometheus, Grafana, and a dashboard for the AI application. Add alerts for model server downtime and high latency.
