Monitoring AI Services — Local AI on Linux (Chapter 14)

Production AI services need monitoring beyond nvidia-smi. You need request latency, token throughput, GPU utilization over time, and alert thresholds.

Prometheus metrics from llama.cpp server (if built with --server and metrics enabled):

# Check if metrics endpoint exists
curl http://localhost:8080/metrics
# # HELP llama_compute_time_total Total time spent computing in nanoseconds
# # TYPE llama_compute_time_total counter
# llama_compute_time_total 1.23e+12

Expose node-level metrics with node_exporter:

docker run -d \
  --name node-exporter \
  --network host \
  prom/node-exporter \
  --collector.disable-defaults \
  --collector.cuda \
  --collector.meminfo \
  --collector.cpu

# Test
curl http://localhost:9100/metrics | grep -E 'nvidia_gpu|node_memory'

Grafana dashboard for visualization:

docker run -d \
  --name grafana \
  --network host \
  -e GF_SECURITY_ADMIN_PASSWORD=change-me \
  grafana/grafana

Configure Prometheus as a data source in Grafana at http://localhost:3000. Import the CUDA dashboard from grafana.com/dashboards (search "NVIDIA GPU").

Alerting with Prometheus Alertmanager:

# alert_rules.yml
groups:
  - name: ai-alerts
    rules:
      - alert: GPUTemperatureHigh
        expr: nvidia_gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature above 85°C for 5 minutes"

      - alert: GPUMemoryHigh
        expr: (nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage above 95%"

Shell-based monitoring when you do not want to run the full Prometheus stack:

#!/bin/bash
# gpu_monitor.sh
while true; do
  TEMP=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
  POWER=$(nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits)
  UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
  echo "$(date '+%Y-%m-%d %H:%M:%S') GPU: ${TEMP}°C | ${POWER}W | ${UTIL}% util"
  sleep 10
done

Failure mode: Prometheus scrapes return no CUDA metrics. node_exporter needs the --collector.cuda flag and the nvidia-smi binary to be in PATH. Check /var/log/syslog for node_exporter errors. Or the GPU is not visible to the node_exporter process because it runs in a container without --gpus all.

Failure mode: Grafana dashboard shows No data. The Prometheus data source URL is http://localhost:9090 but Grafana runs in a container on a different network. Use --network host for Grafana or set the data source URL to the container's host IP.