14. Monitoring AI Services
Production AI services need monitoring beyond nvidia-smi. You need request latency, token throughput, GPU utilization over time, and alert thresholds.
Prometheus metrics from the llama.cpp server (available only if the server was started with the --metrics flag):
# Check if metrics endpoint exists
curl http://localhost:8080/metrics
# Illustrative output; metric names and HELP text vary by llama.cpp version:
# # HELP llama_compute_time_total Total time spent computing in nanoseconds
# # TYPE llama_compute_time_total counter
# llama_compute_time_total 1.23e+12
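To pull a single number out of the metrics endpoint without running Prometheus, the text exposition format can be filtered with awk. This is a minimal sketch: metric_value is a hypothetical helper name, and it only matches samples without labels (lines of the form "name value").

```shell
#!/bin/bash
# metric_value NAME: read Prometheus text format on stdin and print the value
# of the first unlabeled sample called NAME. Comment lines (# HELP, # TYPE)
# never match because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Usage (assumes the server from this section is listening on port 8080):
#   curl -s http://localhost:8080/metrics | metric_value llama_compute_time_total
```

Logging this value at two points in time and dividing the difference by the interval gives a rate, which is what Prometheus does for counters.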
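Expose node-level metrics (CPU, memory) with node_exporter. It has no GPU collector, and an unknown flag such as --collector.cuda makes it exit at startup, so GPU metrics like the nvidia_gpu_* series used in the alert rules below must come from a separate GPU exporter such as NVIDIA's dcgm-exporter. Metric names differ between exporters.
docker run -d \
--name node-exporter \
--network host \
prom/node-exporter \
--collector.disable-defaults \
--collector.meminfo \
--collector.cpu
# Test
curl -s http://localhost:9100/metrics | grep -E 'node_memory|node_cpu'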
Grafana dashboard for visualization:
docker run -d \
--name grafana \
--network host \
-e GF_SECURITY_ADMIN_PASSWORD=change-me \
grafana/grafana
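Configure Prometheus as a data source in Grafana at http://localhost:3000 (log in as admin with the password set above; Prometheus listens on http://localhost:9090 by default). Import a community NVIDIA GPU dashboard from grafana.com/grafana/dashboards (search "NVIDIA GPU") and pick one that matches the metric names your GPU exporter exposes.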
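The data source can also be provisioned from a file instead of being set up by hand in the UI. The sketch below prints such a file. grafana_datasource is a hypothetical helper name, and the URL argument is whatever address Prometheus has as seen from the Grafana process.

```shell
#!/bin/bash
# grafana_datasource [URL]: print a Grafana data source provisioning file.
# URL defaults to http://localhost:9090, which is only correct when Grafana
# shares the host network with Prometheus.
grafana_datasource() {
  local url="${1:-http://localhost:9090}"
  cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}

# Usage:
#   mkdir -p provisioning/datasources
#   grafana_datasource > provisioning/datasources/prometheus.yml
#   then add this mount to the Grafana docker run command:
#     -v "$PWD/provisioning/datasources:/etc/grafana/provisioning/datasources:ro"
```

Keeping the URL as a parameter makes it easy to switch between localhost and the Docker host's IP address if Grafana is later moved off the host network.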
Alerting rules (Prometheus evaluates them and forwards firing alerts to Alertmanager):
# alert_rules.yml
groups:
  - name: ai-alerts
    rules:
      - alert: GPUTemperatureHigh
        expr: nvidia_gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature above 85°C for 5 minutes"
      - alert: GPUMemoryHigh
        expr: (nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage above 95%"
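The rules only take effect once Prometheus loads them and scrapes the exporters. The following is a minimal sketch of a matching prometheus.yml, printed by a hypothetical helper. The node_exporter and llama.cpp ports come from this section; GPU_EXPORTER_TARGET and its default of localhost:9400 are assumptions that must be adjusted to whichever GPU exporter you run.

```shell
#!/bin/bash
# prometheus_config: print a minimal prometheus.yml to stdout.
# GPU_EXPORTER_TARGET is a hypothetical placeholder (host:port of your GPU exporter).
prometheus_config() {
  local gpu_target="${GPU_EXPORTER_TARGET:-localhost:9400}"
  cat <<EOF
global:
  scrape_interval: 15s
rule_files:
  - /etc/prometheus/alert_rules.yml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
  - job_name: llama
    static_configs:
      - targets: ['localhost:8080']
  - job_name: gpu
    static_configs:
      - targets: ['${gpu_target}']
EOF
}

# Usage:
#   prometheus_config > prometheus.yml
#   docker run -d --name prometheus --network host \
#     -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
#     -v "$PWD/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro" \
#     prom/prometheus
```

Once Prometheus is running, http://localhost:9090/alerts lists each rule and whether it is inactive, pending, or firing. Routing notifications to Alertmanager needs an additional alerting block that this sketch leaves out.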
Shell-based monitoring when you do not want to run the full Prometheus stack:
#!/bin/bash
# gpu_monitor.sh: log temperature, power draw, and utilization every 10 seconds
while true; do
  # One nvidia-smi call for all fields; it prints one CSV line per GPU
  nvidia-smi --query-gpu=index,temperature.gpu,power.draw,utilization.gpu \
    --format=csv,noheader,nounits |
  while IFS=', ' read -r IDX TEMP POWER UTIL; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') GPU${IDX}: ${TEMP}°C | ${POWER}W | ${UTIL}% util"
  done
  sleep 10
done
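A shell monitor can also approximate the temperature alert without Prometheus. The sketch below is one way to do it. over_threshold and check_gpu_line are hypothetical helper names, TEMP_LIMIT is an assumed environment variable defaulting to 85, and each input line is assumed to come from nvidia-smi --query-gpu=index,temperature.gpu,power.draw,utilization.gpu --format=csv,noheader,nounits.

```shell
#!/bin/bash
# over_threshold VALUE LIMIT: succeed (exit 0) when VALUE is numerically
# greater than LIMIT. awk handles decimals, which plain [ -gt ] does not.
over_threshold() {
  awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v + 0 > limit + 0) }'
}

# check_gpu_line "IDX, TEMP, POWER, UTIL": print a warning if the GPU is too hot.
check_gpu_line() {
  local idx temp power util
  local limit="${TEMP_LIMIT:-85}"
  IFS=', ' read -r idx temp power util <<< "$1"
  if over_threshold "$temp" "$limit"; then
    echo "WARNING GPU${idx} temperature ${temp}°C exceeds ${limit}°C"
  fi
}

# Usage:
#   nvidia-smi --query-gpu=index,temperature.gpu,power.draw,utilization.gpu \
#     --format=csv,noheader,nounits | while read -r line; do check_gpu_line "$line"; done
```

Unlike the Prometheus rule, this fires on a single hot reading rather than after five sustained minutes, so expect it to be noisier.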
Failure mode: Prometheus scrapes return no GPU metrics. node_exporter has no GPU collector, so these metrics must come from a separate GPU exporter that Prometheus also scrapes. If that exporter shells out to nvidia-smi, the binary must be in its PATH. If it runs in a container, it needs --gpus all or the GPU is not visible to the process. Check the exporter's logs (docker logs for a container, /var/log/syslog for a host service) for errors.
Failure mode: Grafana dashboard shows No data. The Prometheus data source URL is http://localhost:9090, but Grafana runs in a container on its own bridge network, where localhost refers to the container itself rather than the host. Run Grafana with --network host, or set the data source URL to the Docker host's IP address.
Run node_exporter and a GPU exporter, scrape both with Prometheus, add Prometheus as a Grafana data source, create a simple dashboard with GPU temperature, utilization, and power draw, and add an alert rule that fires when GPU temperature stays above 85°C.