RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Local AI on Linux

14. Monitoring AI Services

Chapter 14 of 15 · 20 min
KEY INSIGHT

GPU temperature, memory utilization, and inference latency form the minimum viable monitoring stack—you cannot optimize what you do not measure.

Production AI services need monitoring beyond nvidia-smi. You need request latency, token throughput, GPU utilization over time, and alert thresholds.
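
Request latency can be sampled without any extra tooling. The sketch below is one way to do it, assuming curl is installed; probe_latency is a hypothetical helper name, and the URL you pass is your own choice. It measures the full HTTP round trip, so probing a lightweight endpoint shows service responsiveness rather than inference time.

```shell
#!/bin/bash
# probe_latency URL [COUNT]: time COUNT sequential GET requests with curl and
# print the min, mean, and max total time in seconds.
# (Hypothetical helper for illustration; not part of any tool in this chapter.)
probe_latency() {
  local url=$1 count=${2:-5} i=0
  while [ "$i" -lt "$count" ]; do
    # -w prints curl's own timing; the response body is discarded
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
    i=$((i + 1))
  done | awk '
    { t = $1 + 0 }                          # force numeric comparison
    NR == 1 { min = t; max = t }
    { sum += t; if (t < min) min = t; if (t > max) max = t }
    END { if (NR) printf "min=%.3f mean=%.3f max=%.3f\n", min, sum / NR, max }'
}

# Example (not run here): probe_latency http://localhost:8080/health 10
```

To approximate inference latency instead, the same timing flag can be applied to a POST against your server's completion endpoint with a fixed prompt.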

Prometheus metrics from the llama.cpp server (llama-server), available when it is started with the --metrics flag:

# Check that the metrics endpoint responds
curl -s http://localhost:8080/metrics
# Example output (metric names and values vary by llama.cpp version):
# # HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# # TYPE llamacpp:prompt_tokens_total counter
# llamacpp:prompt_tokens_total 1234
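
Counters like the one above only become useful as rates. As a rough sketch of what Prometheus does when it computes a rate, the helpers below pull one counter out of the text output and divide the increase by the elapsed time. Both function names are hypothetical, and the parser assumes samples carry no trailing timestamp.

```shell
#!/bin/bash
# metric_value NAME: read Prometheus text format on stdin and print the value
# of the first sample whose metric name is exactly NAME. Labels are ignored,
# and samples are assumed to carry no trailing timestamp.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                    # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)       # drop a {label="..."} suffix if present
      if (metric == name) { print $NF; exit }
    }'
}

# counter_rate V1 V2 SECONDS: per-second increase between two counter samples
counter_rate() {
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN { printf "%.2f\n", (b - a) / dt }'
}

# Example (not run here); NAME is whichever counter your /metrics output shows:
#   v1=$(curl -s http://localhost:8080/metrics | metric_value "$NAME"); sleep 10
#   v2=$(curl -s http://localhost:8080/metrics | metric_value "$NAME")
#   counter_rate "$v1" "$v2" 10
```

This is only for spot checks. Once Prometheus is scraping the endpoint, its own rate() function handles counter resets and irregular scrape intervals.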

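node_exporter exposes node-level CPU and memory metrics. It has no GPU collector (there is no --collector.cuda flag), so GPU metrics must come from a separate exporter such as NVIDIA's dcgm-exporter, scraped as its own Prometheus target. Run node_exporter with:

docker run -d \
  --name node-exporter \
  --network host \
  prom/node-exporter \
  --collector.disable-defaults \
  --collector.meminfo \
  --collector.cpu

# Test
curl -s http://localhost:9100/metrics | grep -E 'node_cpu|node_memory'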
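
For these endpoints to reach Grafana, a Prometheus server has to scrape them. The following is a minimal sketch, assuming the llama.cpp server on port 8080 and node_exporter on port 9100. The helper name, job names, and 15s interval are illustrative choices, not values taken from this chapter.

```shell
#!/bin/bash
# write_prometheus_config DIR: write a minimal prometheus.yml into DIR.
# Job names, ports, and the 15s interval are illustrative assumptions.
write_prometheus_config() {
  cat > "$1/prometheus.yml" <<'EOF'
global:
  scrape_interval: 15s

rule_files:
  - alert_rules.yml

scrape_configs:
  - job_name: llama-server
    static_configs:
      - targets: ['localhost:8080']
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
EOF
}

# To start Prometheus with this config (not run here; alert_rules.yml must
# already exist in the current directory before it is mounted):
#   write_prometheus_config "$PWD"
#   docker run -d --name prometheus --network host \
#     -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
#     -v "$PWD/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro" \
#     prom/prometheus
```

If you run a GPU exporter, add it as a third job with its own target. The Targets page at http://localhost:9090/targets shows whether each scrape succeeds.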

Grafana dashboard for visualization:

docker run -d \
  --name grafana \
  --network host \
  -e GF_SECURITY_ADMIN_PASSWORD=change-me \
  grafana/grafana

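Open Grafana at http://localhost:3000, log in as admin with the password set above, and add Prometheus (http://localhost:9090) as a data source. Import a GPU dashboard from grafana.com/grafana/dashboards (search "NVIDIA GPU") that matches the GPU exporter you run.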

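Alerting rules, evaluated by Prometheus and routed to notifications by Alertmanager:

# alert_rules.yml
# Metric names depend on the GPU exporter. The nvidia_gpu_* names below follow
# nvidia_gpu_prometheus_exporter; dcgm-exporter exposes DCGM_FI_DEV_GPU_TEMP instead.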
groups:
  - name: ai-alerts
    rules:
      - alert: GPUTemperatureHigh
        expr: nvidia_gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature above 85°C for 5 minutes"

      - alert: GPUMemoryHigh
        expr: (nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage above 95%"
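
A rules file with a syntax error can stop Prometheus from loading its configuration, so it is worth validating before a reload. One way to do that, sketched below, is to run promtool from the prom/prometheus image. check_rules is a hypothetical wrapper name, and the /rules mount point is an arbitrary choice.

```shell
#!/bin/bash
# check_rules FILE: validate an alerting rules file with promtool, using the
# copy shipped in the prom/prometheus image so nothing is installed locally.
# (Hypothetical wrapper; the /rules mount point is an arbitrary choice.)
check_rules() {
  local file dir
  file=$(basename "$1")
  dir=$(cd "$(dirname "$1")" && pwd)   # docker -v needs an absolute path
  docker run --rm -v "$dir:/rules:ro" --entrypoint promtool \
    prom/prometheus check rules "/rules/$file"
}

# Example (not run here): check_rules ./alert_rules.yml
```

promtool only checks syntax and expression validity. It does not confirm that the metric names in the expressions exist on your exporter.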

Shell-based monitoring when you do not want to run the full Prometheus stack:

#!/bin/bash
# gpu_monitor.sh: log temperature, power draw, and utilization every 10 seconds
while true; do
  # One query for all fields; nvidia-smi prints one CSV line per GPU
  nvidia-smi --query-gpu=index,temperature.gpu,power.draw,utilization.gpu \
             --format=csv,noheader,nounits |
  while IFS=', ' read -r IDX TEMP POWER UTIL; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') GPU${IDX}: ${TEMP}°C | ${POWER}W | ${UTIL}% util"
  done
  sleep 10
done
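
The shell approach can also approximate the "for: 5m" behaviour of the temperature alert by counting consecutive samples over the threshold. This is a sketch under stated assumptions: check_temps, THRESHOLD, and REQUIRED are hypothetical names, and the input is one integer temperature per line.

```shell
#!/bin/bash
# Hypothetical defaults mirroring the GPUTemperatureHigh rule: 85 °C for 5 minutes
THRESHOLD=${THRESHOLD:-85}
REQUIRED=${REQUIRED:-30}   # 30 samples x 10 s between samples = 5 minutes

# check_temps: read one integer temperature per line on stdin and print ALERT
# once the threshold has been exceeded for REQUIRED consecutive samples.
check_temps() {
  local over=0 temp
  while read -r temp; do
    if [ "$temp" -gt "$THRESHOLD" ]; then
      over=$((over + 1))
    else
      over=0                 # any sample at or below the threshold resets the count
    fi
    if [ "$over" -eq "$REQUIRED" ]; then
      echo "ALERT: GPU above ${THRESHOLD}°C for ${REQUIRED} consecutive samples"
    fi
  done
}

# Example for GPU 0 (not run here):
#   while true; do
#     nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader,nounits
#     sleep 10
#   done | check_temps
```

The alert prints once per sustained episode, because the count only equals REQUIRED on a single sample. Replace the echo with a mail or webhook call if you want a notification.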

Failure mode: Prometheus scrapes return no GPU metrics. node_exporter has no GPU collector, so GPU metrics must come from a separate exporter such as NVIDIA's dcgm-exporter, added as its own scrape target. If that exporter runs in a container, start it with --gpus all so the GPU is visible to the process, and confirm that nvidia-smi works in the same environment. Check the exporter's output with docker logs for startup errors.

Failure mode: Grafana dashboard shows No data. The Prometheus data source URL is http://localhost:9090, but Grafana runs in a container on a different network, where localhost refers to the Grafana container itself. Run Grafana with --network host, or set the data source URL to the Docker host's IP address.
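
When a dashboard is empty, it helps to find out which link in the chain is broken. The sketch below probes each service from the host, assuming the default ports used in this chapter. check_endpoint and check_stack are hypothetical helper names.

```shell
#!/bin/bash
# check_endpoint NAME URL: print "NAME up" if URL answers with an HTTP success
# status within 3 seconds, otherwise "NAME DOWN".
check_endpoint() {
  if curl -sf -m 3 -o /dev/null "$2"; then
    echo "$1 up"
  else
    echo "$1 DOWN"
  fi
}

# check_stack: probe each service at the ports assumed in this chapter
check_stack() {
  check_endpoint llama-server  http://localhost:8080/metrics
  check_endpoint node-exporter http://localhost:9100/metrics
  check_endpoint prometheus    http://localhost:9090/-/healthy
  check_endpoint grafana       http://localhost:3000/api/health
}

# Example (not run here): check_stack
```

A service that reports up from the host can still be unreachable from inside another container, which is the situation described in the Grafana failure mode.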

EXERCISE

Run node_exporter and a GPU exporter, scrape both with Prometheus, and add Prometheus as a data source in Grafana. Create a simple dashboard with GPU temperature, utilization, and power draw, then add an alert rule that fires when temperature stays above 85°C.
