14. Cluster Monitoring

Chapter 14 of 18 · 20 min

Observability across GPU utilization, memory pressure, inference latency, and node health enables capacity planning and failure prevention.

Prometheus Stack Installation

kube-prometheus-stack provides Prometheus, Alertmanager, and Grafana:

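helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi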

Access Grafana with port forwarding:

# Read the admin password first (chart default: admin / prom-operator)
kubectl get secret prometheus-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 -d
# Then forward Grafana to http://localhost:3000 (blocks until interrupted)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

GPU Metrics Collection

Enable DCGM metrics export alongside kube-prometheus-stack:

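# Install DCGM exporter as a DaemonSet
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.namespace=monitoring \
  --set serviceMonitor.additionalLabels.release=prometheus  # matched by the stack's default ServiceMonitor selector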

Key dashboard panels include DCGM_FI_DEV_GPU_UTIL for GPU utilization (percent), DCGM_FI_DEV_FB_USED for frame buffer (GPU memory) usage in MiB, and DCGM_FI_DEV_GPU_TEMP for GPU temperature in degrees Celsius.
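
The same metrics can be read outside Grafana through the Prometheus HTTP API. The following is a minimal sketch, not a definitive client: the helper names (build_query_url, parse_vector, query_gpu_util) are hypothetical, and the base URL assumes Prometheus has been port-forwarded to localhost:9090.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload: dict) -> list[tuple[dict, float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [unix_ts, "value-as-string"]}
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query_gpu_util(base_url: str) -> list[tuple[dict, float]]:
    """Fetch current per-GPU utilization (requires a reachable Prometheus)."""
    url = build_query_url(base_url, "DCGM_FI_DEV_GPU_UTIL")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_vector(json.load(resp))
```

With a port-forward such as kubectl port-forward -n monitoring svc/prometheus-operated 9090 in place (the service name is an assumption about this install), query_gpu_util("http://localhost:9090") should return one (labels, value) pair per GPU.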

Custom Inference Metrics

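Expose application metrics via the Prometheus Python client library:

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter('inference_requests_total', 'Total inference requests', ['model', 'status'])
REQUEST_LATENCY = Histogram('inference_request_seconds', 'Request latency', ['model'])

# Serve /metrics once at startup (port 8000 is an assumption; expose it as the pod's "metrics" port)
start_http_server(8000)

# In inference endpoint: time the call, then record the outcome
def handle_request(model, prompt, model_name='llama-3-8b'):
    try:
        with REQUEST_LATENCY.labels(model=model_name).time():
            result = model.generate(prompt)
    except Exception:
        REQUEST_COUNT.labels(model=model_name, status='error').inc()
        raise
    REQUEST_COUNT.labels(model=model_name, status='success').inc()
    return result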

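Scrape these metrics by selecting the inference pods with a PodMonitor (or their Service with a ServiceMonitor). The port field refers to a named container port, so the pod spec must declare a port called metrics: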

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: inference-monitor
  labels:
    release: prometheus  # matched by kube-prometheus-stack's default PodMonitor selector
spec:
  selector:
    matchLabels:
      app: llama-inference
  podMetricsEndpoints:
  - port: metrics
    interval: 15s
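
Once the latency histogram is scraped, percentiles are usually estimated from its cumulative buckets rather than from raw samples. The sketch below illustrates the linear-interpolation idea behind PromQL's histogram_quantile(). It is a simplified approximation with hypothetical bucket counts, not a reimplementation of every edge case Prometheus handles.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Overflow or empty bucket: fall back to the last known bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for inference_request_seconds_bucket
latency_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = histogram_quantile(0.5, latency_buckets)
p95 = histogram_quantile(0.95, latency_buckets)
```

Here the p95 lands in the +Inf bucket, so the estimate is capped at the highest finite bound (1.0 s). When that happens with real data, it suggests the bucket boundaries are too low for the workload; prometheus_client accepts a buckets argument on Histogram to widen them.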

Alerting Configuration

Define alert rules on GPU utilization patterns (Alertmanager then routes the resulting alerts by label, such as severity):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  labels:
    release: prometheus  # matched by kube-prometheus-stack's default rule selector
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUUnderutilized
      expr: DCGM_FI_DEV_GPU_UTIL < 20
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "GPU utilization below 20% for 30 minutes"
        description: "Node {{ $labels.instance }} has low GPU utilization; consider consolidating workloads"
    - alert: GPUUtilizationHigh
      expr: DCGM_FI_DEV_GPU_UTIL > 95
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU utilization above 95%"
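
The for field in the rules above means the expression must hold at every evaluation for the whole duration before the alert fires; until then the alert is pending, and a single sample that breaks the condition resets the timer. The following sketch simulates that state machine on hypothetical utilization samples. It is an illustration of the semantics under those assumptions, not Prometheus's actual rule evaluator.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float      # seconds since the start of the series
    value: float  # e.g. DCGM_FI_DEV_GPU_UTIL in percent


def alert_states(samples, predicate, for_seconds):
    """Yield (t, state) where state is 'inactive', 'pending', or 'firing'."""
    active_since = None
    for s in samples:
        if not predicate(s.value):
            active_since = None      # condition broke: the timer resets
            state = "inactive"
        else:
            if active_since is None:
                active_since = s.t   # first evaluation where the condition holds
            held_for = s.t - active_since
            state = "firing" if held_for >= for_seconds else "pending"
        yield s.t, state


# One hypothetical sample every 10 minutes, checked against "< 20 for 30m"
samples = [Sample(t=i * 600, value=v) for i, v in enumerate([15, 12, 18, 10, 50, 5])]
states = [state for _, state in alert_states(samples, lambda v: v < 20, for_seconds=1800)]
print(states)
```

In this run the alert stays pending for the first three samples, fires at the 30-minute mark, goes inactive when utilization jumps to 50%, and starts a fresh pending period on the final sample.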

EXERCISE

Install kube-prometheus-stack and enable DCGM metrics, then load a model and run inference while watching the Grafana dashboard. Create a custom alert that fires when GPU utilization drops below 10%.