14. Cluster Monitoring
Observability across GPU utilization, memory pressure, inference latency, and node health enables capacity planning and failure prevention.
Prometheus Stack Installation
kube-prometheus-stack provides Prometheus, Alertmanager, and Grafana:
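helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi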
Access Grafana with port forwarding:
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Default credentials are admin / prom-operator unless overridden; read the actual password from the secret:
kubectl get secret prometheus-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 -d
GPU Metrics Collection
Enable DCGM metrics export alongside kube-prometheus-stack:
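# Install the DCGM exporter as a DaemonSet
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
# The release label lets a default kube-prometheus-stack install select the ServiceMonitor
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.additionalLabels.release=prometheus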
Key dashboard panels include DCGM_FI_DEV_GPU_UTIL for GPU utilization (percent), DCGM_FI_DEV_FB_USED for frame buffer memory in use (MiB), and DCGM_FI_DEV_GPU_TEMP for GPU temperature (°C).
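To read these series outside Grafana, the Prometheus HTTP API can be queried directly. The sketch below is a minimal, standard-library-only illustration of an instant query against /api/v1/query. The helper names (build_query_url, parse_instant_vector, fetch) are hypothetical, and the base URL and the gpu label used in the usage note are assumptions that depend on how Prometheus is exposed and on the exporter version.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [timestamp, "value as string"]}
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def fetch(base_url: str, promql: str, timeout: float = 5.0) -> list[tuple[dict, float]]:
    """Run the query against a live Prometheus (needs network access to it)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

With Prometheus reachable locally (for example through kubectl port-forward -n monitoring svc/prometheus-operated 9090), a call such as fetch("http://localhost:9090", "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)") should return one (labels, value) pair per GPU index.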
Custom Inference Metrics
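Expose application metrics with the Prometheus Python client library (prometheus_client):
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter('inference_requests_total', 'Total inference requests', ['model', 'status'])
REQUEST_LATENCY = Histogram('inference_request_seconds', 'Request latency', ['model'])
# Serve /metrics once at startup (port 8000 is an example; expose it as the container port named "metrics")
start_http_server(8000)
# In inference endpoint: time the call, then count it by outcome
try:
    with REQUEST_LATENCY.labels(model='llama-3-8b').time():
        result = model.generate(prompt)
    REQUEST_COUNT.labels(model='llama-3-8b', status='success').inc()
except Exception:
    REQUEST_COUNT.labels(model='llama-3-8b', status='error').inc()
    raise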
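A Histogram stores observations as cumulative bucket counts, and the client library's default buckets top out at 10 seconds, which long generations may exceed. The following standard-library sketch mimics that bucketing to show the effect. LLM_BUCKETS is a hypothetical choice (not taken from the source) that would be passed as buckets= when constructing the Histogram, and the latencies are made-up values.

```python
import math

# Default upper bounds used by prometheus_client.Histogram (seconds).
DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75,
                   1.0, 2.5, 5.0, 7.5, 10.0, math.inf)
# Hypothetical bounds for multi-second LLM generations (an assumption).
LLM_BUCKETS = (0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 60.0, math.inf)


def cumulative_counts(observations, buckets):
    """Return {upper_bound: number of observations <= upper_bound}, like the `le` series."""
    return {le: sum(1 for v in observations if v <= le) for le in buckets}


latencies = [0.8, 3.2, 12.0, 18.5, 42.0]  # made-up request durations in seconds

default_view = cumulative_counts(latencies, DEFAULT_BUCKETS)
llm_view = cumulative_counts(latencies, LLM_BUCKETS)

# With the defaults, everything slower than 10 s is only visible in +Inf.
print(default_view[10.0], default_view[math.inf])  # prints: 2 5
# With wider bounds, the slow requests land in distinct buckets.
print(llm_view[20.0], llm_view[60.0])  # prints: 4 5
```

Because quantile estimates are interpolated within buckets, bounds that bracket the latencies actually observed tend to give more useful p95/p99 panels than the defaults.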
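Scrape these metrics by adding the pod endpoint to a PodMonitor or ServiceMonitor. The port value must match a named container port on the pod:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: inference-monitor
  labels:
    release: prometheus  # matches the Helm release name so the default selector picks it up
spec:
  selector:
    matchLabels:
      app: llama-inference
  podMetricsEndpoints:
    - port: metrics
      interval: 15s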
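A PodMonitor only produces scrape targets for pods whose labels contain every matchLabels pair and that declare a container port with the referenced name. The sketch below models that equality-based matching in plain Python as a pre-deployment sanity check. The pod dictionaries and the helper is_scraped are hypothetical and simplified: matchExpressions and namespace selection are not modelled.

```python
def matches_labels(pod_labels: dict, match_labels: dict) -> bool:
    """Equality-based selection: every selector pair must be present on the pod."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


def is_scraped(pod: dict, match_labels: dict, port_name: str) -> bool:
    """True if the pod is selected and exposes a container port with the given name."""
    port_names = {p.get("name") for c in pod["containers"] for p in c.get("ports", [])}
    return matches_labels(pod["labels"], match_labels) and port_name in port_names


selector = {"app": "llama-inference"}

# Extra labels on the pod do not prevent a match.
good_pod = {"labels": {"app": "llama-inference", "tier": "gpu"},
            "containers": [{"ports": [{"name": "metrics", "containerPort": 8000}]}]}
# Same app label, but the port is unnamed, so `port: metrics` cannot resolve.
unnamed_port_pod = {"labels": {"app": "llama-inference"},
                    "containers": [{"ports": [{"containerPort": 8000}]}]}

print(is_scraped(good_pod, selector, "metrics"))          # True
print(is_scraped(unnamed_port_pod, selector, "metrics"))  # False
```

A missing port name is a common reason for a monitor that applies cleanly yet yields no targets, so it is worth checking before looking at Prometheus itself.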
Alerting Configuration
Define alerting rules for GPU utilization patterns (routing of the resulting alerts is configured separately in Alertmanager):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  labels:
    release: prometheus  # matches the Helm release name so the default rule selector picks it up
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUUnderutilized
          expr: DCGM_FI_DEV_GPU_UTIL < 20
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU utilization below 20% for 30 minutes"
            description: "Node {{ $labels.instance }} GPU utilization is low; consider consolidation"
        - alert: GPUUtilizationHigh
          expr: DCGM_FI_DEV_GPU_UTIL > 95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "GPU utilization above 95% for 5 minutes"
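The for: field delays firing until the expression has been true at every evaluation for the stated duration, and a single evaluation where it is false resets the timer. The sketch below is a simplified model of that pending/firing behaviour, not Prometheus's actual implementation. The alert_states helper and the sample values are hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Yield (timestamp, state) for an `expr < threshold` rule with a `for:` duration.

    samples: iterable of (timestamp_seconds, value), one per rule evaluation.
    """
    active_since = None
    for ts, value in samples:
        if value < threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            active_since = None  # any false evaluation resets the timer
            state = "inactive"
        yield ts, state


# One evaluation every 10 minutes; utilization in percent (made-up values).
samples = [(0, 15), (600, 12), (1200, 35), (1800, 10), (2400, 8), (3000, 5), (3600, 4)]
states = list(alert_states(samples, threshold=20, for_seconds=30 * 60))
for ts, state in states:
    print(f"t={ts // 60:>2}m {state}")
```

In this run the brief recovery at the 20-minute mark resets the timer, so the alert stays pending until the 60-minute evaluation. This is why a long for: window suppresses alerts on workloads that dip only momentarily.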
Install kube-prometheus-stack, enable DCGM metrics, then load a model and run inference while watching the Grafana dashboard. Create a custom alert that fires when GPU utilization drops below 10%.