How to implement GPU-based autoscaling with custom metrics
Kubernetes with GPU nodes, custom metrics API
What this does
This guide implements Kubernetes autoscaling driven by GPU-specific metrics — utilization percentage, memory pressure, and inference throughput — rather than CPU or memory alone. GPU metrics from DCGM or nvidia-ml-py are exposed via Prometheus, registered with the Kubernetes custom metrics API, and consumed by the HPA. This allows the cluster to scale inference pods when GPU saturation is detected, avoiding the scenario where CPU is low but GPU is at 100% and requests are piling up.
Steps
Verify DCGM metrics are available in Prometheus:
    curl -s "http://prometheus:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL" | jq -c '.data.result[] | {pod: .metric.pod, value: .value[1]}'
Expected output: one compact JSON object per pod with its GPU utilization percentage, e.g.,
    {"pod":"vllm-abc","value":"87"}
Create a Prometheus recording rule that aggregates GPU utilization per Deployment:
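    groups:
      - name: gpu_scaling
        rules:
          - record: deployment:gpu_utilization_avg
            expr: |
              avg by (namespace, deployment) (
                DCGM_FI_DEV_GPU_UTIL
              )
This rule assumes the series carry namespace and deployment labels, for example added through relabeling. DCGM Exporter on its own reports pod-level labels and does not add a deployment label.
Register GPU throughput as a custom metric. Add a recording rule for tokens per second: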
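          - record: deployment:inference_tokens_per_sec
            expr: |
              sum by (namespace, deployment) (
                rate(vllm:generation_tokens_total[1m])
              )
Configure the Prometheus adapter to expose these custom metrics. Recording-rule names contain a colon, so each rule renames its series to the underscore form that the HPA references. The namespace override assumes the recorded series retain a namespace label. In adapter-config.yml:
    rules:
      custom:
        - seriesQuery: 'deployment:gpu_utilization_avg'
          resources:
            overrides:
              namespace: {resource: "namespace"}
              deployment: {group: "apps", resource: "deployment"}
          name:
            matches: "^deployment:(.*)$"
            as: "deployment_${1}"
          metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
        - seriesQuery: 'deployment:inference_tokens_per_sec'
          resources:
            overrides:
              namespace: {resource: "namespace"}
              deployment: {group: "apps", resource: "deployment"}
          name:
            matches: "^deployment:(.*)$"
            as: "deployment_${1}"
          metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
Apply: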
    helm upgrade prometheus-adapter prometheus-community/prometheus-adapter -f adapter-config.yml
Verify the custom metrics are visible to Kubernetes:
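    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/deployments.apps/ai-inference/deployment_gpu_utilization_avg" | jq '.'
Expected output: JSON containing the metric name and current value. The resource segment uses the resource.group form (deployments.apps) because Deployments belong to the apps API group.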
Create an HPA that scales on both GPU utilization and inference throughput:
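    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: ai-inference-gpu-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: ai-inference
      minReplicas: 1
      maxReplicas: 8
      metrics:
        # Object metrics: both series are attached to the Deployment, not to individual pods.
        - type: Object
          object:
            metric:
              name: deployment_gpu_utilization_avg
            describedObject:
              apiVersion: apps/v1
              kind: Deployment
              name: ai-inference
            target:
              type: Value          # the metric is already an average, so compare it directly
              value: "75"
        - type: Object
          object:
            metric:
              name: deployment_inference_tokens_per_sec
            describedObject:
              apiVersion: apps/v1
              kind: Deployment
              name: ai-inference
            target:
              type: AverageValue   # Deployment-wide total divided by the replica count
              averageValue: "500"
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            - type: Pods
              value: 1
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300
Apply and observe the HPA: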
    kubectl apply -f gpu-hpa.yaml
    kubectl get hpa ai-inference-gpu-hpa -w
Generate GPU load to trigger scaling. Use sustained inference requests:
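    for i in $(seq 1 500); do
      curl -s -o /dev/null -X POST http://inference/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"prompt":"long text...","max_tokens":512}' &
    done
    wait
Observe the HPA increase replicas as GPU utilization crosses 75%. Because scaleUp adds at most one pod per 60 seconds, the replica count rises gradually rather than in one jump.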
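To reason about how many replicas to expect during the load test, the HPA's documented scaling rule can be sketched in a few lines of shell. For each metric the controller computes ceil(currentReplicas * currentValue / targetValue), leaves the count unchanged when the ratio is within a 10% tolerance (the default), and uses the largest result across all metrics. The helper name desired_replicas is hypothetical, and the sketch deliberately ignores minReplicas, maxReplicas, and the behavior policies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate the replica count the HPA would request for ONE metric.
# Usage: desired_replicas CURRENT_REPLICAS CURRENT_VALUE TARGET_VALUE
# Assumption: default 10% tolerance; min/max replicas and scaling policies are not applied.
desired_replicas() {
  awk -v cur="$1" -v val="$2" -v tgt="$3" 'BEGIN {
    ratio = val / tgt
    # Inside the tolerance band the HPA keeps the current replica count.
    if (ratio >= 0.9 && ratio <= 1.1) { print cur; exit }
    want = cur * ratio
    n = int(want)
    if (n < want) n++   # round up to the next whole replica
    print n
  }'
}

# Two replicas at 90% average GPU utilization against a target of 75.
desired_replicas 2 90 75   # prints 3
```

With the HPA above, the same calculation runs for the tokens-per-second metric, and whichever metric yields the higher count determines the scale-up.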
Verification
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq '.resources[] | select(.name | contains("gpu")) | .name'
Expected output: custom metric names containing "gpu" (e.g., deployments.apps/deployment_gpu_utilization_avg).
Common failures
- DCGM metrics missing in Prometheus: DCGM Exporter may not have scraped the GPU node yet. Check the exporter logs (it usually runs as a DaemonSet; adjust the namespace and resource name to your installation):
    kubectl logs -n monitoring daemonset/dcgm-exporter | grep "Starting"
  Also verify that the Prometheus scrape config targets the DCGM exporter.
- Custom metric not found by HPA: the metric name must match exactly between the Prometheus adapter config and the HPA spec. List the available metrics with:
    kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq '.resources[].name'
- HPA scales on CPU instead of GPU: when an HPA lists several metrics, it computes a desired replica count for each one and uses the largest. If the spec also contains CPU or memory resource metrics, those can drive scaling whenever they ask for more replicas than the GPU metrics do. Remove the CPU/memory resource metrics from the HPA spec to scale solely on GPU custom metrics.
- GPU nodes are fully utilized: the HPA requests more pods, but no nodes have available GPUs, so the new pods stay Pending. Implement a cluster autoscaler with GPU node groups to provision additional GPU nodes on demand.
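For the last failure, a rough capacity check can show whether maxReplicas could ever be scheduled. The sketch below sums allocatable GPUs across nodes. It assumes the GPUs are advertised under the nvidia.com/gpu resource name (as the NVIDIA device plugin does) and uses a hypothetical helper, sum_gpus. It reports total allocatable GPUs, not GPUs currently free.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the numeric values read from stdin, one per line.
# Non-numeric entries such as "<none>" (nodes without GPUs) are ignored.
sum_gpus() {
  awk '$1 ~ /^[0-9]+$/ { total += $1 } END { print total + 0 }'
}

# Assumed usage against a live cluster (not executed here):
#   kubectl get nodes --no-headers \
#     -o custom-columns='GPU:.status.allocatable.nvidia\.com/gpu' | sum_gpus

# Offline example with sample input: two GPU nodes and one CPU-only node.
printf '4\n<none>\n8\n' | sum_gpus   # prints 12
```

If the total is smaller than maxReplicas multiplied by the GPUs each pod requests, the HPA will reach a point where added pods cannot be scheduled until more GPU nodes join the cluster.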