What this does

This guide implements Kubernetes autoscaling driven by GPU-specific metrics (utilization percentage, memory pressure, and inference throughput) rather than CPU or memory alone. GPU metrics from DCGM or nvidia-ml-py are scraped by Prometheus, exposed through the Kubernetes custom metrics API by the Prometheus adapter, and consumed by the Horizontal Pod Autoscaler (HPA). The cluster can then scale inference pods when the GPUs saturate, which covers the case where CPU usage is low but the GPU is at 100% and requests are piling up.

Steps

Verify DCGM metrics are available in Prometheus:
```
curl -s "http://prometheus:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL" | jq '.data.result[] | {pod: .metric.pod, value: .value[1]}'
```
Expected output: GPU utilization percentage per pod, e.g., {"pod": "vllm-abc", "value": "87"}.
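
The same check can be scripted. The following is a minimal sketch, assuming the response has the standard Prometheus instant-query shape and that the series carry a `pod` label as in the expected output above; the helper name `gpu_util_by_pod` and the sample payload are hypothetical.

```python
def gpu_util_by_pod(response: dict) -> dict[str, float]:
    """Map pod name -> highest GPU utilization seen in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    result: dict[str, float] = {}
    for series in response["data"]["result"]:
        pod = series["metric"].get("pod")
        if pod is None:
            continue  # series that is not attributed to a pod
        _timestamp, value = series["value"]  # Prometheus encodes the value as a string
        # A pod with several GPUs produces several series; keep the busiest GPU.
        result[pod] = max(result.get(pod, 0.0), float(value))
    return result


# Hypothetical sample payload, shaped like the output of /api/v1/query.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "pod": "vllm-abc"},
             "value": [1700000000.0, "87"]},
            {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "gpu": "1"},
             "value": [1700000000.0, "12"]},
        ],
    },
}
print(gpu_util_by_pod(sample))
```

An empty result here usually means the series have no `pod` label, which is worth fixing before moving on to the aggregation step.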

Create a Prometheus recording rule that aggregates GPU utilization per Deployment:

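groups:
  - name: gpu_scaling
    rules:
      # Assumes every series carries namespace and deployment labels, for
      # example added by relabeling. DCGM Exporter and vLLM do not set a
      # deployment label themselves.
      - record: deployment:gpu_utilization_avg
        expr: |
          avg by (namespace, deployment) (
            DCGM_FI_DEV_GPU_UTIL
          )

      - record: deployment:inference_tokens_per_sec
        expr: |
          sum by (namespace, deployment) (
            rate(vllm:generation_tokens_total[1m])
          )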
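
To make the aggregation concrete, here is a rough Python sketch of what `avg by (...)` and `rate(...)` compute. It is an illustration under simplifying assumptions, not Prometheus's implementation: `per_second_rate` ignores counter resets and extrapolation, and the helper names and sample values are hypothetical.

```python
from collections import defaultdict


def avg_by(samples, *labels):
    """Average sample values grouped by the given label names.

    samples is a list of (labels_dict, value) pairs. A series that lacks a
    grouping label lands in the group with an empty label value.
    """
    groups = defaultdict(list)
    for metric_labels, value in samples:
        key = tuple(metric_labels.get(name, "") for name in labels)
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


def per_second_rate(first, last, window_seconds):
    """Simplified rate(): counter increase divided by the window length."""
    return (last - first) / window_seconds


# Hypothetical samples: two pods of one Deployment plus one unlabeled series.
samples = [
    ({"deployment": "ai-inference", "pod": "a"}, 90.0),
    ({"deployment": "ai-inference", "pod": "b"}, 60.0),
    ({"pod": "c"}, 10.0),
]
print(avg_by(samples, "deployment"))
print(per_second_rate(12_000, 42_000, 60))
```

The third sample shows why the `deployment` label matters: without it the series falls into an unnamed group and never contributes to the Deployment's average.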

Configure the Prometheus adapter to expose these custom metrics. In adapter-config.yml:

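rules:
  custom:
    # Both series must carry namespace and deployment labels so the adapter
    # can attach them to a Deployment object.
    - seriesQuery: 'deployment:gpu_utilization_avg'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          deployment: {group: "apps", resource: "deployment"}
      # Expose the series under the underscore name that the HPA references.
      name:
        matches: "^deployment:gpu_utilization_avg$"
        as: "deployment_gpu_utilization_avg"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    - seriesQuery: 'deployment:inference_tokens_per_sec'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          deployment: {group: "apps", resource: "deployment"}
      name:
        matches: "^deployment:inference_tokens_per_sec$"
        as: "deployment_inference_tokens_per_sec"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'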

Apply: helm upgrade prometheus-adapter prometheus-community/prometheus-adapter -f adapter-config.yml

Verify the custom metrics are visible to Kubernetes:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/deployments.apps/ai-inference/deployment_gpu_utilization_avg" | jq '.'

Expected output: JSON containing the metric name and current value.
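
The value in that JSON is a Kubernetes quantity string, so `87500m` means 87.5. Here is a small sketch for reading it, assuming the v1beta1 `MetricValueList` shape; `parse_quantity` is a hypothetical helper that handles only a subset of decimal suffixes, and the sample response is illustrative.

```python
from decimal import Decimal

# Subset of Kubernetes decimal suffixes; binary suffixes (Ki, Mi, ...) are not handled.
_SUFFIXES = {
    "n": Decimal("1e-9"),
    "u": Decimal("1e-6"),
    "m": Decimal("1e-3"),
    "k": Decimal("1e3"),
    "M": Decimal("1e6"),
    "G": Decimal("1e9"),
}


def parse_quantity(quantity: str) -> float:
    """Convert a quantity string such as '87500m' or '2k' to a float."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return float(Decimal(quantity[: -len(suffix)]) * factor)
    return float(Decimal(quantity))


def metric_values(metric_value_list: dict) -> dict[str, float]:
    """Map described object name -> numeric metric value."""
    return {
        item["describedObject"]["name"]: parse_quantity(item["value"])
        for item in metric_value_list["items"]
    }


# Hypothetical response, shaped like a custom.metrics.k8s.io/v1beta1 MetricValueList.
sample_response = {
    "kind": "MetricValueList",
    "apiVersion": "custom.metrics.k8s.io/v1beta1",
    "items": [
        {
            "describedObject": {"kind": "Deployment", "namespace": "default",
                                "name": "ai-inference", "apiVersion": "apps/v1"},
            "metricName": "deployment_gpu_utilization_avg",
            "value": "87500m",
        }
    ],
}
print(metric_values(sample_response))
```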

Create an HPA that scales on both GPU utilization and inference throughput, and save it as gpu-hpa.yaml:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 1
  maxReplicas: 8
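  metrics:
    # The adapter attaches both metrics to the Deployment, so they are read as
    # Object metrics rather than per-pod (Pods) metrics.
    - type: Object
      object:
        describedObject:
          apiVersion: apps/v1
          kind: Deployment
          name: ai-inference
        metric:
          name: deployment_gpu_utilization_avg
        target:
          type: Value          # the metric is already an average across GPUs
          value: "75"
    - type: Object
      object:
        describedObject:
          apiVersion: apps/v1
          kind: Deployment
          name: ai-inference
        metric:
          name: deployment_inference_tokens_per_sec
        target:
          type: AverageValue   # Deployment total divided by the replica count
          averageValue: "500"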
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
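
The replica count this HPA aims for follows the documented formula `desiredReplicas = ceil(currentReplicas * currentValue / targetValue)`, evaluated per metric, with the largest result winning. The sketch below reproduces only that core arithmetic with the default 10% tolerance; it deliberately ignores the `behavior` policies and stabilization windows, and the function names are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Apply the HPA ratio formula for one metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: no change
    return math.ceil(current_replicas * ratio)


def recommend(current_replicas, metrics, min_replicas=1, max_replicas=8):
    """Take the largest per-metric proposal, clamped to the replica bounds.

    metrics is a list of (current_value, target_value) pairs.
    """
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Two replicas, GPU utilization at 90 (target 75), 400 tokens/s per pod (target 500).
print(recommend(2, [(90, 75), (400, 500)]))
```

In this example the GPU metric asks for 3 replicas and the throughput metric for 2, so the HPA would move toward 3, adding at most one pod per 60 seconds under the `scaleUp` policy above.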

Apply and observe the HPA:

kubectl apply -f gpu-hpa.yaml
kubectl get hpa ai-inference-gpu-hpa -w

Generate GPU load to trigger scaling. Use sustained inference requests:
```
for i in $(seq 1 500); do curl -s -o /dev/null -X POST http://inference/v1/completions -H 'Content-Type: application/json' -d '{"prompt":"long text...","max_tokens":512}' & done
wait
```
Observe the HPA increase replicas as GPU utilization crosses 75%.
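
Launching 500 background processes at once can exhaust the client before the GPU saturates. As an alternative, here is a minimal Python sketch that sends the same requests with bounded concurrency. It assumes the endpoint and payload from the loop above; `build_request`, `send`, `run_load`, and the concurrency of 32 are hypothetical choices, not part of any inference server's tooling.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_request(url, prompt, max_tokens=512):
    """Build a JSON POST request for the completions endpoint."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send one request; return the HTTP status, or None on any network or HTTP error."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None


def run_load(url, total=500, concurrency=32, prompt="long text..."):
    """Send `total` requests with at most `concurrency` in flight; return (ok, total)."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        statuses = list(pool.map(lambda _: send(build_request(url, prompt)), range(total)))
    return sum(1 for status in statuses if status == 200), total
```

Calling `run_load("http://inference/v1/completions")` from a pod inside the cluster sends the 500 requests and returns how many succeeded.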

Verification

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq '.resources[] | select(.name | contains("gpu")) | .name'

Expected output: custom metric names containing "gpu" (e.g., deployments.apps/deployment_gpu_utilization_avg).

Common failures

DCGM metrics missing in Prometheus: DCGM Exporter may not be running on the GPU node, or Prometheus may not have scraped it yet. DCGM Exporter usually runs as a DaemonSet, so check: kubectl logs -n monitoring daemonset/dcgm-exporter | grep "Starting". Also verify that the Prometheus scrape config targets the DCGM exporter.
Custom metric not found by HPA: the metric name and metric type must match exactly between the Prometheus adapter config and the HPA spec. Use kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq '.resources[].name' to list available metrics.
HPA scales on CPU instead of GPU: when several metrics are listed, the HPA computes a desired replica count for each one and uses the largest, so a CPU or memory metric can dominate the GPU metrics. Remove CPU/memory resource metrics from the HPA spec and rely solely on the GPU custom metrics.
GPU nodes are fully utilized: the HPA requests more pods, but no node has a free GPU, so the new pods stay Pending. Use the Cluster Autoscaler with GPU node groups to provision additional GPU nodes on demand.
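
The last failure can be detected from pod status. The sketch below assumes its input is the parsed output of `kubectl get pods -o json` and that the scheduler reports the shortage with a message containing `Insufficient nvidia.com/gpu`; the function name and the sample data are hypothetical.

```python
def gpu_starved_pods(pod_list: dict, resource: str = "nvidia.com/gpu") -> list[str]:
    """Return names of pods that are unschedulable for lack of the given resource."""
    starved = []
    for pod in pod_list.get("items", []):
        for condition in pod.get("status", {}).get("conditions", []):
            if (
                condition.get("type") == "PodScheduled"
                and condition.get("status") == "False"
                and condition.get("reason") == "Unschedulable"
                and f"Insufficient {resource}" in condition.get("message", "")
            ):
                starved.append(pod["metadata"]["name"])
                break  # one match per pod is enough
    return starved


# Hypothetical sample: one running pod and one pod waiting for a GPU.
sample_pods = {
    "items": [
        {"metadata": {"name": "ai-inference-1"},
         "status": {"conditions": [{"type": "PodScheduled", "status": "True"}]}},
        {"metadata": {"name": "ai-inference-2"},
         "status": {"conditions": [{
             "type": "PodScheduled", "status": "False", "reason": "Unschedulable",
             "message": "0/3 nodes are available: 3 Insufficient nvidia.com/gpu.",
         }]}},
    ]
}
print(gpu_starved_pods(sample_pods))
```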
