HOW-TO · OPS

How to implement GPU-based autoscaling with custom metrics

advanced · 35 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Kubernetes with GPU nodes, custom metrics API

What this does

This guide implements Kubernetes autoscaling driven by GPU-specific metrics (utilization percentage, memory pressure, and inference throughput) rather than CPU or memory alone. GPU metrics from DCGM or nvidia-ml-py are scraped by Prometheus, exposed through the Kubernetes custom metrics API by the Prometheus adapter, and consumed by the HPA. The steps below wire up utilization and throughput; a memory-pressure metric can follow the same pattern. This lets the cluster scale inference pods when the GPU saturates, avoiding the scenario where CPU is low but the GPU is at 100% and requests are piling up.

Steps

  1. Verify DCGM metrics are available in Prometheus:

    curl -s "http://prometheus:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL" | jq '.data.result[] | {pod: .metric.pod, value: .value[1]}'
    

    Expected output: GPU utilization percentage per pod, e.g., {"pod": "vllm-abc", "value": "87"}.
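
The same check can be scripted. The following is a minimal sketch, assuming the response follows the standard Prometheus instant-query JSON shape; the function name `gpu_util_by_pod` and the sample payload are illustrative, and taking the busiest GPU for multi-GPU pods is a choice made here, not something the guide prescribes.

```python
import json


def gpu_util_by_pod(payload: dict) -> dict[str, float]:
    """Map pod name -> GPU utilization from a Prometheus instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    readings: dict[str, float] = {}
    for series in payload["data"]["result"]:
        pod = series["metric"].get("pod")
        if pod is None:
            # Series without a pod label cannot be attributed to a workload.
            continue
        _timestamp, raw_value = series["value"]
        # A pod with several GPUs yields several series; keep the busiest one.
        readings[pod] = max(readings.get(pod, 0.0), float(raw_value))
    return readings


# Illustrative payload shaped like the output of /api/v1/query.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "pod": "vllm-abc", "gpu": "0"},
       "value": [1700000000, "87"]},
      {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "gpu": "1"},
       "value": [1700000000, "12"]}
    ]
  }
}
""")

print(gpu_util_by_pod(sample))
```

An empty result usually means the exporter is running without pod attribution, so no series carries a pod label.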

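  2. Create a Prometheus recording rule that aggregates GPU utilization per Deployment. The rule assumes each series carries namespace and deployment labels; DCGM Exporter typically attaches pod, namespace, and container labels but no deployment label, so add one through relabeling or a join if it is missing:

    groups:
      - name: gpu_scaling
        rules:
          - record: deployment:gpu_utilization_avg
            expr: |
              avg by (namespace, deployment) (
                DCGM_FI_DEV_GPU_UTIL
              )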
    
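  3. Record inference throughput so it can be exposed as a custom metric. Add a rule for generated tokens per second to the same group (the series again needs namespace and deployment labels):

          - record: deployment:inference_tokens_per_sec
            expr: |
              sum by (namespace, deployment) (
                rate(vllm:generation_tokens_total[1m])
              )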
    
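  4. Configure the Prometheus adapter to expose these custom metrics. Each rule renames the recorded series (the colon becomes an underscore) and maps its namespace and deployment labels to Kubernetes resources, so both labels must be present on the series. In adapter-config.yml:

    rules:
      custom:
        - seriesQuery: 'deployment:gpu_utilization_avg'
          resources:
            overrides:
              namespace: {resource: "namespace"}
              deployment: {group: "apps", resource: "deployment"}
          name:
            matches: "^deployment:gpu_utilization_avg$"
            as: "deployment_gpu_utilization_avg"
          metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
        - seriesQuery: 'deployment:inference_tokens_per_sec'
          resources:
            overrides:
              namespace: {resource: "namespace"}
              deployment: {group: "apps", resource: "deployment"}
          name:
            matches: "^deployment:inference_tokens_per_sec$"
            as: "deployment_inference_tokens_per_sec"
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'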
    

    Apply: helm upgrade prometheus-adapter prometheus-community/prometheus-adapter -f adapter-config.yml

  5. Verify the custom metrics are visible to Kubernetes:

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/deployments.apps/ai-inference/deployment_gpu_utilization_avg" | jq '.'
    

    Expected output: JSON containing the metric name and current value.
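
The value in that JSON is a Kubernetes quantity string, so a utilization of 87.5 may appear as "87500m". A minimal sketch for reading it follows, assuming the v1beta1 MetricValueList shape; the helper names and the sample response (including its timestamp) are illustrative, and only decimal suffixes are handled, not binary ones such as Ki or Mi.

```python
from decimal import Decimal

# Decimal SI suffixes used by Kubernetes quantities (subset; binary suffixes omitted).
_SUFFIXES = {
    "n": Decimal("1e-9"),
    "u": Decimal("1e-6"),
    "m": Decimal("1e-3"),
    "k": Decimal("1e3"),
    "M": Decimal("1e6"),
    "G": Decimal("1e9"),
}


def parse_quantity(text: str) -> float:
    """Convert a quantity string such as "87", "87500m", or "2k" to a float."""
    for suffix, scale in _SUFFIXES.items():
        if text.endswith(suffix):
            return float(Decimal(text[: -len(suffix)]) * scale)
    return float(Decimal(text))


def current_values(metric_list: dict) -> dict[str, float]:
    """Map described object name -> metric value from a MetricValueList."""
    return {
        item["describedObject"]["name"]: parse_quantity(item["value"])
        for item in metric_list.get("items", [])
    }


# Illustrative response; real timestamps and values will differ.
sample = {
    "kind": "MetricValueList",
    "apiVersion": "custom.metrics.k8s.io/v1beta1",
    "items": [
        {
            "describedObject": {
                "kind": "Deployment",
                "namespace": "default",
                "name": "ai-inference",
                "apiVersion": "apps/v1",
            },
            "metricName": "deployment_gpu_utilization_avg",
            "timestamp": "2024-01-01T00:00:00Z",
            "value": "87500m",
        }
    ],
}

print(current_values(sample))
```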

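  6. Create an HPA that scales on both GPU utilization and inference throughput. Both metrics describe the Deployment rather than individual pods, so they are declared as Object metrics. Save the manifest as gpu-hpa.yaml:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: ai-inference-gpu-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: ai-inference
      minReplicas: 1
      maxReplicas: 8
      metrics:
        # Average GPU utilization across the Deployment, compared directly to 75
        - type: Object
          object:
            describedObject:
              apiVersion: apps/v1
              kind: Deployment
              name: ai-inference
            metric:
              name: deployment_gpu_utilization_avg
            target:
              type: Value
              value: "75"
        # Total tokens/sec divided by the replica count, compared to 500 per pod
        - type: Object
          object:
            describedObject:
              apiVersion: apps/v1
              kind: Deployment
              name: ai-inference
            metric:
              name: deployment_inference_tokens_per_sec
            target:
              type: AverageValue
              averageValue: "500"
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            - type: Pods
              value: 1
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300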
    
  7. Apply and observe the HPA:

    kubectl apply -f gpu-hpa.yaml
    kubectl get hpa ai-inference-gpu-hpa -w
    
  8. Generate GPU load to trigger scaling. Send a batch of concurrent inference requests:

    for i in $(seq 1 500); do curl -s -o /dev/null -X POST http://inference/v1/completions -H "Content-Type: application/json" -d '{"prompt":"long text...","max_tokens":512}' & done
    wait
    

    Observe the HPA increase replicas as GPU utilization crosses 75%.
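
To anticipate how many replicas the HPA will ask for, its documented calculation can be reproduced by hand. The sketch below assumes the standard formula, desired = ceil(current × metric / target), with the default 10% tolerance, and takes the largest result across metrics. The function names are illustrative, the 75 and 500 targets and the 1 to 8 replica bounds come from step 6, and the scale-up policy and stabilization windows are deliberately ignored.

```python
import math


def desired_replicas(current: int, metric_value: float, target: float,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA formula yields for a single metric."""
    ratio = metric_value / target
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no change is recommended.
        return current
    return math.ceil(current * ratio)


def recommend(current: int, gpu_util_avg: float, tokens_per_sec_total: float,
              min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Combine both metrics: the largest per-metric result wins, then clamp."""
    by_util = desired_replicas(current, gpu_util_avg, 75.0)
    # Averaging the Deployment-wide total over the replicas gives per-pod throughput.
    by_tokens = desired_replicas(current, tokens_per_sec_total / current, 500.0)
    return max(min_replicas, min(max_replicas, max(by_util, by_tokens)))


print(recommend(current=2, gpu_util_avg=90.0, tokens_per_sec_total=800.0))
print(recommend(current=4, gpu_util_avg=78.0, tokens_per_sec_total=1900.0))
```

In the first call utilization alone drives the scale-up, because per-pod throughput (400 tokens/sec) is below its target. In the second call both metrics sit within tolerance, so the replica count stays put. With the scale-up policy from step 6, the real HPA would additionally add at most one pod per 60 seconds.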

Verification

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq '.resources[] | select(.name | contains("gpu")) | .name'

Expected output: custom metric names containing "gpu" (e.g., deployments.apps/deployment_gpu_utilization_avg).

Common failures

  • DCGM metrics missing in Prometheus: DCGM Exporter may not be running on the GPU node, or Prometheus may not have scraped it yet. Check: kubectl logs -n monitoring daemonset/dcgm-exporter | grep "Starting" (DCGM Exporter usually runs as a DaemonSet). Also verify that the Prometheus scrape config targets the DCGM exporter.
  • Custom metric not found by HPA: the metric name must match exactly between the Prometheus adapter config and the HPA spec. Use kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq '.resources[].name' to list available metrics.
  • HPA scales on CPU instead of GPU: when several metrics are listed, the HPA computes a desired replica count for each one and applies the largest, so a CPU or memory metric left in the spec can trigger scaling on its own. Remove CPU/memory resource metrics from the HPA spec to rely solely on GPU custom metrics.
  • GPU nodes are fully utilized: the HPA requests more pods, but no nodes have available GPUs, so the new pods stay Pending. Implement a cluster autoscaler with GPU node groups to provision additional GPU nodes on demand.
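
The name-mismatch failure above can be caught before applying the HPA. The following is a rough sketch, assuming the custom metrics discovery endpoint returns a standard APIResourceList and the HPA manifest has been loaded into a dict; `missing_hpa_metrics` and both sample documents are illustrative.

```python
def missing_hpa_metrics(api_resource_list: dict, hpa: dict) -> list[str]:
    """Return HPA metric names that the custom metrics API does not advertise."""
    # Discovery names look like "<resource>/<metric>"; keep only the metric part.
    available = {
        res["name"].split("/", 1)[-1]
        for res in api_resource_list.get("resources", [])
    }
    missing = []
    for entry in hpa["spec"].get("metrics", []):
        # Only Pods and Object metrics are served by the custom metrics API.
        for kind in ("pods", "object"):
            if kind in entry:
                name = entry[kind]["metric"]["name"]
                if name not in available:
                    missing.append(name)
    return missing


# Illustrative discovery output in which only the utilization metric is registered.
discovery = {
    "kind": "APIResourceList",
    "groupVersion": "custom.metrics.k8s.io/v1beta1",
    "resources": [
        {"name": "deployments.apps/deployment_gpu_utilization_avg", "namespaced": True},
        {"name": "namespaces/deployment_gpu_utilization_avg", "namespaced": False},
    ],
}

# Trimmed-down HPA spec containing just the fields the check reads.
hpa = {
    "spec": {
        "metrics": [
            {"type": "Object",
             "object": {"metric": {"name": "deployment_gpu_utilization_avg"}}},
            {"type": "Object",
             "object": {"metric": {"name": "deployment_inference_tokens_per_sec"}}},
        ]
    }
}

print(missing_hpa_metrics(discovery, hpa))
```

Any name this returns points at an adapter rule that is absent, misnamed, or matching no series in Prometheus.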
