What this does

This guide configures the Kubernetes Horizontal Pod Autoscaler (HPA) to scale AI inference pods based on custom metrics — request queue depth, GPU utilization, and inference latency — in addition to standard CPU and memory. By using the Prometheus adapter to expose application-level metrics to the HPA, the inference service can scale before request queues overflow and without over-provisioning expensive GPU resources.

Steps

Verify the metrics-server is running:
```
kubectl get deployment metrics-server -n kube-system
```
Expected output: deployment with READY 1/1.

Install the Prometheus adapter with custom metrics support:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --set prometheus.url=http://prometheus.monitoring.svc.cluster.local

Configure the Prometheus adapter to expose the AI service's custom metrics. Add a custom metrics rule in adapter-config.yml:

rules:
  custom:
    - seriesQuery: 'ai_request_queue_depth{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        matches: "ai_request_queue_depth"
      metricsQuery: sum(ai_request_queue_depth) by (<<.GroupBy>>)

Add resource requests to the inference Deployment to enable HPA scaling:

containers:
  - name: inference
    resources:
      requests:
        cpu: "2"
        memory: "8Gi"
      limits:
        nvidia.com/gpu: "1"

Create the HPA manifest targeting custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: ai_request_queue_depth
        target:
          type: AverageValue
          averageValue: "5"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Apply the HPA and verify it is active:
```
kubectl apply -f hpa.yaml
kubectl get hpa inference-hpa -w
```
Expected output: columns showing current queue depth against the target of 5.
Generate load to trigger scaling. Use a load generator that sends concurrent inference requests:
```
for i in $(seq 1 100); do curl -X POST http://inference-service/v1/infer -d '{"prompt":"test"}' & done
wait
```
Observe the HPA increase replicas: kubectl get hpa inference-hpa.
Monitor the scale-down behavior. After the load generator completes, wait for the stabilization window (default 300 seconds) and confirm replicas return to the minimum.

Verification

kubectl get hpa inference-hpa -o json | jq '.status.currentReplicas'

Expected output: an integer >= 1, reflecting the current scale.

Common failures

HPA reports "unable to get metric" — the Prometheus adapter is not exposing the custom metric. Check with kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq '.resources[] | .name' | grep ai_request.
Replicas never exceed minReplicas — there may be insufficient GPU nodes to schedule additional pods. Check: kubectl get nodes -l accelerator=nvidia and ensure at least maxReplicas GPUs are available.
Scaling is too slow for traffic spikes — reduce the HPA's --horizontal-pod-autoscaler-downscale-stabilization and --horizontal-pod-autoscaler-upscale-stabilization flags on the kube-controller-manager, or switch to KEDA for event-driven scaling.
GPU pods unschedulable after scale-up — the GPU Device Plugin limits one GPU per pod by default. Each GPU node can only run as many GPU pods as it has GPUs.

How to set up horizontal pod autoscaling for AI inference services

What this does

Steps

Verification

Common failures

Related guides