How to set up horizontal pod autoscaling for AI inference services
Prerequisites: a Kubernetes cluster with metrics-server and a Prometheus adapter (the steps below verify the former and install the latter).
What this does
This guide configures the Kubernetes Horizontal Pod Autoscaler (HPA) to scale AI inference pods on custom, application-level metrics such as request queue depth, GPU utilization, and inference latency, in addition to standard CPU and memory. The steps below use request queue depth together with CPU utilization as the worked example. Because the Prometheus adapter exposes application-level metrics to the HPA, the inference service can scale out before request queues overflow, without over-provisioning expensive GPU resources.
Steps
Verify the metrics-server is running:

    kubectl get deployment metrics-server -n kube-system

Expected output: the deployment with READY 1/1.

Install the Prometheus adapter with custom metrics support:
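
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install prometheus-adapter prometheus-community/prometheus-adapter \
      --set prometheus.url=http://prometheus.monitoring.svc.cluster.local

Configure the Prometheus adapter to expose the AI service's custom metrics. Add a custom metrics rule in adapter-config.yml. The rule has to map the series to pods as well as namespaces, because the HPA reads it as a Pods metric, and the query should use the adapter's <<.Series>> and <<.LabelMatchers>> placeholders so that results are restricted to the pods being scaled. This assumes the series carries namespace and pod labels:

    rules:
      custom:
      - seriesQuery: 'ai_request_queue_depth{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "ai_request_queue_depth"
        metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

Pass this file to the adapter's Helm release (for example with -f adapter-config.yml on helm install or helm upgrade) so that the rule takes effect.

Add resource requests to the inference Deployment. The HPA computes CPU utilization as a percentage of the requested CPU, so it cannot evaluate a Utilization target without a request: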

    containers:
    - name: inference
      resources:
        requests:
          cpu: "2"
          memory: "8Gi"
        limits:
          nvidia.com/gpu: "1"

Create the HPA manifest targeting custom metrics and save it as hpa.yaml:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: ai-inference
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Pods
        pods:
          metric:
            name: ai_request_queue_depth
          target:
            type: AverageValue
            averageValue: "5"
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70

Apply the HPA and verify it is active:

    kubectl apply -f hpa.yaml
    kubectl get hpa inference-hpa -w

Expected output: the TARGETS column shows the current queue depth against the target of 5, alongside CPU utilization against the 70% target.
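
To make the numbers in the manifest concrete, the following is a minimal Python sketch of the replica calculation described in the Kubernetes HPA documentation: desired replicas are the ceiling of current replicas times the ratio of observed to target value, no change is made inside a 10% tolerance band, and with several metrics the largest proposal wins before clamping to minReplicas and maxReplicas. The function names and the sample inputs are assumptions made for illustration; the real controller also accounts for pod readiness and missing metrics, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Basic HPA rule: ceil(currentReplicas * currentValue / targetValue)."""
    ratio = current_value / target_value
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def recommend(current_replicas, avg_queue_depth, avg_cpu_cores,
              queue_target=5.0, cpu_request_cores=2.0, cpu_target_pct=70.0,
              min_replicas=1, max_replicas=5):
    """Combine the queue-depth and CPU proposals the way the manifest would."""
    # CPU utilization is measured relative to the container's CPU request.
    cpu_pct = 100.0 * avg_cpu_cores / cpu_request_cores
    by_queue = desired_replicas(current_replicas, avg_queue_depth, queue_target)
    by_cpu = desired_replicas(current_replicas, cpu_pct, cpu_target_pct)
    # The largest proposal wins, then the result is clamped to the HPA bounds.
    return max(min_replicas, min(max_replicas, max(by_queue, by_cpu)))


# Hypothetical reading: 2 pods, average queue depth 12, 1.0 CPU core used per pod.
print(recommend(2, avg_queue_depth=12.0, avg_cpu_cores=1.0))  # prints 5
```

In this example the queue-depth metric proposes 5 replicas while CPU (50% of the request) proposes 2, so queue depth drives the scale-up even though the pods are not CPU-bound.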
Generate load to trigger scaling. Use a load generator that sends concurrent inference requests:

    for i in $(seq 1 100); do
      curl -s -o /dev/null -X POST http://inference-service/v1/infer \
        -H 'Content-Type: application/json' \
        -d '{"prompt":"test"}' &
    done
    wait

Observe the HPA increase replicas:

    kubectl get hpa inference-hpa

Monitor the scale-down behavior. After the load generator completes, wait for the scale-down stabilization window (default 300 seconds) and confirm replicas return to the minimum.
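
The 300-second wait follows from how scale-down stabilization works: the HPA keeps recent recommendations and, when scaling down, acts on the highest one still inside the window. The sketch below simulates that under simplifying assumptions (a fixed 15-second sync period, a recommendation that drops to the minimum as soon as load ends, and no scale-down rate policies). The class and function names are hypothetical and are not part of any Kubernetes API.

```python
from collections import deque


class ScaleDownStabilizer:
    """Simplified model of the HPA scale-down stabilization window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended replicas)

    def stabilize(self, now, current_replicas, recommendation):
        self._history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self._history and self._history[0][0] <= now - self.window_seconds:
            self._history.popleft()
        if recommendation >= current_replicas:
            return recommendation  # scale-up is not delayed by this window
        # Scale down only as far as the highest recent recommendation allows.
        highest = max(rec for _, rec in self._history)
        return min(current_replicas, highest)


def first_scale_down(stabilizer, start_replicas, idle_recommendation,
                     sync_period=15, horizon=600):
    """Return (time, replicas) of the first scale-down after load stops at t=0."""
    replicas = start_replicas
    # At t=0 the load is still on, so the recommendation equals the current scale.
    stabilizer.stabilize(0, replicas, start_replicas)
    for now in range(sync_period, horizon + 1, sync_period):
        new_replicas = stabilizer.stabilize(now, replicas, idle_recommendation)
        if new_replicas < replicas:
            return now, new_replicas
    return None


print(first_scale_down(ScaleDownStabilizer(300), start_replicas=5, idle_recommendation=1))
# prints (300, 1)
```

With a shorter window the same simulation scales down sooner, which is the trade-off between reclaiming GPU capacity quickly and flapping when traffic is bursty.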
Verification
kubectl get hpa inference-hpa -o json | jq '.status.currentReplicas'
Expected output: an integer >= 1, reflecting the current scale.
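
A replica count alone does not show whether the HPA is actually able to read its metrics. As a hedged sketch, the Python function below inspects the same JSON document and reports the replica counts together with any status conditions that look unhealthy. The field names (status.currentReplicas, status.desiredReplicas, status.conditions) follow the autoscaling/v2 API; the function name and the sample document are illustrative assumptions, not output captured from a real cluster.

```python
def summarize_hpa(hpa):
    """Return replica counts and any conditions that suggest a scaling problem."""
    status = hpa.get("status", {})
    problems = []
    for cond in status.get("conditions", []):
        ctype, cstatus = cond.get("type"), cond.get("status")
        # AbleToScale and ScalingActive should be "True";
        # ScalingLimited being "True" means the HPA is pinned at min or max.
        unhealthy = (
            (ctype in ("AbleToScale", "ScalingActive") and cstatus != "True")
            or (ctype == "ScalingLimited" and cstatus == "True")
        )
        if unhealthy:
            problems.append(f"{ctype}: {cond.get('reason', '')}")
    return {
        "current": status.get("currentReplicas", 0),
        "desired": status.get("desiredReplicas", 0),
        "problems": problems,
    }


# Illustrative status document (hypothetical values).
sample = {
    "status": {
        "currentReplicas": 1,
        "desiredReplicas": 1,
        "conditions": [
            {"type": "AbleToScale", "status": "True", "reason": "ReadyForNewScale"},
            {"type": "ScalingActive", "status": "False", "reason": "FailedGetPodsMetric"},
            {"type": "ScalingLimited", "status": "False", "reason": "DesiredWithinRange"},
        ],
    }
}

print(summarize_hpa(sample))
```

To run it against a live cluster, one option is to pipe the output of kubectl get hpa inference-hpa -o json into a small script that calls summarize_hpa(json.load(sys.stdin)).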
Common failures
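- HPA reports "unable to get metric": the Prometheus adapter is not exposing the custom metric. Check with:

      kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq '.resources[] | .name' | grep ai_request

- Replicas never exceed minReplicas: there may be insufficient GPU nodes to schedule additional pods, which leaves the new pods Pending. List the GPU nodes (the accelerator=nvidia label is cluster-specific, so substitute the label your GPU nodes carry) and ensure at least maxReplicas GPUs are available:

      kubectl get nodes -l accelerator=nvidia

- Scaling is too slow for traffic spikes: scale-up has no stabilization window by default, so the delay usually comes from the HPA sync period (--horizontal-pod-autoscaler-sync-period on the kube-controller-manager, default 15 seconds), the Prometheus scrape interval, or the adapter's metric refresh interval. Tune those, or set spec.behavior.scaleUp policies on the HPA. The --horizontal-pod-autoscaler-downscale-stabilization flag affects scale-down only, and current Kubernetes releases have no matching upscale-stabilization flag; per-HPA windows are set through spec.behavior instead. For event-driven scaling, consider switching to KEDA.
- GPU pods unschedulable after scale-up: by default the GPU device plugin allocates whole GPUs and does not share a GPU between pods. With a limit of nvidia.com/gpu: "1" per pod, each GPU node can run only as many inference pods as it has GPUs.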
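
The GPU-related failures above come down to simple capacity arithmetic, which can be sketched in Python as follows. The node names and free-GPU counts are hypothetical inputs that would have to be gathered from the cluster (for example from each node's allocatable nvidia.com/gpu minus what running pods already request); the function names are illustrative only.

```python
def schedulable_gpu_pods(free_gpus_by_node, gpus_per_pod=1):
    """Count how many more pods fit when each pod needs whole, unshared GPUs."""
    return sum(free // gpus_per_pod for free in free_gpus_by_node.values())


def gpu_shortfall(free_gpus_by_node, current_replicas, max_replicas, gpus_per_pod=1):
    """Replicas the HPA could ask for that the cluster could not place."""
    wanted = max(0, max_replicas - current_replicas)
    return max(0, wanted - schedulable_gpu_pods(free_gpus_by_node, gpus_per_pod))


# Hypothetical cluster: two GPU nodes with 2 and 1 unallocated GPUs.
free = {"gpu-node-a": 2, "gpu-node-b": 1}

print(schedulable_gpu_pods(free))                               # prints 3
print(gpu_shortfall(free, current_replicas=1, max_replicas=5))  # prints 1
```

A non-zero shortfall means the HPA can raise its desired replica count beyond what the scheduler can place, so either maxReplicas should be lowered or GPU capacity added.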