12. Horizontal Pod Autoscaling
The Horizontal Pod Autoscaler (HPA) scales the number of pod replicas based on observed metrics. It adjusts the replica count within configurable minimum and maximum bounds in response to CPU usage, memory consumption, or custom metrics served through the Kubernetes metrics APIs.
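Pod scaling requires metrics from the Metrics Server or from a custom metrics pipeline. The Metrics Server provides CPU and memory metrics through the metrics.k8s.io API; utilization targets are evaluated as a percentage of each container's resource requests, so the target pods must declare requests. Custom metrics require the custom.metrics.k8s.io API, and external metrics the external.metrics.k8s.io API, both of which can be implemented by adapters such as Prometheus Adapter.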
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
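To make the effect of the 70% CPU target and the 2 to 20 replica bounds concrete, the following is a minimal sketch of the replica calculation described in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured bounds. The function name `desired_replicas` is hypothetical, inputs are whole numbers, and the controller's tolerance band (by default it skips changes when the ratio is within 10% of 1.0) is deliberately left out.

```shell
#!/bin/sh
# Sketch of the HPA replica calculation (not the controller's actual code).
# desired = ceil(current_replicas * current_metric / target_metric),
# clamped to [min_replicas, max_replicas]. Integer inputs only.
desired_replicas() {
  current_replicas=$1
  current_metric=$2   # e.g. observed average CPU utilization in percent
  target_metric=$3    # e.g. averageUtilization from the HPA spec
  min_replicas=$4
  max_replicas=$5

  # Integer ceiling division: ceil(a / b) == (a + b - 1) / b
  desired=$(( (current_replicas * current_metric + target_metric - 1) / target_metric ))

  if [ "$desired" -lt "$min_replicas" ]; then desired=$min_replicas; fi
  if [ "$desired" -gt "$max_replicas" ]; then desired=$max_replicas; fi
  echo "$desired"
}

# 4 replicas averaging 90% CPU against a 70% target, bounds 2..20
desired_replicas 4 90 70 2 20   # prints 6
```

When several metrics are configured, as in the manifest above, the autoscaler evaluates each metric separately and uses the largest resulting replica count.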
Custom metrics enable AI-workload-aware autoscaling. Inference-specific metrics like queue depth, average inference latency, or batch availability inform scaling decisions better than generic CPU metrics. The Prometheus Adapter transforms Prometheus metrics into HPA-compatible formats.
# HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    # Per-pod metric served by custom.metrics.k8s.io
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "50"
    # Metric served by external.metrics.k8s.io
    - type: External
      external:
        metric:
          name: gpu_utilization_avg
          selector:
            matchLabels:
              deployment: inference-server
        target:
          # The metric is already an average, so compare it directly;
          # AverageValue would divide it by the pod count first.
          type: Value
          value: "70"
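For the `inference_queue_depth` metric in the manifest above to be visible to the HPA, Prometheus Adapter needs a rule that maps the Prometheus series to Kubernetes pods. The sketch below writes such a rule into a Helm values file. It assumes the application exports a gauge named `inference_queue_depth` carrying `namespace` and `pod` labels; the file name `adapter-values.yaml` and the choice of `avg` as the aggregation are illustrative assumptions, not requirements.

```shell
#!/bin/sh
# Write a Helm values file for the prometheus-adapter chart.
# The file name and the rule contents are illustrative assumptions.
cat > adapter-values.yaml <<'EOF'
rules:
  custom:
    # Select per-pod series of the (assumed) application gauge
    - seriesQuery: 'inference_queue_depth{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^inference_queue_depth$"
        as: "inference_queue_depth"
      # Aggregate the gauge per pod for the custom metrics API
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
EOF
```

The file can then be passed to the chart with `-f adapter-values.yaml` when installing or upgrading the adapter. The quoted heredoc delimiter keeps the shell from expanding anything inside the YAML.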
Stabilization windows prevent the replica count from oscillating when load fluctuates. For scale-down, the stabilizationWindowSeconds setting makes the autoscaler act on the highest replica recommendation computed within the window, which avoids premature pod termination during brief traffic dips; the default scale-down window is 300 seconds. Scale-up stabilization defaults to zero for fast response.
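The scale-down behavior can be illustrated with a small sketch. It assumes the autoscaler has recorded a series of replica recommendations inside the stabilization window and, when scaling down, picks the highest of them. The function name `stabilized_recommendation` and the sample values are hypothetical.

```shell
#!/bin/sh
# Sketch of scale-down stabilization: use the highest recommendation
# recorded inside the window rather than the most recent one.
stabilized_recommendation() {
  highest=$1
  for rec in "$@"; do
    if [ "$rec" -gt "$highest" ]; then highest=$rec; fi
  done
  echo "$highest"
}

# Recommendations from the last 300 seconds, oldest first.
# The brief dip to 4 does not cause a scale-down from 10.
stabilized_recommendation 10 10 4 9   # prints 10
```

Only once every recommendation in the window has dropped does the replica count follow it down.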
Behavior policies limit how fast the replica count may change. Scaling down too quickly can terminate pods before their in-flight requests have drained. Percent-based policies are calculated relative to the current replica count, which caps how far the deployment can grow or shrink within a single policy period.
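As a worked example of these rate limits, the sketch below applies a 10% scale-down policy and a 100% scale-up policy to sample replica counts. It is a simplification that treats each policy period as one step. The rounding (remaining replicas rounded down on scale-down, rounded up on scale-up) is assumed to match the behavior described in the Kubernetes documentation, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of Percent policy limits (simplified: one step per period).

# Lowest replica count allowed after one scale-down period.
scale_down_floor() {
  replicas=$1; percent=$2
  # Integer division rounds the remaining replicas down
  echo $(( replicas * (100 - percent) / 100 ))
}

# Highest replica count allowed after one scale-up period.
scale_up_ceiling() {
  replicas=$1; percent=$2
  # Adding 99 before dividing by 100 rounds the result up
  echo $(( (replicas * (100 + percent) + 99) / 100 ))
}

scale_down_floor 20 10    # prints 18
scale_up_ceiling 3 100    # prints 6

# Walk from 20 replicas down to 2 at 10% per period
replicas=20
periods=0
while [ "$replicas" -gt 2 ]; do
  replicas=$(scale_down_floor "$replicas" 10)
  periods=$((periods + 1))
  echo "after period $periods: $replicas replicas"
done
```

Under this model, a policy of 10% per 60 seconds needs 13 periods to shrink a deployment from 20 replicas to 2, while 100% per 15 seconds lets it double every 15 seconds on the way up.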
# View HPA status
kubectl get hpa -n ai-inference
kubectl describe hpa inference-server-hpa -n ai-inference
# View current metrics
kubectl get hpa inference-server-hpa \
  -n ai-inference -o yaml \
  | grep -A30 "status:"
# Generate load to trigger scaling during testing
# (runs in ai-inference so the short service name resolves,
# assuming inference-service lives in that namespace)
kubectl run load-generator \
  --image=busybox \
  -n ai-inference \
  -- /bin/sh -c "while true; do wget -q -O- \
    http://inference-service/infer; done"
Configure autoscaling for an inference deployment using both CPU and custom queue-depth metrics. Deploy Prometheus Adapter to expose queue metrics, create custom metric definitions, configure the HPA with appropriate stabilization windows and rate limits, then generate load to observe scaling behavior.
# Install Prometheus Adapter via Helm
# (the chart appends prometheus.port to prometheus.url,
# so the two are set separately)
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter \
  prometheus-community/prometheus-adapter \
  -n ai-inference \
  --set prometheus.url=http://prometheus-server \
  --set prometheus.port=9090
# Verify custom metrics availability
kubectl get --raw="/apis/custom.metrics.k8s.io/v1beta1/" \
  | jq '.resources[].name'
# Apply HPA configuration
kubectl apply -f hpa-config.yaml
# Generate load with five replicas
# (kubectl run no longer accepts --replicas, so a Deployment
# is created instead; the image must provide curl)
kubectl create deployment siege \
  --image=xp--prod.siege \
  --replicas=5 \
  -n ai-inference \
  -- /bin/sh -c "while true; do \
    curl -s http://inference-service/infer; done"
# Observe scaling
watch kubectl get hpa,pods -n ai-inference