Horizontal Pod Autoscaling — Production Local AI Deployment (Chapter 12)

Horizontal Pod Autoscaler scales pod replicas based on measured utilization metrics. The autoscaler adjusts replica count within configurable min and max bounds, responding to CPU usage, memory consumption, or custom metrics from the Metrics API.

Pod scaling requires resource utilization metrics from the Metrics Server or custom metric pipelines. The Metrics Server provides CPU and memory metrics through the metrics.k8s.io API. Custom metrics require the custom.metrics.k8s.io API implemented by solutions like Prometheus Adapter.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

Custom metrics enable AI-workload-aware autoscaling. Inference-specific metrics like queue depth, average inference latency, or batch availability inform scaling decisions better than generic CPU metrics. The Prometheus Adapter transforms Prometheus metrics into HPA-compatible formats.

# HPA With custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "50"
    - type: External
      external:
        metric:
          name: gpu_utilization_avg
          selector:
            matchLabels:
              deployment: inference-server
        target:
          type: AverageValue
          averageValue: "70"

Scale stabilization windows prevent oscillation during transient load spikes. The stabilizationWindowSeconds setting delays scale-down decisions, avoiding premature pod termination during brief traffic decreases. Scale-up stabilization defaults to zero for fast response.

Behavior policies control scaling rate limits. Pods scaled down too quickly can cause connection draining issues for in-flight requests. Percent-based policies relate to the current replica count, preventing dramatic percent changes in single scaling events.

# View HPA status
kubectl get hpa -n ai-inference
kubectl describe hpa inference-server-hpa -n ai-inference

# View current metrics
kubectl get hpa inference-server-hpa \
  -n ai-inference -o yaml \
  | grep -A30 "status:"

# Manual scale trigger for testing
kubectl run load-generator \
  --image=busybox \
  -- /bin/sh -c "while true; do wget -q -O- \
  http://inference-service/infer; done"

Configure autoscaling for an inference deployment using both CPU and custom queue-depth metrics. Deploy Prometheus Adapter to expose queue metrics, create custom metric definitions, configure the HPA with appropriate stabilization windows and rate limits, then generate load to observe scaling behavior.

# Install Prometheus adapter via Helm
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter \
  prometheus-community/prometheus-adapter \
  -n ai-inference \
  --set prometheus.url=http://prometheus-server:9090

# Verify custom metrics availability
kubectl get --raw="/apis/custom.metrics.k8s.io/v1beta1/" \
  | jq '.resources[].name'

# Apply HPA configuration
kubectl apply -f hpa-config.yaml

# Generate load test
kubectl run siege \
  --image=xp--prod.siege \
  --replicas=5 \
  -- /bin/sh -c "while true; do \
    curl -s http://inference-service/infer; done"

# Observe scaling
watch kubectl get hpa,pods -n ai-inference