RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Production Local AI Deployment

12. Horizontal Pod Autoscaling

Chapter 12 of 24 · 25 min
KEY INSIGHT

Horizontal Pod Autoscaler matches replica count to demand, maintaining quality of service through automatic capacity adjustment while respecting scaling bounds.

The Horizontal Pod Autoscaler (HPA) scales pod replicas based on measured metrics. It adjusts the replica count within configurable minimum and maximum bounds, responding to CPU usage, memory consumption, or custom and external metrics served through the Kubernetes metrics APIs.
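
To make the scaling rule concrete, here is a minimal Python sketch of the replica calculation for a single metric: the ratio of the measured value to the target, multiplied by the current replica count and rounded up, then clamped to the configured bounds. The function name and arguments are hypothetical, and the 10% tolerance band is the controller's default setting; the real controller also handles details such as unready pods and missing metrics that this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA replica calculation for one metric.

    Hypothetical helper for illustration, not part of any Kubernetes client.
    """
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the minReplicas/maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% CPU against a 70% target, bounds 2..20
print(desired_replicas(4, 90, 70, 2, 20))  # prints 6
```

With 4 replicas at 90% average CPU and a 70% target, the ratio is about 1.29, so the sketch asks for 6 replicas. A reading of 72% falls inside the tolerance band and leaves the count at 4.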

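Scaling decisions require metrics from the Metrics Server or a custom metrics pipeline. The Metrics Server provides CPU and memory metrics through the metrics.k8s.io API. Utilization targets are percentages of each container's resource requests, so the target Deployment must set CPU and memory requests. Custom metrics require the custom.metrics.k8s.io API, and external metrics require the external.metrics.k8s.io API; an adapter such as Prometheus Adapter can implement both.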

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

Custom metrics enable autoscaling that is aware of AI workloads. Inference-specific metrics such as queue depth, average inference latency, or batch availability track load on GPU-bound servers more closely than CPU utilization, which can stay low while requests queue. The Prometheus Adapter exposes Prometheus series through the custom and external metrics APIs that the HPA queries.

# HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "50"
    - type: External
      external:
        metric:
          name: gpu_utilization_avg
          selector:
            matchLabels:
              deployment: inference-server
        target:
          type: AverageValue
          averageValue: "70"
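
When an HPA lists several metrics, the controller computes a replica proposal for each one and uses the largest. The Python sketch below illustrates that for two AverageValue targets like the ones in the manifest above. The function names and the sample numbers are hypothetical, and the sketch leaves out the tolerance band and the handling of unready pods.

```python
import math


def replicas_for_average_value(metric_total, target_average):
    """Replicas needed so the per-pod average falls to the target.

    metric_total is the metric summed over all pods (assumed input).
    """
    return math.ceil(metric_total / target_average)


def combine(proposals, min_replicas, max_replicas):
    """Take the largest per-metric proposal, then clamp to the bounds."""
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical readings: 300 queued requests in total, target 50 per pod
queue_proposal = replicas_for_average_value(300, 50)
# Hypothetical external metric total of 560, target average 70
gpu_proposal = replicas_for_average_value(560, 70)

print(combine([queue_proposal, gpu_proposal], 3, 50))  # prints 8
```

Taking the maximum means the most demanding metric wins, so a deep queue triggers a scale-up even when the other metric looks healthy.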

Stabilization windows prevent oscillation during transient load changes. With stabilizationWindowSeconds set on scaleDown, the controller acts on the highest replica recommendation seen during the window, so a brief drop in traffic does not terminate pods prematurely. Scale-up stabilization defaults to zero for fast response; scale-down defaults to 300 seconds.
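
The effect of a scale-down stabilization window can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the class and method names are hypothetical, timestamps are plain seconds, scale-up stabilization is zero, and rate-limit policies are ignored.

```python
from collections import deque


class ScaleDownStabilizer:
    """Simplified model of a scale-down stabilization window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp, recommended replicas)

    def recommend(self, now, current_replicas, raw_recommendation):
        self.history.append((now, raw_recommendation))
        # Forget recommendations that have aged out of the window.
        while self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        if raw_recommendation >= current_replicas:
            # Scale-up (or no change) takes effect immediately.
            return raw_recommendation
        # Scale-down only as far as the highest recent recommendation.
        highest_recent = max(r for _, r in self.history)
        return min(current_replicas, highest_recent)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.recommend(0, 10, 10))    # prints 10
print(stabilizer.recommend(60, 10, 4))    # prints 10, the drop is held back
print(stabilizer.recommend(301, 10, 4))   # prints 4, the high reading aged out
```

The low recommendation only takes effect once every higher recommendation has left the 300-second window.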

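Behavior policies limit how fast the replica count can change. Scaling down too quickly can terminate pods that are still serving in-flight requests. A Percent policy caps the change as a percentage of the current replica count per period: the scaleDown policy in the first manifest removes at most 10% of replicas every 60 seconds, while the scaleUp policy allows the count to double every 15 seconds.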
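
As a rough illustration of Percent policies, the Python sketch below caps a desired replica count by the percentage allowed in one period. The function names are hypothetical, and the sketch assumes no other scaling event has already happened in the current period. Rounding follows the behavior described in the Kubernetes documentation, where the number of pods removed is rounded up; treat the exact rounding as an assumption to verify against your cluster version.

```python
def scale_down_floor(period_start_replicas, percent):
    """Lowest replica count a Percent scale-down policy allows in one period."""
    # Integer floor: the number of removed pods is effectively rounded up.
    return period_start_replicas * (100 - percent) // 100


def scale_up_ceiling(period_start_replicas, percent):
    """Highest replica count a Percent scale-up policy allows in one period."""
    # Negated floor division gives an integer ceiling.
    return -(-period_start_replicas * (100 + percent) // 100)


def apply_rate_limit(current, desired, down_percent, up_percent):
    """Clamp the desired replica count to what the policies permit."""
    if desired < current:
        return max(desired, scale_down_floor(current, down_percent))
    if desired > current:
        return min(desired, scale_up_ceiling(current, up_percent))
    return current


# 10% scale-down per period, 100% scale-up per period
print(apply_rate_limit(20, 5, 10, 100))  # prints 18
print(apply_rate_limit(3, 20, 10, 100))  # prints 6
```

Under these settings a drop from 20 to 5 desired replicas proceeds in steps of about 10% per period, while a surge can double the replica count each period.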

# View HPA status
kubectl get hpa -n ai-inference
kubectl describe hpa inference-server-hpa -n ai-inference

# View current metrics
kubectl get hpa inference-server-hpa \
  -n ai-inference -o yaml \
  | grep -A30 "status:"

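# Generate load to trigger scaling (for testing)
# Runs in the same namespace so the inference-service name resolves
kubectl run load-generator \
  --image=busybox \
  --restart=Never \
  -n ai-inference \
  -- /bin/sh -c "while true; do wget -q -O- \
  http://inference-service/infer; done"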
EXERCISE

Configure autoscaling for an inference deployment using both CPU and custom queue-depth metrics. Deploy Prometheus Adapter to expose queue metrics, create custom metric definitions, configure the HPA with appropriate stabilization windows and rate limits, then generate load to observe scaling behavior.

# Install Prometheus Adapter via Helm
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update
# The chart appends prometheus.port to prometheus.url, so set them separately
helm install prometheus-adapter \
  prometheus-community/prometheus-adapter \
  -n ai-inference \
  --set prometheus.url=http://prometheus-server \
  --set prometheus.port=9090

# Verify custom metrics availability
kubectl get --raw="/apis/custom.metrics.k8s.io/v1beta1/" \
  | jq '.resources[].name'

# Apply HPA configuration
kubectl apply -f hpa-config.yaml

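# Generate load with five client pods
# kubectl run no longer accepts --replicas, so a Deployment provides them.
# The curlimages/curl image is an assumption: any image that ships curl works.
kubectl create deployment siege \
  --image=curlimages/curl \
  --replicas=5 \
  -n ai-inference \
  -- /bin/sh -c "while true; do \
    curl -s http://inference-service/infer; done"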

# Observe scaling
watch kubectl get hpa,pods -n ai-inference