RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · OPS

How to set up horizontal pod autoscaling for AI inference services

advanced·25 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Kubernetes cluster with metrics-server, Prometheus adapter

What this does

This guide configures the Kubernetes Horizontal Pod Autoscaler (HPA) to scale AI inference pods on a custom metric, request queue depth, alongside standard CPU utilization. The Prometheus adapter exposes application-level metrics to the HPA through the custom metrics API, so the inference service can add replicas before request queues overflow, without over-provisioning expensive GPU resources. Other application metrics, such as GPU utilization or inference latency, can be wired in with the same pattern by adding further adapter rules.

Steps

  1. Verify the metrics-server is running:

    kubectl get deployment metrics-server -n kube-system
    

    Expected output: deployment with READY 1/1.

  2. Install the Prometheus adapter with custom metrics support:

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install prometheus-adapter prometheus-community/prometheus-adapter \
      --set prometheus.url=http://prometheus.monitoring.svc.cluster.local
    
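  3. Configure the Prometheus adapter to expose the AI service's custom metrics. Add a custom metrics rule to a Helm values file, adapter-config.yml, then apply it with helm upgrade prometheus-adapter prometheus-community/prometheus-adapter --reuse-values -f adapter-config.yml: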

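    rules:
      custom:
        # Match only series that carry both a namespace and a pod label
        - seriesQuery: 'ai_request_queue_depth{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: {resource: "namespace"}
              # Required so the metric is served per pod (the HPA uses type: Pods)
              pod: {resource: "pod"}
          name:
            matches: "ai_request_queue_depth"
          # <<.LabelMatchers>> limits the query to the namespace and pods the HPA asks about
          metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)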
    
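  4. Add CPU and memory requests to the inference Deployment. The HPA computes CPU utilization as a percentage of the requested CPU, so the Resource metric in the next step cannot be evaluated without them: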

    containers:
      - name: inference
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            nvidia.com/gpu: "1"
    
  5. Create the HPA manifest targeting custom metrics:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: ai-inference
      minReplicas: 1
      maxReplicas: 5
      metrics:
        - type: Pods
          pods:
            metric:
              name: ai_request_queue_depth
            target:
              type: AverageValue
              averageValue: "5"
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
    
  6. Apply the HPA and verify it is active:

    kubectl apply -f hpa.yaml
    kubectl get hpa inference-hpa -w
    

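    Expected output: the TARGETS column shows the current average queue depth against the target of 5 (for example 2/5) and CPU utilization against 70%. A value of <unknown> means the HPA cannot read that metric yet.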

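  7. Generate load to trigger scaling. Use a load generator that sends concurrent inference requests. The HPA evaluates metrics roughly every 15 seconds by default, so repeat the burst if it finishes sooner than that:

    for i in $(seq 1 100); do curl -s -o /dev/null -X POST http://inference-service/v1/infer -H 'Content-Type: application/json' -d '{"prompt":"test"}' & done
    wait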
    

    Observe the HPA increase replicas: kubectl get hpa inference-hpa.

  8. Monitor the scale-down behavior. After the load generator completes, wait for the stabilization window (default 300 seconds) and confirm replicas return to the minimum.
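
To reason about what the HPA should do in steps 7 and 8, it helps to work the replica calculation by hand. For a Pods metric with an AverageValue target, the desired count is roughly the metric summed across pods, divided by the target, rounded up, then clamped to minReplicas and maxReplicas. The sketch below is a simplification under stated assumptions: it ignores the controller's default 10% tolerance, pod readiness handling, and the stabilization window, and the function name desired_replicas is hypothetical.

```shell
# Approximate the HPA replica count for a Pods/AverageValue metric.
# Simplified: no tolerance band, no stabilization window, integer values only.
desired_replicas() (
  total_queue=$1; target=$2; min=$3; max=$4
  # Ceiling division: ceil(total_queue / target)
  desired=$(( (total_queue + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
)

# Values from the manifest in step 5: target 5, minReplicas 1, maxReplicas 5.
desired_replicas 12 5 1 5    # 12 queued requests across all pods  → 3
desired_replicas 0 5 1 5     # idle service, clamped to minReplicas → 1
desired_replicas 100 5 1 5   # large burst, clamped to maxReplicas  → 5
```

When several metrics are configured, as in step 5, the HPA computes a count for each metric and uses the largest one.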

Verification

kubectl get hpa inference-hpa -o json | jq '.status.currentReplicas'

Expected output: an integer >= 1, reflecting the current scale.
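
The replica count alone does not show whether the HPA is actually reading the custom metric. As a rough sketch, the per-pod values can be read directly from the custom metrics API that the Prometheus adapter serves. The namespace default and the helper names pod_metric_path and show_pod_metric are assumptions for illustration; substitute the namespace the ai-inference Deployment runs in.

```shell
# Build the custom metrics API path for a per-pod metric.
# The "*" asks for the metric on every pod in the namespace.
pod_metric_path() {
  printf '/apis/custom.metrics.k8s.io/v1beta1/namespaces/%s/pods/*/%s\n' "$1" "$2"
}

# Print "<pod name> <value>" for each pod reporting the metric.
# Requires kubectl access to the cluster and jq.
show_pod_metric() {
  kubectl get --raw "$(pod_metric_path "$1" "$2")" \
    | jq -r '.items[] | "\(.describedObject.name) \(.value)"'
}

# Example (the namespace "default" is an assumption):
# show_pod_metric default ai_request_queue_depth
```

Values are Kubernetes quantities, so a fractional reading may appear in milli-units, for example 2500m for 2.5.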

Common failures

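  • HPA reports "unable to get metric": the Prometheus adapter is not exposing the custom metric. Check with kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq -r '.resources[].name' | grep ai_request. The HPA in this guide needs the pods/ai_request_queue_depth entry; if only namespaces/ai_request_queue_depth is listed, the adapter rule is missing a pod resource mapping.
  • Running replicas never exceed minReplicas: if kubectl get hpa shows a higher desired count but the new pods stay Pending, there are not enough GPU nodes to schedule them. Check kubectl get pods for Pending pods and kubectl get nodes -l accelerator=nvidia (the label depends on how your GPU nodes are labeled), and ensure at least maxReplicas GPUs are available.
  • Scaling is too slow for traffic spikes: tune spec.behavior.scaleUp in the HPA manifest (stabilizationWindowSeconds and policies) for faster scale-up. For faster scale-down, lower spec.behavior.scaleDown.stabilizationWindowSeconds or the cluster-wide --horizontal-pod-autoscaler-downscale-stabilization flag on the kube-controller-manager. The evaluation interval is set by --horizontal-pod-autoscaler-sync-period (default 15 seconds). Alternatively, switch to KEDA for event-driven scaling.
  • GPU pods unschedulable after scale-up: by default the NVIDIA device plugin allocates whole GPUs exclusively, so a GPU cannot be shared between pods. Each GPU node can run only as many single-GPU pods as it has GPUs unless time-slicing or MIG is configured.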
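
For the slow-scaling case, per-HPA tuning is available through the behavior field of autoscaling/v2, which avoids touching cluster-wide controller settings. A minimal sketch, assuming the inference-hpa object from step 5 and purely illustrative numbers (no scale-up stabilization, at most 2 pods added per 30 seconds, a 120-second scale-down window). The variable and function names are hypothetical, and sensible values depend on how long a new inference pod takes to load its model.

```shell
# Merge patch that sets per-HPA scaling behavior (all values are illustrative).
FAST_SCALE_PATCH='{
  "spec": {
    "behavior": {
      "scaleUp": {
        "stabilizationWindowSeconds": 0,
        "policies": [{"type": "Pods", "value": 2, "periodSeconds": 30}]
      },
      "scaleDown": {
        "stabilizationWindowSeconds": 120
      }
    }
  }
}'

# Apply the patch to the HPA created in step 5 (requires cluster access).
apply_fast_scale() {
  kubectl patch hpa inference-hpa --type merge -p "$FAST_SCALE_PATCH"
}

# Example:
# apply_fast_scale && kubectl describe hpa inference-hpa
```

Shortening the scale-down window makes replicas drop sooner after a burst, at the cost of more churn if traffic is spiky.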

Related guides

  • Scale AI services based on request queue depth with KEDA
  • Implement GPU-based autoscaling with custom metrics
  • Deploy vLLM on Kubernetes with GPU node selection