RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · OPS

How to scale AI services based on request queue depth with KEDA

advanced·30 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Kubernetes cluster, KEDA installed, Redis/RabbitMQ queue

What this does

This guide configures KEDA (Kubernetes Event-Driven Autoscaling) to scale AI inference pods based on request queue depth in Redis or RabbitMQ. As inference requests accumulate in the queue, KEDA increases the replica count to process the backlog. When the queue drains, replicas scale back to the minimum, which can be zero, saving GPU costs during idle periods. Queue depth rises as soon as requests arrive, whereas CPU- or memory-based HPA reacts only after the running pods are already saturated, so this event-driven pattern responds faster to bursty AI workloads.
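
The scaling rule can be sketched as plain arithmetic. The function below is a simplified model, not KEDA's implementation: it assumes the replica count is the queue length divided by a per-pod target (the listLength value from the ScaledObject in step 4), rounded up and clamped between the minimum and maximum replica counts. It ignores the HPA's tolerance band and stabilization windows. The name desired_replicas is hypothetical.

```shell
# Simplified model of queue-depth scaling (hypothetical helper, not a KEDA API).
# Usage: desired_replicas QUEUE_LENGTH ITEMS_PER_POD MIN_REPLICAS MAX_REPLICAS
desired_replicas() {
  queue_length=$1
  items_per_pod=$2
  min_replicas=$3
  max_replicas=$4

  # Ceiling division: 12 items at 5 per pod needs 3 pods, not 2.
  wanted=$(( (queue_length + items_per_pod - 1) / items_per_pod ))

  # Clamp to the configured bounds.
  if [ "$wanted" -lt "$min_replicas" ]; then wanted=$min_replicas; fi
  if [ "$wanted" -gt "$max_replicas" ]; then wanted=$max_replicas; fi

  echo "$wanted"
}
```

With the values used in this guide, desired_replicas 50 5 0 10 prints 10 and desired_replicas 12 5 0 10 prints 3. An empty queue yields 0, which is the scale-to-zero case.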

Steps

  1. Install KEDA if not already present:

    helm repo add kedacore https://kedacore.github.io/charts
    helm repo update
    helm install keda kedacore/keda --namespace keda --create-namespace
    kubectl get pods -n keda
    

    Expected output: the keda-operator and keda-operator-metrics-apiserver pods (plus keda-admission-webhooks on recent chart versions) in Running state.

  2. Verify the queue is accessible and reports metrics. For Redis:

    redis-cli -h redis-service LLEN ai:request:queue
    

    Expected output: an integer (the current queue length).

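  3. Use KEDA's native Redis scaler, which reads the list length directly, so no separate exporter is needed. If the Redis instance requires a password, create a KEDA TriggerAuthentication in the same namespace as the ScaledObject. It reads the password from the password key of a Secret named redis-secret: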

    apiVersion: keda.sh/v1alpha1
    kind: TriggerAuthentication
    metadata:
      name: redis-auth
    spec:
      secretTargetRef:
        - parameter: password
          name: redis-secret
          key: password
    
  4. Create a KEDA ScaledObject that targets the AI inference Deployment:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: ai-queue-scaler
    spec:
      scaleTargetRef:
        name: ai-inference
      pollingInterval: 15
      cooldownPeriod: 300
      minReplicaCount: 0
      maxReplicaCount: 10
      triggers:
        - type: redis
          metadata:
            address: redis-service:6379
            listName: ai:request:queue
            listLength: "5"
          authenticationRef:
            name: redis-auth
    

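    Key settings: minReplicaCount: 0 allows scale-to-zero when the queue is empty; listLength: "5" is the target number of queued items per replica, so the replica count tracks ceil(queue length / 5) up to maxReplicaCount: 10; pollingInterval: 15 checks the queue every 15 seconds; cooldownPeriod: 300 waits 300 seconds after the last active trigger before scaling from one replica to zero.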

  5. Save the manifest as ai-queue-scaler.yaml, apply it, and verify that the ScaledObject is ready:

    kubectl apply -f ai-queue-scaler.yaml
    kubectl get scaledobject ai-queue-scaler
    

    Expected output: READY True and ACTIVE False (becomes True when queue has items).

  6. Populate the queue to trigger scaling. Use a script to push test items:

    for i in $(seq 1 50); do redis-cli -h redis-service LPUSH ai:request:queue "{\"prompt\":\"test-$i\"}"; done
    
  7. Watch the deployment scale up:

    kubectl get pods -l app=ai-inference -w
    

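    Expected output: pods transitioning from 0 to N, where N = ceil(queue_length / 5), capped at maxReplicaCount. With the 50 test items, N = 10. If the workers consume items while scaling is in progress, the observed count may be lower.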

  8. Drain the queue and observe scale-down:

    redis-cli -h redis-service DEL ai:request:queue
    

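    Pods then scale back toward minReplicaCount over several minutes. Scale-down from N to 1 is handled by the HPA, whose default stabilization window is 300 seconds. KEDA's cooldownPeriod (also 300 seconds here) governs only the final step from one replica to zero.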
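
Step 3 references a Secret named redis-secret with a password key but does not show how to create it. One way to do that is sketched below, assuming the Secret lives in the same namespace as the ScaledObject. The function name create_redis_secret and its arguments are hypothetical; kubectl create secret generic with --from-literal is the standard command.

```shell
# Hypothetical helper: creates the Secret that the redis-auth
# TriggerAuthentication reads its password from.
# Usage: create_redis_secret NAMESPACE REDIS_PASSWORD
create_redis_secret() {
  namespace=$1
  redis_password=$2

  if [ -z "$namespace" ] || [ -z "$redis_password" ]; then
    echo "usage: create_redis_secret NAMESPACE REDIS_PASSWORD" >&2
    return 2
  fi

  # Secret name and key must match secretTargetRef in the TriggerAuthentication.
  kubectl create secret generic redis-secret \
    --namespace "$namespace" \
    --from-literal=password="$redis_password"
}
```

For example, create_redis_secret default "$REDIS_PASSWORD" keeps the password out of the script itself. The value still appears in the process arguments, so --from-file may be preferable on shared machines.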

Verification

kubectl get hpa keda-hpa-ai-queue-scaler -o json | jq '{current: .status.currentReplicas, desired: .status.desiredReplicas}'

Expected output: current and desired replicas matching KEDA's scaling decision (e.g., {"current": 3, "desired": 3}).
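
When scripting this check, for example in a smoke test, it can help to poll until the two values agree. The sketch below assumes the KEDA-generated HPA name keda-hpa-ai-queue-scaler used above and reads the same status fields through kubectl's jsonpath output instead of jq. The helper names hpa_settled and wait_for_hpa are hypothetical. A missing currentReplicas value is treated as not settled.

```shell
# Hypothetical helpers for checking the HPA that KEDA generates.
# Succeeds when current and desired replica counts are both reported and equal.
hpa_settled() {
  hpa_name=$1
  counts=$(kubectl get hpa "$hpa_name" \
    -o jsonpath='{.status.currentReplicas} {.status.desiredReplicas}')
  current=${counts%% *}   # text before the first space
  desired=${counts##* }   # text after the last space
  [ -n "$current" ] && [ "$current" = "$desired" ]
}

# Usage: wait_for_hpa HPA_NAME [MAX_TRIES] [DELAY_SECONDS]
wait_for_hpa() {
  hpa_name=$1
  max_tries=${2:-20}
  delay=${3:-15}
  try=1
  while [ "$try" -le "$max_tries" ]; do
    if hpa_settled "$hpa_name"; then
      echo "settled after $try check(s)"
      return 0
    fi
    sleep "$delay"
    try=$((try + 1))
  done
  echo "not settled after $max_tries checks" >&2
  return 1
}
```

A call such as wait_for_hpa keda-hpa-ai-queue-scaler 20 15 checks every 15 seconds for up to 5 minutes and returns a non-zero status if the counts never converge.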

Common failures

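  • ScaledObject stays ACTIVE False: first confirm that the queue actually contains items (step 2). If it does, the Redis address may be unreachable from KEDA. Check the KEDA operator logs: kubectl logs -n keda deployment/keda-operator. Look for "error connecting to redis" messages.
  • Scale-to-zero prevents incoming requests: with minReplicaCount: 0, no pods run when the queue is empty, so a client that calls the inference service directly gets no response. Incoming requests must be enqueued first. Implement an HTTP scaler alongside the Redis scaler to keep at least 1 pod up while an HTTP endpoint is actively receiving requests, or set minReplicaCount: 1 if a permanently warm pod is acceptable.
  • Queue backlog causes rapid scale-up overshoot: if the queue accumulates 500 items, the metric calls for 100 pods (500 / 5). The replica count is capped at maxReplicaCount, so an overshoot only occurs when that cap exceeds what the cluster can schedule, in which case the extra pods sit in Pending. Set maxReplicaCount to a realistic value based on available GPU nodes.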
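
One rough way to derive a realistic maxReplicaCount is to count the GPUs the cluster can allocate. The sketch below assumes GPUs are exposed as the nvidia.com/gpu resource (as with the NVIDIA device plugin) and that every inference pod requests the same number of GPUs. The helper names are hypothetical, and the result is an upper bound: GPUs already claimed by other workloads are not subtracted.

```shell
# Hypothetical helpers for picking a maxReplicaCount that the cluster can schedule.
# Assumes GPUs are exposed as the nvidia.com/gpu resource (NVIDIA device plugin).
total_allocatable_gpus() {
  total=0
  # Nodes without GPUs contribute nothing to the jsonpath output.
  for gpus in $(kubectl get nodes \
      -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'); do
    total=$((total + gpus))
  done
  echo "$total"
}

# Usage: max_replicas_for_gpus TOTAL_GPUS GPUS_PER_POD
max_replicas_for_gpus() {
  # Integer division rounds down: a partial GPU share cannot host a pod.
  echo $(( $1 / $2 ))
}
```

For a cluster reporting 10 allocatable GPUs and pods that request 2 GPUs each, max_replicas_for_gpus 10 2 prints 5, which would be a reasonable ceiling for maxReplicaCount.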

Related guides

  • Predictive autoscaling for AI workloads using historical patterns
  • Horizontal pod autoscaling for AI inference services
  • Create Grafana dashboards for AI agent queue depth and wait times