How to scale AI services based on request queue depth with KEDA
Prerequisites: a Kubernetes cluster, KEDA (installed in the first step if missing), and a Redis or RabbitMQ queue
What this does
This guide configures KEDA (Kubernetes Event-Driven Autoscaling) to scale AI inference pods based on request queue depth in Redis or RabbitMQ. As inference requests accumulate in the queue, KEDA increases the replica count to process the backlog. When the queue drains, replicas scale back to the minimum, potentially zero, which saves GPU costs during idle periods. This event-driven pattern typically reacts faster than CPU- or memory-based HPA for bursty AI workloads, because queue depth rises as soon as requests arrive.
Steps
Install KEDA if not already present:

    helm repo add kedacore https://kedacore.github.io/charts
    helm install keda kedacore/keda --namespace keda --create-namespace
    kubectl get pods -n keda

Expected output: KEDA operator and metrics server pods in Running state.

Verify the queue is accessible and reports metrics. For Redis:

    redis-cli -h redis-service LLEN ai:request:queue

Expected output: an integer (the current queue length).
Use KEDA's native Redis scaler, or add a Redis list length exporter if you prefer to scale on an exported metric; this guide uses the native scaler. If the Redis instance requires a password, create a KEDA TriggerAuthentication:

    apiVersion: keda.sh/v1alpha1
    kind: TriggerAuthentication
    metadata:
      name: redis-auth
    spec:
      secretTargetRef:
        - parameter: password
          name: redis-secret
          key: password

Create a KEDA ScaledObject that targets the AI inference Deployment:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: ai-queue-scaler
    spec:
      scaleTargetRef:
        name: ai-inference
      pollingInterval: 15
      cooldownPeriod: 300
      minReplicaCount: 0
      maxReplicaCount: 10
      triggers:
        - type: redis
          metadata:
            address: redis-service:6379
            listName: ai:request:queue
            listLength: "5"
          authenticationRef:
            name: redis-auth

Key settings:

- minReplicaCount: 0 allows scale-to-zero when the queue is empty.
- listLength: "5" sets a target of 5 queued items per pod, so roughly one pod is added for every 5 items in the queue.
- pollingInterval: 15 checks the queue every 15 seconds.
- authenticationRef points at the TriggerAuthentication above; omit it if Redis needs no password and no TriggerAuthentication was created.

Apply the ScaledObject and verify it activates:

    kubectl apply -f ai-queue-scaler.yaml
    kubectl get scaledobject ai-queue-scaler

Expected output: READY True and ACTIVE False (ACTIVE becomes True when the queue has items).

Populate the queue to trigger scaling. Use a script to push test items:

    for i in $(seq 1 50); do
      redis-cli -h redis-service LPUSH ai:request:queue "{\"prompt\":\"test-$i\"}"
    done

Watch the deployment scale up:

    kubectl get pods -l app=ai-inference -w

Expected output: pods transitioning from 0 to N, where N is roughly the queue length divided by 5, rounded up and capped at maxReplicaCount (10 for the 50 test items, if nothing has consumed them yet).
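Drain the queue and observe scale-down:

    redis-cli -h redis-service DEL ai:request:queue

After the cooldown period (300 seconds), pods scale back toward minReplicaCount. Note that cooldownPeriod governs only the final step to zero replicas; scale-down between higher replica counts follows the HPA's own stabilization behavior.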
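The replica count expected in the scale-up step can be estimated ahead of time. The sketch below assumes the desired count is the queue length divided by listLength, rounded up and clamped between minReplicaCount and maxReplicaCount. The function name desired_replicas is hypothetical, and the result is an estimate rather than a guarantee of what the controller will do, since pods consume items while scaling is in progress.

```shell
# Estimate the replica count for a given queue length.
# Assumption: desired = ceil(queue_length / listLength), clamped to
# [minReplicaCount, maxReplicaCount]. desired_replicas is a hypothetical helper.
desired_replicas() {
  queue_len=$1; list_length=$2; min=$3; max=$4
  # Integer ceiling division.
  n=$(( (queue_len + list_length - 1) / list_length ))
  if [ "$n" -lt "$min" ]; then n=$min; fi
  if [ "$n" -gt "$max" ]; then n=$max; fi
  echo "$n"
}

# Values from this guide: listLength 5, minReplicaCount 0, maxReplicaCount 10.
desired_replicas 50 5 0 10   # prints 10
desired_replicas 12 5 0 10   # prints 3
```

Running the estimate before a load test makes it easier to tell whether an unexpected pod count comes from the scaler settings or from pods failing to schedule.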
Verification
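KEDA manages an HPA named keda-hpa-<ScaledObject name>. Compare its current and desired replica counts (requires jq):

    kubectl get hpa keda-hpa-ai-queue-scaler -o json | jq '{current: .status.currentReplicas, desired: .status.desiredReplicas}'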
Expected output: current and desired replicas matching KEDA's scaling decision (e.g., {"current": 3, "desired": 3}).
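The check above is a single snapshot. To wait for the Deployment to settle at an expected size (for example 0 after the queue drains), a polling loop can help. This is a minimal sketch, assuming kubectl is configured for the cluster. The helper name wait_for_ready is hypothetical, and the default timeout and interval are arbitrary choices.

```shell
# Poll a Deployment until its ready replica count reaches a target or a timeout expires.
# wait_for_ready is a hypothetical helper; the defaults below are assumptions.
wait_for_ready() {
  deployment=$1; target=$2; timeout=${3:-600}; interval=${4:-5}
  elapsed=0
  ready=0
  while [ "$elapsed" -le "$timeout" ]; do
    if ! ready=$(kubectl get deployment "$deployment" -o jsonpath='{.status.readyReplicas}'); then
      echo "kubectl failed while reading $deployment" >&2
      return 2
    fi
    # .status.readyReplicas is omitted when no pods are ready, so default to 0.
    ready=${ready:-0}
    if [ "$ready" -eq "$target" ]; then
      echo "ready replicas: $ready"
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "timed out waiting for $target ready replicas (last seen: $ready)" >&2
  return 1
}

# Usage (not run here): wait up to 10 minutes for scale-down to zero after draining the queue.
# wait_for_ready ai-inference 0 600 5
```

The timeout should be longer than cooldownPeriod, otherwise a scale-down that is still pending will be reported as a failure.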
Common failures
- ScaledObject stays ACTIVE False: the Redis address may be unreachable from KEDA. Check the KEDA operator logs with kubectl logs -n keda deployment/keda-operator and look for "error connecting to redis" messages.
- Scale-to-zero prevents incoming requests: with minReplicaCount: 0, no pods run when the queue is empty, so incoming requests must be enqueued first. Implement an HTTP scaler alongside the Redis scaler to keep at least 1 pod up when an HTTP endpoint is actively receiving requests.
- Queue backlog causes rapid scale-up overshoot: if the queue accumulates 500 items and maxReplicaCount allows it, KEDA requests 100 pods (500/5), which can exceed the cluster's capacity. Set maxReplicaCount to a realistic value based on available GPU nodes.
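One way to pick a realistic maxReplicaCount is to derive it from GPU capacity. The sketch below assumes every GPU node exposes the same number of GPUs and every inference pod requests a fixed whole number of them. The helper name max_replicas_for_gpus and the example numbers are hypothetical.

```shell
# Derive an upper bound for maxReplicaCount from GPU capacity.
# Assumptions: identical GPU nodes, and a fixed whole number of GPUs per pod.
# max_replicas_for_gpus is a hypothetical helper.
max_replicas_for_gpus() {
  gpu_nodes=$1; gpus_per_node=$2; gpus_per_pod=$3
  # A pod cannot span nodes, so floor the per-node pod count before multiplying.
  echo $(( gpu_nodes * (gpus_per_node / gpus_per_pod) ))
}

max_replicas_for_gpus 3 4 1   # prints 12
max_replicas_for_gpus 3 4 3   # prints 3 (only one 3-GPU pod fits on a 4-GPU node)
```

If other workloads share the GPU nodes, the usable bound is lower than this figure.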