What this does

This guide configures KEDA (Kubernetes Event-Driven Autoscaling) to scale AI inference pods based on request queue depth in Redis or RabbitMQ. The steps below use KEDA's Redis list scaler; a RabbitMQ queue follows the same pattern with a different trigger type. As inference requests accumulate in the queue, KEDA increases the replica count to process the backlog. When the queue drains, replicas scale back to the minimum, which can be zero, so no GPU capacity is consumed during idle periods. Because queue depth rises before CPU or memory utilization does, this event-driven pattern reacts to bursty AI workloads sooner than resource-based HPA.

Steps

Install KEDA if not already present:

```
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
kubectl get pods -n keda
```

Expected output: KEDA operator and metrics server pods in Running state.

Verify the queue is accessible and reports metrics. For Redis:
```
redis-cli -h redis-service LLEN ai:request:queue
```
Expected output: an integer (the current queue length).
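
The command above only proves that Redis is reachable from the machine running redis-cli. The KEDA operator connects from its own namespace (keda in this guide), so it can help to repeat the check from there. The following is a minimal sketch, not part of KEDA itself: the function name, the pod name redis-check, the redis:7 image, and the namespace argument are all assumptions, and it presumes the default cluster.local cluster domain.

```shell
# Hypothetical helper: run LLEN from a temporary pod in the keda namespace,
# using the fully qualified Service name so cross-namespace DNS resolves.
# Usage: check_queue_from_keda <namespace-where-redis-service-lives>
check_queue_from_keda() {
  local redis_ns=$1
  local host="redis-service.${redis_ns}.svc.cluster.local"
  # --rm deletes the pod when the command exits; -i attaches so output is shown.
  kubectl run redis-check --namespace keda --rm -i --restart=Never \
    --image=redis:7 -- redis-cli -h "$host" LLEN ai:request:queue
}
```

Calling check_queue_from_keda with the namespace that hosts redis-service should print the same integer as the direct check. If it fails while the direct check succeeds, the cause is likely DNS or a network policy between namespaces.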

KEDA's native Redis scaler reads the list length directly, so a separate Redis list length exporter is not required for this setup. If the Redis instance requires a password, create a KEDA TriggerAuthentication that references it:

```
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: redis-auth
spec:
  secretTargetRef:
    - parameter: password
      name: redis-secret
      key: password
```
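
The TriggerAuthentication reads a Secret named redis-secret with a key named password, and that Secret has to exist in the same namespace as the TriggerAuthentication. A minimal sketch for creating it follows; the function name and the REDIS_PASSWORD environment variable are assumptions made for illustration.

```shell
# Hypothetical helper: create the Secret that redis-auth points at.
# The password is taken from the REDIS_PASSWORD environment variable
# (an assumed convention) so it does not appear in the script itself.
# Usage: REDIS_PASSWORD=... create_redis_secret <namespace>
create_redis_secret() {
  local ns=$1
  kubectl create secret generic redis-secret --namespace "$ns" \
    --from-literal=password="${REDIS_PASSWORD:?set REDIS_PASSWORD first}"
}
```

Use the namespace in which the ScaledObject and TriggerAuthentication are applied.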

Create a KEDA ScaledObject that targets the AI inference Deployment:

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-queue-scaler
spec:
  scaleTargetRef:
    name: ai-inference
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: redis
      metadata:
        address: redis-service:6379
        listName: ai:request:queue
        listLength: "5"
      authenticationRef:
        name: redis-auth
```

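Key settings: minReplicaCount: 0 allows scale-to-zero when the queue is empty; listLength: "5" sets a target of 5 queued items per replica, so the desired replica count is roughly the queue length divided by 5, rounded up; pollingInterval: 15 makes KEDA check the queue every 15 seconds; cooldownPeriod: 300 is how long KEDA waits after the trigger was last active before scaling the last replica down to zero.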
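
To make the listLength arithmetic concrete, here is a small sketch that estimates the steady-state replica count for a given queue length. The function name is hypothetical, and the calculation is a simplification: it ignores the HPA's tolerance band and stabilization behavior and only shows the ceiling-and-clamp logic.

```shell
# Hypothetical helper: estimate replicas as ceil(queue_len / list_length),
# clamped to the range [min, max]. Integer arithmetic only.
# Usage: desired_replicas <queue_len> <list_length> <min> <max>
desired_replicas() {
  local queue_len=$1 list_length=$2 min=$3 max=$4
  # Adding (list_length - 1) before dividing rounds the quotient up.
  local n=$(( (queue_len + list_length - 1) / list_length ))
  if (( n < min )); then n=$min; fi
  if (( n > max )); then n=$max; fi
  echo "$n"
}
```

With the values from the ScaledObject above, desired_replicas 12 5 0 10 prints 3, and desired_replicas 500 5 0 10 prints 10 because maxReplicaCount caps the result.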

Apply the ScaledObject and verify it activates:
```
kubectl apply -f ai-queue-scaler.yaml
kubectl get scaledobject ai-queue-scaler
```
Expected output: READY True and ACTIVE False (becomes True when queue has items).

Populate the queue to trigger scaling. Use a script to push test items:

```
for i in $(seq 1 50); do redis-cli -h redis-service LPUSH ai:request:queue "{\"prompt\":\"test-$i\"}"; done
```

Watch the deployment scale up (the selector assumes the pods are labeled app=ai-inference):
```
kubectl get pods -l app=ai-inference -w
```
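Expected output: pods transitioning from 0 to N, where N = ceil(queue_length / 5), capped at maxReplicaCount. With 50 items still in the queue, N is 10; the number is lower if the workers have already consumed part of the backlog.

Drain the queue and observe scale-down: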
```
redis-cli -h redis-service DEL ai:request:queue
```
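Once the queue is empty, scale-down takes roughly five minutes. The HPA that KEDA manages handles reductions between N replicas and one (its default scale-down stabilization window is 300 seconds), while cooldownPeriod: 300 controls how long KEDA waits after the trigger was last active before removing the final replica and reaching minReplicaCount: 0.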
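
Rather than watching the pod list by hand, the wait can be scripted. The sketch below polls the Deployment's .spec.replicas until it matches a target or a timeout expires; the function name and its arguments are hypothetical, and it assumes the ai-inference Deployment is in the current namespace.

```shell
# Hypothetical helper: poll the ai-inference Deployment until .spec.replicas
# equals the target, or give up once the timeout (in seconds) has passed.
# Returns 0 on success and 1 on timeout.
# Usage: wait_for_replicas <target> <timeout-seconds> [interval-seconds]
wait_for_replicas() {
  local target=$1 timeout=$2 interval=${3:-15}
  local waited=0 current
  while (( waited <= timeout )); do
    current=$(kubectl get deployment ai-inference -o jsonpath='{.spec.replicas}')
    if [ "$current" = "$target" ]; then
      echo "replicas=$current after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "timed out after ${timeout}s (replicas=$current)" >&2
  return 1
}
```

For the scale-to-zero check, something like wait_for_replicas 0 900 leaves headroom beyond the 300-second windows described above.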

Verification

```
kubectl get hpa keda-hpa-ai-queue-scaler -o json | jq '{current: .status.currentReplicas, desired: .status.desiredReplicas}'
```

Expected output: current and desired replicas matching KEDA's scaling decision (e.g., {"current": 3, "desired": 3}).

Common failures

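ScaledObject stays ACTIVE False: if the queue has items and the ScaledObject still does not activate, the KEDA operator probably cannot reach or authenticate to Redis. The operator runs in the keda namespace, so a short Service name such as redis-service:6379 resolves only when Redis lives in that namespace too; otherwise use the fully qualified form redis-service.<namespace>.svc.cluster.local:6379. Check the operator logs with kubectl logs -n keda deployment/keda-operator and look for Redis connection errors.
Scale-to-zero leaves no pod to receive requests: with minReplicaCount: 0, no pods run while the queue is empty, so every incoming request must be enqueued rather than sent directly to a pod. If the service also exposes an HTTP endpoint, either set minReplicaCount: 1 or add an HTTP-based scaler alongside the Redis scaler so that at least one pod stays up while the endpoint is actively receiving requests.
Queue backlog drives scale-up beyond cluster capacity: with 500 items queued, the scaler's raw target is 100 pods (500 / 5). KEDA clamps that to maxReplicaCount, but if the limit is higher than the GPU nodes can host, the surplus pods remain Pending. Set maxReplicaCount to a realistic value based on available GPU nodes.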
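
One way to pick a realistic maxReplicaCount is to derive it from GPU capacity. The sketch below is a rough estimate under stated assumptions: all GPU nodes are identical, every GPU on them is available to this workload, and each pod requests a fixed number of GPUs. The function name is hypothetical.

```shell
# Hypothetical helper: upper bound on inference pods given GPU capacity.
# A pod cannot span nodes, so capacity is computed per node (integer
# division rounds down) and then multiplied by the node count.
# Usage: max_replicas_for_gpus <nodes> <gpus_per_node> <gpus_per_pod>
max_replicas_for_gpus() {
  local nodes=$1 gpus_per_node=$2 gpus_per_pod=$3
  echo $(( nodes * (gpus_per_node / gpus_per_pod) ))
}
```

For example, max_replicas_for_gpus 3 4 3 prints 3 rather than 4: each 4-GPU node fits only one 3-GPU pod, even though 12 GPUs exist in total.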

How to scale AI services based on request queue depth with KEDA
