HOW-TO · OPS

How to implement pod disruption budgets for AI services during upgrades

advanced · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Kubernetes cluster, AI deployments running

What this does

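This guide configures Pod Disruption Budgets (PDBs) for AI inference and agent services to prevent cascading failures during voluntary disruptions such as node drains, cluster autoscaler scale-down, and rolling upgrades of cluster nodes. A PDB limits how many pods matching its selector may be evicted at once, expressed either as a minimum number that must stay available (minAvailable) or a maximum number that may be unavailable (maxUnavailable). It applies only to evictions made through the Eviction API: it does not protect against involuntary disruptions such as node crashes, and a Deployment's own rolling update is governed by its strategy settings rather than by the PDB. For AI workloads this matters because model loading can take minutes, so losing all replicas at once causes extended downtime.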
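
To make the budget arithmetic concrete, here is a minimal Python sketch of how the allowed-disruption count can be derived from a PDB spec. It is a simplified model, not the controller's code: the function names are hypothetical, `replicas` stands in for the expected pod count the controller reads from the owning workload, and percentages are rounded up as the Kubernetes documentation describes.

```python
def scaled(value, replicas):
    """Resolve an int or an 'N%' string against the replica count.

    Percentages are rounded up to a whole number of pods (integer ceiling).
    """
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return -(-percent * replicas // 100)  # ceiling division without floats
    return int(value)


def allowed_disruptions(replicas, healthy, min_available=None, max_unavailable=None):
    """Return how many healthy pods could be evicted right now (simplified model)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = scaled(min_available, replicas)
    else:
        desired_healthy = max(replicas - scaled(max_unavailable, replicas), 0)
    # Pods that are already unhealthy reduce the budget; it never goes below zero.
    return max(healthy - desired_healthy, 0)


print(allowed_disruptions(2, 2, max_unavailable=1))      # 1
print(allowed_disruptions(5, 5, min_available="60%"))    # 2
print(allowed_disruptions(2, 1, max_unavailable=1))      # 0
```

The last call shows why an unhealthy replica matters: with one of two pods already not ready, no further eviction is allowed.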

Steps

  1. Check the current replica count and topology spread:

    kubectl get deployment ai-inference -o json | jq '{replicas: .spec.replicas, spread: .spec.template.spec.topologySpreadConstraints}'
    

    Expected output: the replica count and any topology spread constraints in effect.

  2. Decide on the PDB strategy. For stateless inference (replicas are interchangeable), use maxUnavailable. For stateful agent services with conversation affinity, use minAvailable:

    • maxUnavailable: 1 — allows at most 1 pod to be disrupted at a time (good for 2-3 replica setups)
    • minAvailable: 1 — ensures at least 1 pod is always running (good for 2 replica setups with critical availability)
    • For 5+ replicas, use a percentage such as minAvailable: 50% or maxUnavailable: 25%; Kubernetes rounds a percentage up to a whole number of pods
  3. Create the PDB manifest ai-inference-pdb.yaml:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: ai-inference-pdb
    spec:
      maxUnavailable: 1
      selector:
        matchLabels:
          app: vllm
    
  4. For a larger deployment with 5+ replicas, use minAvailable with a percentage:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: ai-agent-pdb
    spec:
      minAvailable: 60%
      selector:
        matchLabels:
          app: ai-agent
    
  5. Apply and verify the PDB:

    kubectl apply -f ai-inference-pdb.yaml
    kubectl get pdb ai-inference-pdb
    

    Expected output: ALLOWED DISRUPTIONS column showing allowed disruptions (e.g., 1 for a 2-replica deployment with maxUnavailable: 1).

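  6. Test the PDB by simulating a node drain. First, record which nodes the pods are running on, then dry-run the drain against one of those nodes:

    kubectl get pods -l app=vllm -o wide
    kubectl drain <node-name> --ignore-daemonsets --dry-run=server
    

    Expected: the drain lists the pods it would evict from that node. A client-side dry run (--dry-run=client) only lists pods and does not consult the PDB. A server-side dry run submits the eviction requests without persisting them, so a PDB with zero allowed disruptions should surface as an eviction error.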

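  7. For rolling upgrades, configure the Deployment strategy so that it is at least as conservative as the PDB. The Deployment controller does not go through the Eviction API, so a rolling update is not limited by the PDB; the Deployment's own maxUnavailable and maxSurge settings control how many pods are replaced at a time: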

    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0
          maxSurge: 1
    
  8. During a model upgrade, update the Deployment image and monitor the PDB:

    kubectl set image deployment/ai-inference vllm=vllm/vllm-openai:v0.5.0
    kubectl get pdb ai-inference-pdb -w
    

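    Expected: throughout the rollout, the number of unavailable pods never exceeds maxUnavailable (equivalently, the number of healthy pods never drops below minAvailable), and ALLOWED DISRUPTIONS returns to its pre-rollout value once the rollout completes.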
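
The manifests in steps 3 and 4 differ only in name, label, and which budget field they set. As a rough sketch, they could be generated from a small helper like the one below. The function name `build_pdb` is hypothetical, and the check reflects the rule that a PDB sets only one of minAvailable and maxUnavailable.

```python
import json


def build_pdb(name, app_label, min_available=None, max_unavailable=None):
    """Build a policy/v1 PodDisruptionBudget as a plain dict.

    Exactly one of min_available / max_unavailable must be given; each may be
    an int (pod count) or a percentage string such as "60%".
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    spec = {"selector": {"matchLabels": {"app": app_label}}}
    if min_available is not None:
        spec["minAvailable"] = min_available
    else:
        spec["maxUnavailable"] = max_unavailable
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # Equivalent of ai-inference-pdb.yaml from step 3, printed as JSON.
    print(json.dumps(build_pdb("ai-inference-pdb", "vllm", max_unavailable=1), indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped to `kubectl apply -f -`.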

Verification

kubectl get pdb ai-inference-pdb -o json | jq '{maxUnavailable: .spec.maxUnavailable, allowedDisruptions: .status.disruptionsAllowed, currentHealthy: .status.currentHealthy, desiredHealthy: .status.desiredHealthy}'

Expected output: JSON showing the PDB configuration and current status, with currentHealthy >= desiredHealthy (the minimum healthy pod count the controller derives from maxUnavailable or minAvailable).
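
For scripted checks, for example in a pre-upgrade pipeline, the same verification can be sketched in Python. This assumes the full output of `kubectl get pdb ai-inference-pdb -o json` is passed in as a string. The function name and the sample document are hypothetical; the status field names are those of the policy/v1 PodDisruptionBudget API.

```python
import json

# Hypothetical sample, trimmed to the fields the check reads.
SAMPLE_PDB_JSON = """
{
  "metadata": {"name": "ai-inference-pdb"},
  "spec": {"maxUnavailable": 1},
  "status": {"currentHealthy": 2, "desiredHealthy": 1, "disruptionsAllowed": 1}
}
"""


def check_pdb_status(pdb_json):
    """Summarise whether a PDB currently has headroom for a voluntary disruption."""
    pdb = json.loads(pdb_json)
    status = pdb.get("status", {})
    current = status.get("currentHealthy", 0)
    desired = status.get("desiredHealthy", 0)
    allowed = status.get("disruptionsAllowed", 0)
    return {
        "name": pdb["metadata"]["name"],
        "healthy_ok": current >= desired,   # enough healthy pods for the budget
        "drain_can_proceed": allowed > 0,   # at least one eviction is permitted
        "headroom": current - desired,      # healthy pods above the required minimum
    }


print(check_pdb_status(SAMPLE_PDB_JSON))
```

A pipeline could refuse to start a node drain when `drain_can_proceed` is false, rather than letting the drain retry evictions until it times out.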

Common failures

  • PDB blocks all voluntary disruptions — if minAvailable equals the total replica count, zero disruptions are allowed, including node drains and cluster autoscaler. Set minAvailable to total replicas minus 1: for a 3-replica deployment, use minAvailable: 2.
  • PDB not enforced during rolling updates — the Deployment controller manages rolling updates separately from the Eviction API. Set Deployment maxUnavailable equal to or less than the PDB's maxUnavailable for consistency.
  • Allowed disruptions lower than expected: pods that are already unhealthy (not ready) count against the budget. If a pod fails its health checks on its own, it uses up the budget even though it was not a planned disruption, so ALLOWED DISRUPTIONS can drop to 0 and block a drain until the pod recovers.
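
The first two failure modes can be caught before applying anything. The sketch below is a hypothetical pre-flight check, not an existing tool: `lint_pdb` and its parameters are illustrative names, `rollout_max_unavailable` is the Deployment strategy's maxUnavailable given as a plain integer, and PDB percentages are assumed to round up.

```python
def resolve(value, replicas):
    """Resolve an int or 'N%' string to a pod count, rounding percentages up."""
    if isinstance(value, str) and value.endswith("%"):
        return -(-int(value[:-1]) * replicas // 100)
    return int(value)


def lint_pdb(replicas, min_available=None, max_unavailable=None, rollout_max_unavailable=0):
    """Return warnings for PDB settings that commonly cause trouble."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = resolve(min_available, replicas)
    else:
        desired_healthy = max(replicas - resolve(max_unavailable, replicas), 0)
    budget = max(replicas - desired_healthy, 0)  # pods that may be down at once

    warnings = []
    if desired_healthy >= replicas:
        warnings.append("PDB allows zero disruptions: node drains and scale-down will be blocked")
    if rollout_max_unavailable > budget:
        warnings.append("Deployment maxUnavailable exceeds the PDB budget: a rolling update "
                        "can take down more pods than the PDB would allow")
    return warnings


print(lint_pdb(3, min_available=3))   # one warning: zero disruptions allowed
print(lint_pdb(3, min_available=2))   # []
```

The third failure mode depends on live pod health, so it cannot be checked from the spec alone; it shows up in the PDB status instead.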
