Fault Tolerance — Local AI Clusters (Chapter 15)

Distributed AI serving must survive node failures, pod evictions, and GPU errors. The main tools are redundancy, checkpointing, and graceful degradation.

High Availability Inference Deployment

Deploy inference servers across multiple nodes with pod anti-affinity to prevent single-node concentration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
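spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-inference
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: llama-inference  # must match the selector and the anti-affinity rule
    spec: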
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - llama-inference
            topologyKey: kubernetes.io/hostname
      containers:
      - name: inference
        image: ghcr.io/gventroultingenAI/llama.cpp:latest
        resources:
          limits:
            nvidia.com/gpu: 1

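With maxUnavailable: 0, the rollout never removes a ready pod until its replacement is ready, so all three replicas keep serving during an update. This only helps if a readiness probe ensures that "ready" means the model is loaded. Because the anti-affinity rule is required and maxSurge is 1, the surge pod needs a fourth GPU node; without one the rollout stalls.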
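The arithmetic behind these two settings can be sketched in a few lines. This is a minimal illustration, assuming absolute integer values (Kubernetes also accepts percentages) and a hard one-pod-per-node anti-affinity rule; the function names rollout_bounds and nodes_needed are hypothetical helpers, not part of any Kubernetes client.

```python
def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (min_ready, max_total) pod counts during a rolling update.

    Assumes absolute integer values; Kubernetes also accepts percentages.
    """
    min_ready = replicas - max_unavailable   # pods guaranteed to stay available
    max_total = replicas + max_surge         # most pods that may exist at once
    return min_ready, max_total


def nodes_needed(replicas, max_surge):
    """Nodes needed to schedule every pod the rollout may create,
    assuming hard anti-affinity allows only one pod per node."""
    return replicas + max_surge


min_ready, max_total = rollout_bounds(replicas=3, max_surge=1, max_unavailable=0)
print(min_ready, max_total, nodes_needed(3, 1))  # 3 4 4
```

For the Deployment above this gives three pods always available, at most four pods in total, and therefore four schedulable GPU nodes.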

Checkpoint-Based Recovery

Persistent model state requires periodic checkpointing to shared storage:

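import os
import time
import torch

CHECKPOINT_PATH = '/shared/checkpoints/model.pt'

def checkpoint_model(model, interval=300):
    """Save model state every `interval` seconds (default: 5 minutes)."""
    while True:
        # Write to a temp file, then rename, so readers never see a partial file.
        tmp_path = CHECKPOINT_PATH + '.tmp'
        torch.save(model.state_dict(), tmp_path)
        os.replace(tmp_path, CHECKPOINT_PATH)
        time.sleep(interval)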

# Kubernetes CronJob that keeps a backup copy of the latest checkpoint
apiVersion: batch/v1
kind: CronJob
metadata:
  name: checkpoint-manager
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: checkpoint
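            image: python:3.11
            # Plain file copy: the python:3.11 image does not include torch.
            command: ["python", "-c", "import shutil; shutil.copy2('/shared/checkpoints/model.pt', '/shared/checkpoints/model_backup.pt')"]
            volumeMounts:
            - name: model-storage
              mountPath: /shared
          restartPolicy: OnFailure
          volumes:
          - name: model-storage
            persistentVolumeClaim:
              claimName: model-storage  # assumed PVC name; substitute your own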

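A restarted pod resumes from the latest checkpoint only if its startup code loads it, for example with model.load_state_dict(torch.load('/shared/checkpoints/model.pt')), before it begins serving traffic.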
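A pod killed in the middle of a write can leave a truncated checkpoint behind, so the restore step benefits from a fallback to the backup copy. The following is a minimal sketch, not a definitive implementation: restore_latest is a hypothetical helper, and load_fn stands in for whatever loader the application uses (torch.load for the PyTorch checkpoints above).

```python
import os


def restore_latest(candidates, load_fn):
    """Return (path, state) for the newest candidate that loads cleanly.

    candidates: checkpoint file paths; load_fn: e.g. torch.load for PyTorch.
    Returns (None, None) when no candidate can be loaded.
    """
    existing = [p for p in candidates if os.path.isfile(p)]
    # Newest first, so the backup is only used when the primary is unusable.
    for path in sorted(existing, key=os.path.getmtime, reverse=True):
        try:
            return path, load_fn(path)
        except Exception:
            # Truncated or corrupt file (e.g. pod killed mid-write): try the next.
            continue
    return None, None
```

At startup, a server could call restore_latest(['/shared/checkpoints/model.pt', '/shared/checkpoints/model_backup.pt'], torch.load) and pass the returned state to model.load_state_dict when it is not None.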

GPU Error Handling

GPU errors surface as Xid messages in the kernel log, as ECC counters in nvidia-smi -q, and through DCGM health checks:

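# Check GPU health for group 0 (enable watches first with: dcgmi health -g 0 -s a)
dcgmi health -g 0 -c

# Inspect ECC error counters on GPU 0
nvidia-smi -i 0 --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader

# Reset volatile ECC counters, or reset the whole GPU if it is wedged (both need root)
nvidia-smi -i 0 --reset-ecc-errors=0
nvidia-smi --gpu-reset -i 0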

When a GPU fails, take the affected node out of service so that Kubernetes reschedules its workloads elsewhere:

# Mark node unhealthy based on GPU failure
kubectl label node gpu-worker-1 node.kubernetes.io/gpu-error=true

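# Evict workloads (drain also cordons the node)
kubectl drain gpu-worker-1 --ignore-daemonsets --delete-emptydir-data --force

# After repair, remove the label and return the node to service
kubectl label node gpu-worker-1 node.kubernetes.io/gpu-error-
kubectl uncordon gpu-worker-1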
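The decision to drain can be driven by the ECC counters. The sketch below assumes the counters were collected with nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total --format=csv,noheader (check nvidia-smi --help-query-gpu for the fields your driver supports). The helper names, the threshold, and the sample text are illustrative assumptions, not captured from real hardware.

```python
def parse_ecc_rows(csv_text):
    """Parse 'index, uncorrected' CSV rows into {gpu_index: count or None}.

    None marks GPUs that report no ECC counter (non-numeric field).
    """
    counts = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index_field, value_field = [f.strip() for f in line.split(",", 1)]
        counts[int(index_field)] = int(value_field) if value_field.isdigit() else None
    return counts


def gpus_to_drain(counts, threshold=0):
    """Return sorted GPU indices whose uncorrected error count exceeds threshold."""
    return sorted(i for i, n in counts.items() if n is not None and n > threshold)


sample = "0, 0\n1, 3\n2, [N/A]"  # illustrative output, not from real hardware
print(gpus_to_drain(parse_ecc_rows(sample)))  # [1]
```

A node agent could run this periodically and trigger the label and drain commands for the node when the returned list is non-empty.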

Circuit Breaker Pattern

Implement circuit breakers in inference proxies to prevent cascade failures:

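from circuitbreaker import CircuitBreakerError, circuit

class GPUOomError(RuntimeError):
    """Raised by direct_inference when the GPU runs out of memory."""

# direct_inference and fallback_inference are the application's own backends.
@circuit(failure_threshold=5, recovery_timeout=60)
def guarded_inference(prompt, model_version):
    # Let exceptions propagate so the breaker can count them.
    return direct_inference(prompt, model_version)

def call_model_with_fallback(prompt, model_version="v2"):
    try:
        return guarded_inference(prompt, model_version)
    except (GPUOomError, CircuitBreakerError):
        # Primary failed or the circuit is open: use the smaller model.
        return fallback_inference(prompt, "v1-compressed")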

Timeout and retry budgets prevent hung requests from consuming all available connections.
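One way to combine a per-call timeout with a retry budget is sketched below. It is a minimal illustration under stated assumptions: RetryBudget and call_with_budget are hypothetical names, send is any callable that takes (prompt, timeout_s) and raises TimeoutError when the backend does not answer in time, and the 10% ratio is an arbitrary starting point.

```python
import threading


class RetryBudget:
    """Allow retries only up to a fixed fraction of the requests seen so far."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0
        self._lock = threading.Lock()

    def record_request(self):
        with self._lock:
            self.requests += 1

    def try_acquire_retry(self):
        """Consume one retry if the budget allows it; return True on success."""
        with self._lock:
            allowed = self.min_retries + self.ratio * self.requests
            if self.retries < allowed:
                self.retries += 1
                return True
            return False


def call_with_budget(send, prompt, budget, timeout_s=5.0, max_attempts=3):
    """Call send(prompt, timeout_s); retry on TimeoutError while budget allows."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return send(prompt, timeout_s)
        except TimeoutError:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_acquire_retry():
                raise
```

Because the budget is shared across requests, a backend that starts timing out receives at most a bounded fraction of extra traffic from retries instead of a multiple of the normal load.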