15. Fault Tolerance
Distributed AI serving requires resilience against node failures, pod evictions, and GPU errors through redundancy, checkpointing, and graceful degradation strategies.
High Availability Inference Deployment
Deploy inference servers across multiple nodes with pod anti-affinity to prevent single-node concentration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 3
  selector:                   # required: must match the pod template labels
    matchLabels:
      app: llama-inference
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: llama-inference  # the anti-affinity rule below selects on this label
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - llama-inference
            topologyKey: kubernetes.io/hostname
      containers:
      - name: inference
        image: ghcr.io/gventroultingenAI/llama.cpp:latest
        resources:
          limits:
            nvidia.com/gpu: 1
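With maxUnavailable: 0 and maxSurge: 1, a rolling update starts one new pod and waits for it to become ready before terminating an old one, so serving capacity never drops below three replicas. Because the anti-affinity rule is required rather than preferred, the surge pod needs a fourth GPU node; without one it stays Pending and the rollout stalls.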
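The arithmetic behind the rollout settings above can be sketched in a few lines. This is a minimal illustration, assuming absolute (non-percentage) values for maxSurge and maxUnavailable; the names RolloutBounds and rollout_bounds are hypothetical helpers, not part of any Kubernetes client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutBounds:
    min_available: int         # fewest pods that stay available during the rollout
    max_total: int             # most pods (old + new) that can exist at once
    nodes_for_full_surge: int  # nodes needed to schedule every surge pod immediately


def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int) -> RolloutBounds:
    """Compute capacity bounds for a RollingUpdate with absolute values."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination because the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    max_total = replicas + max_surge
    return RolloutBounds(
        min_available=replicas - max_unavailable,
        max_total=max_total,
        # Required anti-affinity on kubernetes.io/hostname allows one matching pod per node.
        nodes_for_full_surge=max_total,
    )


bounds = rollout_bounds(replicas=3, max_surge=1, max_unavailable=0)
print(bounds)
```

For the Deployment above this gives three available pods at all times and a peak of four pods, which is why a cluster with exactly three GPU nodes cannot complete the rollout under a required hostname anti-affinity rule.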
Checkpoint-Based Recovery
Persistent model state requires periodic checkpointing to shared storage:
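import os
import time

import torch

def checkpoint_model(model, interval=300, path='/shared/checkpoints/model.pt'):
    """Save model state every `interval` seconds (default: 5 minutes)."""
    while True:
        # Write to a temporary file, then rename it, so readers never see a partial checkpoint.
        tmp_path = path + '.tmp'
        torch.save(model.state_dict(), tmp_path)
        os.replace(tmp_path, path)
        time.sleep(interval)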
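# Kubernetes CronJob that copies the latest checkpoint to a backup file
apiVersion: batch/v1
kind: CronJob
metadata:
  name: checkpoint-manager
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: checkpoint
            image: python:3.11
            # python:3.11 does not ship torch, so copy the file instead of loading and re-saving it
            command: ["python", "-c", "import shutil; shutil.copy2('/shared/checkpoints/model.pt', '/shared/checkpoints/model_backup.pt')"]
            volumeMounts:
            - name: model-storage
              mountPath: /shared
          volumes:
          - name: model-storage
            persistentVolumeClaim:
              claimName: model-storage-pvc  # assumed PVC name; replace with your own
          restartPolicy: OnFailure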
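A restarted pod can restore from the latest checkpoint, provided its startup code loads the file from the shared volume; Kubernetes does not do this automatically.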
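The restore step might look like the following sketch, which picks the newest usable file among the primary checkpoint and its backup. The helper name latest_checkpoint is hypothetical, and the check is deliberately shallow (the file exists and is non-empty); it does not verify that the contents are a valid checkpoint.

```python
import os
from typing import Optional, Sequence


def latest_checkpoint(paths: Sequence[str]) -> Optional[str]:
    """Return the most recently modified non-empty file among paths, or None."""
    candidates = []
    for path in paths:
        try:
            info = os.stat(path)
        except FileNotFoundError:
            continue  # this checkpoint has not been written yet
        if info.st_size > 0:
            candidates.append((info.st_mtime, path))
    if not candidates:
        return None
    # max() compares modification times first, so the newest file wins.
    return max(candidates)[1]


chosen = latest_checkpoint([
    "/shared/checkpoints/model.pt",
    "/shared/checkpoints/model_backup.pt",
])
print(chosen)
# At startup the pod would then load it, for example:
#   model.load_state_dict(torch.load(chosen, map_location="cpu"))
```

Returning None rather than raising lets the caller decide whether a missing checkpoint means "start from the base weights" or "fail the pod".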
GPU Error Handling
NVIDIA drivers report GPU errors as Xid messages in the kernel log, through nvidia-smi -q, and through DCGM health checks:
# Enable all health watches on the default group, then run a health check
dcgmi health -g 0 -s a
dcgmi health -g 0 -c
# Inspect ECC error counters
nvidia-smi -q -d ECC
# Reset volatile ECC error counters on GPU 0, if ECC is supported
nvidia-smi -i 0 -p 0
Kubernetes does not repair GPU faults itself; an operator or automation takes the affected node out of service:
# Label the node to record the GPU failure (a label alone does not stop scheduling)
kubectl label node gpu-worker-1 node.kubernetes.io/gpu-error=true
# Cordon the node and evict its workloads
kubectl drain gpu-worker-1 --ignore-daemonsets --delete-emptydir-data --force
# After repair, return the node to service
kubectl uncordon gpu-worker-1
Circuit Breaker Pattern
Implement circuit breakers in inference proxies to prevent cascade failures:
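from circuitbreaker import circuit

# direct_inference, fallback_inference, and GPUOomError are application-specific
# and must be defined elsewhere.

@circuit(failure_threshold=5, recovery_timeout=60)
def call_model_with_fallback(prompt, model_version="v2"):
    try:
        return direct_inference(prompt, model_version)
    except GPUOomError:
        # Handled here, so an OOM counts toward the breaker only if the fallback also fails.
        return fallback_inference(prompt, "v1-compressed")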
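To show what a decorator like this does internally, here is a minimal standard-library sketch of the closed, open, and half-open states. It is an illustration under simplifying assumptions (single-threaded use, consecutive-failure counting), not the implementation of the circuitbreaker package; the class names CircuitBreaker and CircuitOpenError are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Recovery timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast with CircuitOpenError instead of queueing behind a backend that is already struggling, which is what stops one failing replica from dragging down the proxy in front of it.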
Timeouts and retry budgets prevent hung requests from consuming all available connections.
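One way to combine the two ideas is a shared token budget for retries plus a single deadline per request. The sketch below uses only the standard library; RetryBudget and call_with_budget are hypothetical names, and the callable passed in is assumed to accept the remaining time in seconds and enforce it on its own network call.

```python
import time


class RetryBudget:
    """Allow retries only up to a fixed fraction of first attempts."""

    def __init__(self, ratio=0.1, min_tokens=1.0, max_tokens=10.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self._tokens = min_tokens

    def record_request(self):
        # Each first attempt earns a fraction of a retry token.
        self._tokens = min(self.max_tokens, self._tokens + self.ratio)

    def try_acquire(self):
        # A retry spends one whole token; refuse when the budget is exhausted.
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def call_with_budget(func, budget, timeout_s=5.0, max_attempts=3, clock=time.monotonic):
    """Call func(remaining_seconds), retrying within one deadline and the shared budget."""
    deadline = clock() + timeout_s
    budget.record_request()
    attempt = 1
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded before attempt")
        try:
            return func(remaining)
        except Exception:
            # Give up when attempts or the shared retry budget run out.
            if attempt >= max_attempts or not budget.try_acquire():
                raise
            attempt += 1
```

Because the budget is shared across requests, a widespread outage drains it quickly and most requests fail after a single attempt, instead of every client multiplying load on the backend by max_attempts.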
Deploy a three-replica inference deployment with pod anti-affinity, deliberately delete one pod, verify service continuity, then force a node eviction and observe recovery time from checkpoint.
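To put a number on recovery time during this exercise, one option is to probe the service at a fixed interval, record a timestamp and a success flag for each probe, and compute the longest outage afterwards. The sketch below covers only the computation; longest_outage is a hypothetical helper, and the samples are assumed to be in time order.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest span in seconds from a first failed probe to the next success."""
    longest = 0.0
    down_since = None
    for timestamp, ok in samples:
        if not ok and down_since is None:
            down_since = timestamp  # outage begins
        elif ok and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None  # outage ends
    # An outage still in progress at the last sample is not counted.
    return longest


probes = [(0.0, True), (1.0, False), (2.0, False), (4.0, True), (5.0, False), (6.0, True)]
print(longest_outage(probes))
```

The measured value is bounded below by the probe interval, so a short interval gives a more precise figure for both the pod deletion and the node eviction case.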