RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Local AI Clusters

15. Fault Tolerance

Chapter 15 of 18 · 20 min
KEY INSIGHT

Fault tolerance in AI clusters combines Kubernetes HA patterns (anti-affinity, rolling updates) with application-level resilience (checkpointing, circuit breakers). GPU hardware failures require automated node eviction and replacement workflows rather than manual intervention.

Distributed AI serving requires resilience against node failures, pod evictions, and GPU errors through redundancy, checkpointing, and graceful degradation strategies.

High Availability Inference Deployment

Deploy inference servers across multiple nodes with pod anti-affinity so that no two replicas share a node and a single node failure takes out at most one replica:

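apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: llama-inference
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: llama-inference  # the label the anti-affinity rule below matches
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - llama-inference
            topologyKey: kubernetes.io/hostname
      containers:
      - name: inference
        # CUDA build of the llama.cpp server; pin a tag or digest in production
        image: ghcr.io/ggml-org/llama.cpp:server-cuda
        resources:
          limits:
            nvidia.com/gpu: 1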

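maxUnavailable: 0 keeps all three replicas serving during a rolling update: the controller starts a new pod and waits for it to become ready before terminating an old one. Because the anti-affinity rule is required, that surge pod needs a spare GPU node. On a cluster with exactly three GPU nodes the rollout stalls, so either keep a fourth node available or switch to preferredDuringSchedulingIgnoredDuringExecution.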
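
The interaction between required anti-affinity and the rolling-update settings comes down to simple node arithmetic. The sketch below is a hypothetical helper (rollout_can_progress is not part of any Kubernetes client) and assumes one replica per GPU node, with old and new pods carrying the same app label so they repel each other:

```python
def rollout_can_progress(gpu_nodes: int, replicas: int,
                         max_surge: int, max_unavailable: int) -> bool:
    """Estimate whether a rolling update can proceed when a required
    hostname anti-affinity rule forces one pod per node.

    Assumption: every replica needs its own schedulable GPU node.
    """
    spare_nodes = gpu_nodes - replicas
    if spare_nodes < 0:
        # Not even the steady-state replicas fit.
        return False
    if max_unavailable > 0:
        # An old pod may be removed first, which frees its node for the new pod.
        return True
    # With maxUnavailable: 0 the new pod must start before any old pod stops,
    # so it needs a surge slot and a spare node to land on.
    return max_surge >= 1 and spare_nodes >= 1


if __name__ == "__main__":
    # The manifest above (3 replicas, maxSurge 1, maxUnavailable 0):
    print(rollout_can_progress(gpu_nodes=3, replicas=3, max_surge=1, max_unavailable=0))
    print(rollout_can_progress(gpu_nodes=4, replicas=3, max_surge=1, max_unavailable=0))
```

This ignores taints, resource fragmentation, and PodDisruptionBudgets, so treat it as a sanity check on capacity planning rather than a scheduler simulation.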

Checkpoint-Based Recovery

Persistent model state requires periodic checkpointing to shared storage:

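import os
import time
import torch

CHECKPOINT_PATH = '/shared/checkpoints/model.pt'

def checkpoint_model(model, interval=300):
    """Save model state every `interval` seconds. Blocks; run in a background thread."""
    while True:
        # Write to a temp file, then rename. os.replace is atomic on POSIX
        # filesystems, so a crash mid-save never leaves a truncated checkpoint.
        tmp_path = CHECKPOINT_PATH + '.tmp'
        torch.save(model.state_dict(), tmp_path)
        os.replace(tmp_path, CHECKPOINT_PATH)
        time.sleep(interval)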

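# Kubernetes CronJob that keeps a backup copy of the latest checkpoint
apiVersion: batch/v1
kind: CronJob
metadata:
  name: checkpoint-manager
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid  # never run two backup jobs at once
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: checkpoint
            image: python:3.11
            # Plain file copy: the stock python image does not include torch.
            # Copy to a temp name, then rename, so the backup is never partial.
            command: ["sh", "-c", "cp /shared/checkpoints/model.pt /shared/checkpoints/model_backup.pt.tmp && mv /shared/checkpoints/model_backup.pt.tmp /shared/checkpoints/model_backup.pt"]
            volumeMounts:
            - name: model-storage
              mountPath: /shared
          volumes:
          - name: model-storage
            persistentVolumeClaim:
              claimName: model-storage  # assumed PVC name; needs ReadWriteMany access
          restartPolicy: OnFailure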

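Failed pods restore from the latest checkpoint on restart, provided the application loads it at startup. Kubernetes restarts the container but does not reload any state by itself.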
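
A restore step at startup might look like the following minimal sketch. The helper name restore_latest and the injected load_fn are assumptions made for illustration; the idea is to try the newest checkpoint first and fall back to the backup copy if the newest file is missing, empty, or fails to load:

```python
import os
from typing import Any, Callable, Iterable, Optional, Tuple


def restore_latest(
    candidates: Iterable[str],
    load_fn: Callable[[str], Any],
) -> Tuple[Optional[str], Any]:
    """Return (path, state) for the newest checkpoint that loads.

    Returns (None, None) when no candidate can be loaded, in which case
    the caller should start from freshly initialised state.
    """
    # Keep only files that exist and are non-empty.
    existing = [p for p in candidates if os.path.isfile(p) and os.path.getsize(p) > 0]
    # Newest modification time first.
    existing.sort(key=os.path.getmtime, reverse=True)
    for path in existing:
        try:
            return path, load_fn(path)
        except Exception:
            # Truncated or corrupt file: try the next-newest candidate.
            continue
    return None, None
```

With PyTorch, load_fn could be `lambda p: torch.load(p, map_location="cpu")`, and the candidates would be the model.pt and model_backup.pt paths used above. The loaded state dict is then passed to model.load_state_dict before the server starts accepting traffic.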

GPU Error Handling

NVIDIA drivers report GPU errors as Xid messages in the kernel log and as ECC counters visible through nvidia-smi -q; DCGM adds health checks on top:

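# Enable health watches on all subsystems for group 0, then run a check
dcgmi health -g 0 -s a
dcgmi health -g 0 -c

# Inspect ECC error counters on GPU 0
nvidia-smi -q -d ECC -i 0

# Clear volatile ECC error counters on GPU 0 (requires root)
nvidia-smi -i 0 --reset-ecc-errors=0

# Reset GPU 0 if applicable (requires root and no running processes on the GPU)
nvidia-smi -i 0 --gpu-reset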

Kubernetes does not react to GPU faults by itself. An operator or an automated workflow takes the affected node out of service:

# Flag the node so operators and automation can find it
kubectl label node gpu-worker-1 node.kubernetes.io/gpu-error=true

# Cordon the node and evict its workloads
kubectl drain gpu-worker-1 --ignore-daemonsets --delete-emptydir-data --force

# After repair, remove the flag and return the node to service
kubectl label node gpu-worker-1 node.kubernetes.io/gpu-error-
kubectl uncordon gpu-worker-1
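
One way to automate the first half of this workflow is to watch the kernel log for Xid messages and generate the drain commands when a fatal code appears. The sketch below is illustrative only: the set of codes treated as fatal is an assumption (48, a double-bit ECC error, and 79, GPU has fallen off the bus, are common choices), and the function names are hypothetical:

```python
import re
from typing import Iterable, List, Optional

# Assumption: Xid codes that justify evicting the node. Tune for your fleet.
FATAL_XIDS = {48, 79}

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")


def parse_xid(line: str) -> Optional[int]:
    """Return the Xid code found in a kernel log line, or None."""
    match = XID_RE.search(line)
    return int(match.group("code")) if match else None


def eviction_commands(node: str) -> List[List[str]]:
    """Build the kubectl commands that take a node out of service."""
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node,
         "--ignore-daemonsets", "--delete-emptydir-data", "--force"],
    ]


def plan_for_log(node: str, kernel_log_lines: Iterable[str]) -> List[List[str]]:
    """Return eviction commands if any line reports a fatal Xid, else []."""
    for line in kernel_log_lines:
        if parse_xid(line) in FATAL_XIDS:
            return eviction_commands(node)
    return []
```

Returning the commands as argument lists keeps the decision logic testable; a caller can execute each one with subprocess.run(cmd, check=True) or hand the plan to whatever remediation system the cluster already uses.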

Circuit Breaker Pattern

Implement circuit breakers in inference proxies to prevent cascade failures:

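from circuitbreaker import circuit, CircuitBreakerError

# direct_inference, fallback_inference, and GPUOomError are application-defined.

@circuit(failure_threshold=5, recovery_timeout=60)
def call_primary(prompt, model_version="v2"):
    # Exceptions raised here count toward the failure threshold
    return direct_inference(prompt, model_version)

def call_model_with_fallback(prompt, model_version="v2"):
    try:
        return call_primary(prompt, model_version)
    except (GPUOomError, CircuitBreakerError):
        # Primary failed or the circuit is open: serve the smaller model
        return fallback_inference(prompt, "v1-compressed")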
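
To make the mechanics behind the decorator visible, here is a minimal stdlib-only sketch of the same pattern. It is not the circuitbreaker library's implementation; the class and exception names are hypothetical, it counts consecutive failures, and it is not thread-safe:

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                # Open: fail fast without touching the backend.
                raise CircuitOpenError("circuit open; skipping call")
            # Recovery timeout elapsed: half-open, allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get CircuitOpenError immediately and can route to a fallback model instead of queueing behind an unhealthy backend.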

Timeout and retry budgets prevent hung requests from consuming all available connections.
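
A retry budget can be sketched as a cap on retries relative to total requests. The version below is a simplified illustration with hypothetical names: it uses lifetime counters rather than a sliding window, and it assumes the wrapped callable enforces its own per-request timeout (for example through the HTTP client's timeout setting):

```python
class RetryBudget:
    """Allow retries only while they stay under a fixed fraction of requests."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio              # retries allowed per request, e.g. 0.1 = 10%
        self.min_retries = min_retries  # floor so low-traffic services can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def call_with_budget(fn, budget, max_attempts=3):
    """Call fn, retrying on failure only while the shared budget allows it."""
    budget.record_request()
    attempt = 1
    while True:
        try:
            return fn()
        except Exception:
            # Give up when this request is out of attempts or the budget is spent.
            if attempt >= max_attempts or not budget.try_acquire_retry():
                raise
            attempt += 1
```

Sharing one RetryBudget across all requests to a backend means that when the backend degrades, retries stop amplifying load once the budget is exhausted, and failures surface quickly to the circuit breaker.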

EXERCISE

Deploy a three-replica inference deployment with pod anti-affinity, deliberately delete one pod, verify service continuity, then force a node eviction and observe recovery time from checkpoint.
