09. Kubernetes Cluster Setup

Chapter 9 of 18 · 15 min

Kubernetes orchestrates AI workloads across nodes, handling scheduling, scaling, and failure recovery. For multi-node clusters running distributed training or serving, Kubernetes infrastructure must accommodate GPU scheduling, distributed initialization, and storage requirements.

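Core components: The NVIDIA device plugin advertises each node's GPUs to the kubelet as the nvidia.com/gpu resource, so pods can request them like CPU or memory. PyTorch elastic (the torchrun launcher, usually managed by a Kubernetes operator) provides fault-tolerant distributed training: failed workers are restarted and rejoin through a rendezvous backend without a manual restart of the job. Scheduler plugins can prioritize GPU-heavy workloads or bin-pack pods onto fewer nodes for efficiency.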
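
As one illustration of bin-packing, the default scheduler's NodeResourcesFit plugin can score nodes with the MostAllocated strategy, which prefers nodes that are already heavily used and so tends to keep whole GPU nodes free for large jobs. This is a sketch under assumptions: the profile name gpu-binpack and the weights are chosen for illustration, and the v1 configuration API requires Kubernetes 1.25 or later.

```yaml
# Scheduler configuration sketch; profile name and weights are assumptions.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-binpack        # hypothetical profile name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated       # favor nodes that are already busy
            resources:
              - name: nvidia.com/gpu
                weight: 5             # GPUs dominate the packing score
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

The file is passed to kube-scheduler through its --config flag, and a pod opts in by setting spec.schedulerName to gpu-binpack.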

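NVIDIA device plugin installation (this manifest deploys the device plugin only; NVIDIA drivers and the NVIDIA container toolkit must already be present on each GPU node, which the full GPU Operator would otherwise manage):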

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify GPU availability
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
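
To confirm that a pod can actually use an advertised GPU, a throwaway pod can request one and run nvidia-smi. This is a minimal sketch: the pod name gpu-smoke-test and the nvidia/cuda image tag are assumptions, so substitute an image that matches your driver version.

```yaml
# Smoke-test pod sketch; the name and image tag are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never                # run once, then stop
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]         # provided by the NVIDIA container runtime
      resources:
        limits:
          nvidia.com/gpu: 1           # request a single GPU
```

Apply it with kubectl apply -f, then read the result with kubectl logs gpu-smoke-test. A pod that stays in Pending usually means no node has a free nvidia.com/gpu.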

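Training job deployment with PyTorch elastic (schematic manifest; the apiVersion, kind, and spec field names depend on the operator installed in your cluster, so check its CRD before applying):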

apiVersion: elasticserve.pytorch.org/v1
kind: ElasticTorchJob
metadata:
  name: distributed-training
spec:
  replicaCount: 2
  torchBackend: nccl  # NCCL for GPU collectives; gloo is the CPU fallback
  rdzvEndpoint: "etcd:2379"
  nprocPerNode: 4
  configMaps:
    - name: training-script
  containers:
    - name: main
      image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
      resources:
        limits:
          nvidia.com/gpu: 4
      command: ["torchrun"]
      args:
        - "--nnodes=2"            # matches replicaCount
        - "--nproc_per_node=4"    # matches nprocPerNode and the GPU limit
        - "/scripts/train.py"
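
The manifest above mounts a ConfigMap named training-script and runs /scripts/train.py, but the chapter does not show the script. The following is one possible sketch of that ConfigMap, not the course's actual script. It assumes the train.py key is what appears at /scripts/train.py, relies on torchrun setting LOCAL_RANK, RANK, and WORLD_SIZE for each worker, and hardcodes NCCL as the backend because each worker holds a GPU. It runs a single all_reduce as a connectivity check rather than real training.

```yaml
# ConfigMap sketch; the script body is an illustrative assumption.
apiVersion: v1
kind: ConfigMap
metadata:
  name: training-script
data:
  train.py: |
    import os

    import torch
    import torch.distributed as dist


    def main():
        # torchrun exports LOCAL_RANK, RANK, and WORLD_SIZE for every worker.
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment.
        dist.init_process_group(backend="nccl")

        # Sum a tensor of ones across all workers as a connectivity check.
        t = torch.ones(1, device="cuda")
        dist.all_reduce(t)
        print(f"rank {dist.get_rank()} of {dist.get_world_size()}: sum={t.item()}")

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()
```

If the collective succeeds, every rank prints a sum equal to the world size, which makes a hung or partitioned worker easy to spot in the pod logs.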

Provision storage through PersistentVolumes backed by CSI drivers for Ceph or other distributed filesystems. StatefulSets give each pod a stable identity and its own PersistentVolumeClaim, so a rescheduled pod reattaches to the same volume and keeps its model cache.
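
The pod-to-volume pairing can be sketched with a StatefulSet that uses volumeClaimTemplates, which creates one PersistentVolumeClaim per replica. All names here are assumptions for illustration: model-server, the /models mount path, the 50Gi size, and especially the ceph-rbd storage class, which must be replaced with a class that exists in your cluster.

```yaml
# StatefulSet sketch; names, size, and storage class are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: model-server
spec:
  serviceName: model-server           # a headless Service with this name must exist
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
          command: ["sleep", "infinity"]   # placeholder for a real serving process
          volumeMounts:
            - name: model-cache
              mountPath: /models
  volumeClaimTemplates:
    - metadata:
        name: model-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ceph-rbd    # hypothetical CSI-backed storage class
        resources:
          requests:
            storage: 50Gi
```

Each replica receives a claim named after the template and the pod (model-cache-model-server-0, model-cache-model-server-1), and the claim survives pod deletion, so a downloaded model does not need to be fetched again after a restart.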

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Deploy a minimal Kubernetes cluster with the NVIDIA device plugin installed and schedule a simple PyTorch job. Verify GPU allocation with kubectl describe pod and monitor the job until it completes. Before attempting distributed training jobs, measure pod-to-pod network latency and bandwidth and confirm they are acceptable for the collective operations your training uses.