09. Kubernetes Cluster Setup
Kubernetes orchestrates AI workloads across nodes, handling scheduling, scaling, and failure recovery. For multi-node clusters running distributed training or serving, Kubernetes infrastructure must accommodate GPU scheduling, distributed initialization, and storage requirements.
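Core components: The NVIDIA device plugin exposes GPUs to Kubernetes pods as the schedulable resource nvidia.com/gpu. PyTorch Elastic, deployed through an elastic operator, provides fault-tolerant distributed training: failed workers are restarted and rejoin the job without a manual restart. Scheduler plugins can prioritize GPU-heavy workloads or bin-pack pods onto fewer nodes for efficiency.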
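To make the bin-packing idea concrete, the toy sketch below places each GPU request on the node with the fewest free GPUs that can still hold it. It is an illustration only, not the kube-scheduler's scoring code; the node names, pod names, and both function names are hypothetical.

```python
from typing import Dict, Optional


def pick_node_binpack(free_gpus: Dict[str, int], request: int) -> Optional[str]:
    """Return the node with the fewest free GPUs that still fits the request."""
    candidates = [(free, name) for name, free in free_gpus.items() if free >= request]
    if not candidates:
        return None  # no node can host this request; the pod would stay pending
    # min() compares free GPU count first, then node name as a tie-breaker.
    return min(candidates)[1]


def place_pods(free_gpus: Dict[str, int], requests: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Place pods one by one, largest GPU request first, using the rule above."""
    free = dict(free_gpus)  # work on a copy so the caller's view is unchanged
    placement: Dict[str, Optional[str]] = {}
    for pod, gpus in sorted(requests.items(), key=lambda kv: (-kv[1], kv[0])):
        node = pick_node_binpack(free, gpus)
        placement[pod] = node
        if node is not None:
            free[node] -= gpus
    return placement


if __name__ == "__main__":
    # Hypothetical cluster: one 8-GPU node and one 4-GPU node.
    print(place_pods({"node-a": 8, "node-b": 4}, {"train": 4, "eval": 2, "notebook": 1}))
```

Packing the 4-GPU job onto the smaller node keeps the larger node's GPUs free for a later job that needs more of them on one machine, which is the efficiency argument for bin-packing over spreading.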
NVIDIA device plugin installation (the nodes must already have the NVIDIA driver and container toolkit installed):
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPU availability
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
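The jsonpath query above prints bare numbers without node names. As a hedged alternative, the sketch below reads the output of `kubectl get nodes -o json` and reports allocatable GPUs per node. The function name `allocatable_gpus` and the script name in the usage note are hypothetical; the JSON fields it reads (`items`, `metadata.name`, `status.allocatable`) are the standard node list layout.

```python
import json
import sys
from typing import Dict

GPU_RESOURCE = "nvidia.com/gpu"  # resource name advertised by the NVIDIA device plugin


def allocatable_gpus(nodes_json: str) -> Dict[str, int]:
    """Map each node name to its allocatable GPU count (0 if none is advertised)."""
    doc = json.loads(nodes_json)
    result: Dict[str, int] = {}
    for item in doc.get("items", []):
        name = item["metadata"]["name"]
        allocatable = item.get("status", {}).get("allocatable", {})
        # GPU quantities are reported as integer strings such as "4".
        result[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return result


if __name__ == "__main__":
    for node, count in sorted(allocatable_gpus(sys.stdin.read()).items()):
        print(f"{node}\t{count}")
```

Assuming the script is saved as gpu_report.py, it could be run as: kubectl get nodes -o json | python gpu_report.py. A node that shows 0 after the plugin is installed usually points at a missing driver or container runtime configuration on that node.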
Training job deployment with PyTorch elastic:
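# Field names follow this chapter's example; check them against the CRD
# that your elastic operator actually installs before applying.
apiVersion: elasticserve.pytorch.org/v1
kind: ElasticTorchJob
metadata:
  name: distributed-training
spec:
  replicaCount: 2              # number of worker pods (nodes)
  torchBackend: gloo           # nccl is the usual choice for GPU collectives
  rdzvEndpoint: "etcd:2379"    # rendezvous store used to discover workers
  nprocPerNode: 4              # one process per requested GPU
  configMaps:
    - name: training-script
  containers:
    - name: main
      # Official images carry a CUDA/cuDNN suffix in the tag.
      image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
      resources:
        limits:
          nvidia.com/gpu: 4
      command: ["torchrun"]
      args:
        - "--nnodes=$(RDZV_NODES)"
        - "--nproc_per_node=$(NPROC_PER_NODE)"
        - "/scripts/train.py"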
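The manifest launches /scripts/train.py through torchrun, which passes each worker its identity through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). The sketch below shows one way the start of such a script might read them. It is a minimal sketch using only the standard library: `WorkerEnv`, `read_worker_env`, and `expected_world_size` are hypothetical names, and the actual `torch.distributed` setup is indicated only in a comment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerEnv:
    rank: int          # global rank across all nodes
    local_rank: int    # rank within this pod; typically selects the GPU
    world_size: int    # total number of worker processes
    master_addr: str
    master_port: int


def read_worker_env(env: Mapping[str, str] = os.environ) -> WorkerEnv:
    """Read the variables torchrun sets for each worker process."""
    return WorkerEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env.get("MASTER_ADDR", "localhost"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


def expected_world_size(nnodes: int, nproc_per_node: int) -> int:
    """World size when every node runs the same number of processes."""
    return nnodes * nproc_per_node


if __name__ == "__main__":
    worker = read_worker_env()
    # A real script would continue here, for example by calling
    # torch.distributed.init_process_group(backend=...) and binding
    # the process to the GPU indexed by worker.local_rank.
    print(worker)
```

With the values from the manifest (2 replicas, 4 processes per node), every worker should see a world size of 8; a mismatch at startup is an early sign that rendezvous did not include all pods.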
Storage is provisioned through PersistentVolumes backed by CSI drivers for Ceph or other distributed filesystems. StatefulSets give each pod a stable identity and a dedicated PersistentVolumeClaim, so a rescheduled pod reattaches to the same volume and keeps its model cache.
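The pod-to-volume affinity comes from naming: a StatefulSet names its pods `<statefulset>-<ordinal>` and the claims created from a volumeClaimTemplate `<template>-<pod name>`. The sketch below derives those pairs so they can be compared with the output of kubectl get pvc. The StatefulSet name `trainer` and template name `model-cache` are hypothetical, as are the helper function names.

```python
from typing import List, Tuple


def pod_name(statefulset: str, ordinal: int) -> str:
    """Stable pod name assigned by a StatefulSet."""
    return f"{statefulset}-{ordinal}"


def pvc_name(claim_template: str, statefulset: str, ordinal: int) -> str:
    """Name of the PersistentVolumeClaim created for that pod."""
    return f"{claim_template}-{pod_name(statefulset, ordinal)}"


def volume_bindings(claim_template: str, statefulset: str, replicas: int) -> List[Tuple[str, str]]:
    """List (pod, claim) pairs for ordinals 0 .. replicas-1."""
    return [
        (pod_name(statefulset, i), pvc_name(claim_template, statefulset, i))
        for i in range(replicas)
    ]


if __name__ == "__main__":
    for pod, claim in volume_bindings("model-cache", "trainer", 2):
        print(f"{pod} -> {claim}")
```

Because the claim name depends only on the template name and the pod's ordinal, a replacement pod with the same ordinal binds to the same claim, which is what keeps a warmed model cache attached across restarts.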
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Deploy a minimal Kubernetes cluster with the NVIDIA device plugin installed and schedule a simple PyTorch job. Verify GPU allocation with kubectl describe pod, checking that nvidia.com/gpu appears under the container's limits, and monitor the job until it completes. Before attempting distributed training jobs, confirm that pod-to-pod networking latency is acceptable for your collective operation requirements.
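A rough way to put a number on pod-to-pod latency is to time small TCP round trips between two pods. The sketch below is a stand-in for a proper benchmark under several assumptions: `start_echo_server` and `measure_rtt_ms` are hypothetical helpers, the payload size and round count are arbitrary, and TCP echo timing only approximates what a collective communication library would experience.

```python
import socket
import statistics
import threading
import time
from typing import List, Tuple


def start_echo_server(host: str = "127.0.0.1", port: int = 0) -> Tuple[socket.socket, int]:
    """Start a single-connection TCP echo server in a background thread."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))  # port 0 lets the OS pick a free port
    server.listen(1)

    def serve() -> None:
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(4096)
                if not data:  # client closed the connection
                    break
                conn.sendall(data)

    threading.Thread(target=serve, daemon=True).start()
    return server, server.getsockname()[1]


def measure_rtt_ms(host: str, port: int, rounds: int = 50, payload: bytes = b"x" * 64) -> List[float]:
    """Send the payload `rounds` times and return each round-trip time in milliseconds."""
    samples: List[float] = []
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # avoid batching small writes
        for _ in range(rounds):
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            while received < len(payload):
                chunk = sock.recv(len(payload) - received)
                if not chunk:
                    raise ConnectionError("peer closed the connection early")
                received += len(chunk)
            samples.append((time.perf_counter() - start) * 1000.0)
    return samples


if __name__ == "__main__":
    # Local smoke test; between pods, run the server in one pod (bound to
    # 0.0.0.0) and point measure_rtt_ms at that pod's IP from another pod.
    server, port = start_echo_server()
    rtts = measure_rtt_ms("127.0.0.1", port)
    print(f"median RTT: {statistics.median(rtts):.3f} ms over {len(rtts)} rounds")
    server.close()
```

What counts as acceptable depends on the job, so record the median and the worst samples alongside the other checkpoint details and compare them across node pairs rather than against a fixed threshold.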