18. Local AI Cluster Project
Chapter 18 of 18 · 25 min
This chapter integrates concepts from chapters 1-17 into a complete, functioning local AI cluster serving production inference workloads.
Final Architecture
The project implements:
- Kubernetes cluster with 2+ worker nodes
- NVIDIA GPU Operator managing driver lifecycle
- Slurm for batch training workloads
- MinIO model repository with versioning
- NGINX Ingress with health checks
- kube-prometheus-stack with DCGM metrics
- Distributed inference serving with checkpoint backup
Deployment Sequence
Deploy components in dependency order:
# 1. Kubernetes cluster initialization (Chapter 3)
kubeadm init --control-plane-endpoint "cluster.local:6443"
# 2. NVIDIA GPU Operator (Chapter 10)
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace
# 3. Networking and storage
kubectl apply -f networking.yaml # CNI configuration
helm install nfs-server stable/nfs-server-provisioner \
--namespace storage --create-namespace
# 4. Slurm for batch workloads (Chapter 11)
apt-get install -y slurmd slurmctld mariadb-server
systemctl enable --now slurmctld slurmd
# 5. Model repository (Chapter 12)
helm install minio minio/minio \
--namespace model-storage --create-namespace \
--set persistence.size=1Ti
# 6. Monitoring (Chapter 14)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace
# 7. Load balancing ingress (Chapter 13)
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress --create-namespace
Inference Serving Deployment
The inference deployment implements fault tolerance patterns:
apiVersion: v1
kind: ConfigMap
metadata:
name: inference-config
data:
MODEL_PATH: "s3://models/llama-3-8b.Q4_K_M.gguf"
MAX_BATCH_SIZE: "16"
CONTEXT_LENGTH: "4096"
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-inference
labels:
app: llama-inference
version: v2.2
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: llama-inference
template:
metadata:
labels:
app: llama-inference
version: v2.2
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- llama-inference
topologyKey: kubernetes.io/hostname
containers:
- name: inference
image: ghcr.io/gventroulingenAI/llama.cpp:v2.2
ports:
- containerPort: 8080
name: inference
- containerPort: 9090
name: metrics
envFrom:
- configMapRef:
name: inference-config
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 3
resources:
limits:
nvidia.com/gpu: 1
memory: 32Gi
requests:
nvidia.com/gpu: 1
memory: 16Gi
volumeMounts:
- name: checkpoint
mountPath: /shared/checkpoints
volumes:
- name: checkpoint
persistentVolumeClaim:
claimName: checkpoint-pvc
---
apiVersion: v1
kind: Service
metadata:
name: inference-service
spec:
type: ClusterIP
selector:
app: llama-inference
ports:
- name: inference
port: 8080
targetPort: 8080
- name: metrics
port: 9090
targetPort: 9090
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: inference-ingress
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: "10G"
nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
rules:
- host: inference.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: inference-service
port:
number: 8080
Verification Tests
Run end-to-end verification:
# 1. Verify all pods running
kubectl get pods -A | grep -v Running
# 2. Test inference endpoint
curl -X POST http://inference.local/completion \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain quantum computing", "max_tokens": 100}' \
-w "\nTime: %{time_total}s\n"
# 3. Verify monitoring采集
curl -s http://inference.local:9090/metrics | grep inference_request_seconds
# 4. Run Slurm job
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=cluster-verify
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
srun nvidia-smi
EOF
squeue
Troubleshooting Checklist
When issues occur:
# Core functionality checks
kubectl get nodes -o wide
kubectl get pods -A -o wide
nvidia-smi
# GPU Operator issues
kubectl logs -n gpu-operator -l app=nvidia-driver
kubectl describe daemonset nvidia-driver-daemonset -n gpu-operator
# Inference service issues
kubectl logs -l app=llama-inference --tail=100
kubectl describe deployment llama-inference
kubectl exec -it $(kubectl get pods -l app=llama-inference -o jsonpath='{.items[0].metadata.name}') \
-- wget -qO- http://localhost:8080/health
# Monitoring gaps
kubectl get svc -n monitoring
kubectl get prometheus -n monitoring
Project Success Criteria
The cluster passes validation with:
- All pods in Running state for 24+ hours
- Inference latency P95 below 2 seconds for single requests
- Slurm job execution with GPU allocation
- DCGM metrics visible in Grafana
- Model loading from MinIO repository
- Zero failed requests during rolling update
- Checkpoint save/restore functionality verified
EXERCISE
Deploy the complete inference infrastructure from this chapter on a development cluster, successfully run 10 inference requests, verify metrics appear in Grafana, then deliberately fail one replica to demonstrate automatic failover recovery.