Local AI Cluster Project — Local AI Clusters (Chapter 18)

This chapter integrates the concepts from Chapters 1-17 into a complete, functioning local AI cluster that serves production inference workloads.

Final Architecture

The project implements:

Kubernetes cluster with 2+ worker nodes
NVIDIA GPU Operator managing driver lifecycle
Slurm for batch training workloads
MinIO model repository with versioning
NGINX Ingress with health checks
kube-prometheus-stack with DCGM metrics
Distributed inference serving with checkpoint backup

Deployment Sequence

Deploy components in dependency order:

# 1. Kubernetes cluster initialization (Chapter 3)
kubeadm init --control-plane-endpoint "cluster.local:6443"

# 2. NVIDIA GPU Operator (Chapter 10)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# 3. Networking and storage
kubectl apply -f networking.yaml  # CNI configuration
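# The legacy "stable" Helm repository is deprecated; the chart is now published by kubernetes-sigs
helm repo add nfs-ganesha https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/
helm install nfs-server nfs-ganesha/nfs-server-provisioner \
  --namespace storage --create-namespace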

# 4. Slurm for batch workloads (Chapter 11)
apt-get install -y slurmd slurmctld mariadb-server
systemctl enable --now slurmctld slurmd

# 5. Model repository (Chapter 12)
helm install minio minio/minio \
  --namespace model-storage --create-namespace \
  --set persistence.size=1Ti

# 6. Monitoring (Chapter 14)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# 7. Load balancing ingress (Chapter 13)
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress --create-namespace
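
The helm install commands above assume that the chart repositories behind the aliases nvidia, minio, prometheus-community and ingress-nginx have already been added. A minimal sketch of that prerequisite follows; the helper name helm_repo_commands is hypothetical, and the URLs are assumed to be the current upstream locations, so check them against each project's documentation. The function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Print (not run) the `helm repo add` commands for the repository aliases
# used in the deployment sequence.
# Assumption: these URLs are the upstream chart repositories; verify before use.
helm_repo_commands() {
  while read -r repo_alias repo_url; do
    printf 'helm repo add %s %s\n' "$repo_alias" "$repo_url"
  done <<'REPOS'
nvidia https://helm.ngc.nvidia.com/nvidia
minio https://charts.min.io/
prometheus-community https://prometheus-community.github.io/helm-charts
ingress-nginx https://kubernetes.github.io/ingress-nginx
REPOS
  # Refresh the local chart index once all repositories are registered
  printf 'helm repo update\n'
}

helm_repo_commands
```

Once the printed commands look right, piping the output to sh (helm_repo_commands | sh) executes them.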

Inference Serving Deployment

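The inference deployment implements three fault tolerance patterns: pod anti-affinity that spreads the replicas across nodes where possible, a rolling update with maxUnavailable: 0 so serving capacity never drops during an upgrade, and a readiness probe that keeps traffic away from pods that are still loading the model: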

apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-config
data:
  MODEL_PATH: "s3://models/llama-3-8b.Q4_K_M.gguf"
  MAX_BATCH_SIZE: "16"
  CONTEXT_LENGTH: "4096"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
  labels:
    app: llama-inference
    version: v2.2
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference
        version: v2.2
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - llama-inference
              topologyKey: kubernetes.io/hostname
      containers:
      - name: inference
        image: ghcr.io/gventroulingenAI/llama.cpp:v2.2
        ports:
        - containerPort: 8080
          name: inference
        - containerPort: 9090
          name: metrics
        envFrom:
        - configMapRef:
            name: inference-config
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
          failureThreshold: 3
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            nvidia.com/gpu: 1
            memory: 16Gi
        volumeMounts:
        - name: checkpoint
          mountPath: /shared/checkpoints
      volumes:
      - name: checkpoint
        persistentVolumeClaim:
          claimName: checkpoint-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  type: ClusterIP
  selector:
    app: llama-inference
  ports:
  - name: inference
    port: 8080
    targetPort: 8080
  - name: metrics
    port: 9090
    targetPort: 9090
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10G"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
  rules:
  - host: inference.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: inference-service
            port:
              number: 8080
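
One consequence of the rolling update settings is worth working through. With maxUnavailable: 0, old pods are removed only after their replacements are ready, so the Deployment briefly runs replicas + maxSurge pods, each requesting one GPU. The sketch below does that arithmetic; the function name rollout_peak_gpus is hypothetical.

```shell
#!/bin/sh
# rollout_peak_gpus REPLICAS MAX_SURGE GPUS_PER_POD
# Peak GPU demand while a rolling update with maxUnavailable: 0 is in progress.
rollout_peak_gpus() {
  echo $(( ($1 + $2) * $3 ))
}

# Values taken from the manifest above: 3 replicas, maxSurge 1, 1 GPU per pod
echo "steady state:   $(rollout_peak_gpus 3 0 1) GPUs"
echo "during rollout: $(rollout_peak_gpus 3 1 1) GPUs"
```

If the cluster has only three schedulable GPUs, the surge pod stays Pending and the rollout stalls, because GPUs cannot be overcommitted. In that case either keep a fourth GPU free or accept maxUnavailable: 1.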

Verification Tests

Run end-to-end verification:

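# 1. Verify all pods running (lists pods that are neither Running nor Completed; expect no rows)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded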

# 2. Test inference endpoint
curl -X POST http://inference.local/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing", "max_tokens": 100}' \
  -w "\nTime: %{time_total}s\n"

# 3. Verify metrics collection
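# The Ingress routes only port 8080, so reach the metrics port through a port-forward
kubectl port-forward svc/inference-service 9090:9090 &
PF_PID=$!
sleep 2
curl -s http://localhost:9090/metrics | grep inference_request_seconds
kill "$PF_PID"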

# 4. Run Slurm job
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=cluster-verify
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
srun nvidia-smi
EOF
squeue
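
The curl test in step 2 reports the time of a single request, which says little about tail latency. A small sketch of how repeated timings could be reduced to a 95th percentile is shown here, using the nearest-rank method; the function name p95 and the sample values are illustrative, not measurements from this cluster.

```shell
#!/bin/sh
# p95: read one latency in seconds per line on stdin and print the
# nearest-rank 95th percentile. Exits non-zero on empty input.
p95() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) in integer arithmetic
      print v[rank]
    }'
}

# Illustrative sample values (hypothetical)
printf '%s\n' 0.81 1.20 0.95 1.75 0.66 | p95

# Against the real endpoint, something like:
# for i in $(seq 1 50); do
#   curl -s -o /dev/null -w '%{time_total}\n' -X POST http://inference.local/completion \
#     -H 'Content-Type: application/json' \
#     -d '{"prompt": "Explain quantum computing", "max_tokens": 100}'
# done | p95
```

Nearest-rank is chosen because it always returns an observed value; with few samples it is effectively the maximum, so collect at least a few dozen requests before comparing against a latency target.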

Troubleshooting Checklist

When issues occur:

# Core functionality checks
kubectl get nodes -o wide
kubectl get pods -A -o wide
nvidia-smi

# GPU Operator issues
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
kubectl describe daemonset nvidia-driver-daemonset -n gpu-operator

# Inference service issues
kubectl logs -l app=llama-inference --tail=100
kubectl describe deployment llama-inference
kubectl exec -it $(kubectl get pods -l app=llama-inference -o jsonpath='{.items[0].metadata.name}') \
  -- wget -qO- http://localhost:8080/health

# Monitoring gaps
kubectl get svc -n monitoring
kubectl get prometheus -n monitoring
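
A frequent cause of Pending inference pods is a node that does not advertise any nvidia.com/gpu resource. The sketch below filters a two-column node listing for such nodes; the function name nodes_without_gpu and the node names in the sample are hypothetical.

```shell
#!/bin/sh
# nodes_without_gpu: read "NAME GPU_COUNT" rows on stdin and print the names
# of nodes whose GPU column is not a positive integer (for example "<none>").
nodes_without_gpu() {
  awk '$2 !~ /^[1-9][0-9]*$/ { print $1 }'
}

# Sample rows in the shape kubectl prints (hypothetical node names)
printf '%s\n' 'worker-1 1' 'worker-2 <none>' 'cp-1 <none>' | nodes_without_gpu

# Against the real cluster:
# kubectl get nodes --no-headers \
#   -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#   | nodes_without_gpu
```

Control-plane nodes without GPUs are expected to appear in the output. A worker node in the list usually points to a driver or device plugin pod that has not become ready on that node.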

Project Success Criteria

The cluster passes validation when all of the following hold:

All pods in Running state for 24+ hours
Inference latency P95 below 2 seconds for single requests
Slurm job execution with GPU allocation
DCGM metrics visible in Grafana
Model loading from MinIO repository
Zero failed requests during rolling update
Checkpoint save/restore functionality verified
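
The "zero failed requests during rolling update" criterion can be checked by sending requests continuously while the rollout runs and counting every response that is not a 2xx. This is a minimal sketch under that assumption; the function name count_failures and the file name codes.txt are hypothetical.

```shell
#!/bin/sh
# count_failures: read one HTTP status code per line on stdin and print how
# many are not 2xx. curl reports 000 when no response was received at all.
count_failures() {
  awk 'NF && $1 !~ /^2[0-9][0-9]$/ { n++ } END { print n + 0 }'
}

# Illustrative status codes (hypothetical)
printf '%s\n' 200 200 502 000 200 | count_failures

# Against the real endpoint, in one terminal:
# while true; do
#   curl -s -o /dev/null -w '%{http_code}\n' -X POST http://inference.local/completion \
#     -H 'Content-Type: application/json' -d '{"prompt": "ping", "max_tokens": 8}'
# done > codes.txt
# In another terminal:
#   kubectl rollout restart deployment/llama-inference
#   kubectl rollout status deployment/llama-inference
# Then stop the loop with Ctrl-C and run: count_failures < codes.txt
```

A result of 0 satisfies the criterion. Any other value suggests the readiness probe is passing before the model is loaded or that pods are terminated while requests are still in flight.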