RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.coIndependently operated
Local AI Clusters

18. Local AI Cluster Project

Chapter 18 of 18 · 25 min
KEY INSIGHT

Every local AI cluster is a living system that needs ongoing attention: drivers need updating, models get rotated, and monitoring gaps appear. The patterns established across these chapters (GPU Operator management, Slurm scheduling, repository versioning, load balancing, and fault tolerance) turn individual commands into an integrated, maintainable platform.

This chapter integrates concepts from chapters 1-17 into a complete, functioning local AI cluster serving production inference workloads.

Final Architecture

The project implements:

  • Kubernetes cluster with 2+ worker nodes
  • NVIDIA GPU Operator managing driver lifecycle
  • Slurm for batch training workloads
  • MinIO model repository with versioning
  • NGINX Ingress with health checks
  • kube-prometheus-stack with DCGM metrics
  • Distributed inference serving with checkpoint backup

Deployment Sequence

Deploy components in dependency order:

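# 0. Helm chart repositories used by the steps below
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo add stable https://charts.helm.sh/stable  # archived repository
helm repo add minio https://charts.min.io/
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# 1. Kubernetes cluster initialization (Chapter 3)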
kubeadm init --control-plane-endpoint "cluster.local:6443"

# 2. NVIDIA GPU Operator (Chapter 10)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# 3. Networking and storage
kubectl apply -f networking.yaml  # CNI configuration
helm install nfs-server stable/nfs-server-provisioner \
  --namespace storage --create-namespace

# 4. Slurm for batch workloads (Chapter 11)
apt-get install -y slurmd slurmctld mariadb-server
systemctl enable --now slurmctld slurmd

# 5. Model repository (Chapter 12)
helm install minio minio/minio \
  --namespace model-storage --create-namespace \
  --set persistence.size=1Ti

# 6. Monitoring (Chapter 14)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# 7. Load balancing ingress (Chapter 13)
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress --create-namespace
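
Dependency order only helps if each step has settled before the next one begins. A minimal sketch of a gate between steps, assuming the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...); the function name `unready_pods` is hypothetical and not part of the course material:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods -A --no-headers` on stdin and
# prints namespace/name for every pod in the given namespace that is not
# fully ready. Completed pods (finished Jobs) are ignored.
unready_pods() {
  local ns="$1"
  awk -v ns="$ns" '
    $1 == ns && $4 != "Completed" {
      split($3, ready, "/")   # READY column, for example "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2
    }'
}
```

For example, `kubectl get pods -A --no-headers | unready_pods gpu-operator` should print nothing before step 3 starts.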

Inference Serving Deployment

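The inference deployment applies three fault-tolerance patterns: three replicas that pod anti-affinity prefers to place on different nodes, a rolling update that never drops below full capacity (maxSurge: 1, maxUnavailable: 0), and a readiness probe that keeps traffic away from pods that have not finished loading the model: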

apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-config
data:
  MODEL_PATH: "s3://models/llama-3-8b.Q4_K_M.gguf"
  MAX_BATCH_SIZE: "16"
  CONTEXT_LENGTH: "4096"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
  labels:
    app: llama-inference
    version: v2.2
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference
        version: v2.2
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - llama-inference
              topologyKey: kubernetes.io/hostname
      containers:
      - name: inference
        image: ghcr.io/gventroulingenAI/llama.cpp:v2.2
        ports:
        - containerPort: 8080
          name: inference
        - containerPort: 9090
          name: metrics
        envFrom:
        - configMapRef:
            name: inference-config
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
          failureThreshold: 3
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            nvidia.com/gpu: 1
            memory: 16Gi
        volumeMounts:
        - name: checkpoint
          mountPath: /shared/checkpoints
      volumes:
      - name: checkpoint
        persistentVolumeClaim:
          claimName: checkpoint-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  type: ClusterIP
  selector:
    app: llama-inference
  ports:
  - name: inference
    port: 8080
    targetPort: 8080
  - name: metrics
    port: 9090
    targetPort: 9090
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10G"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
  ingressClassName: nginx
  rules:
  - host: inference.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: inference-service
            port:
              number: 8080
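
The rolling-update settings in this manifest have a capacity consequence. With `maxUnavailable: 0`, Kubernetes must start the surge pod before it removes an old one, so a rollout needs GPUs for `replicas + maxSurge` pods at the same time. A small sketch of that arithmetic (the function name `peak_gpus` is hypothetical):

```shell
# Hypothetical helper: GPUs needed at the peak of a rolling update.
# Arguments: replicas, maxSurge (as an absolute number), GPUs requested per pod.
peak_gpus() {
  local replicas="$1" max_surge="$2" gpus_per_pod="$3"
  echo $(( (replicas + max_surge) * gpus_per_pod ))
}

peak_gpus 3 1 1   # values from the Deployment above; prints 4
```

On a cluster with exactly three GPUs, the surge pod would stay Pending and the rollout would stall, so plan for a fourth GPU or relax `maxUnavailable`.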

Verification Tests

Run end-to-end verification:

# 1. List pods that are neither Running nor Succeeded (expect none)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# 2. Test inference endpoint
curl -X POST http://inference.local/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing", "max_tokens": 100}' \
  -w "\nTime: %{time_total}s\n"

# 3. Verify metrics collection
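# Port 9090 is not routed through the Ingress, so forward the Service port locally
kubectl port-forward svc/inference-service 9090:9090 &
PF_PID=$!
sleep 2
curl -s http://localhost:9090/metrics | grep inference_request_seconds
kill "$PF_PID"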

# 4. Run Slurm job
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=cluster-verify
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
srun nvidia-smi
EOF
squeue

Troubleshooting Checklist

When issues occur:

# Core functionality checks
kubectl get nodes -o wide
kubectl get pods -A -o wide
nvidia-smi

# GPU Operator issues
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
kubectl describe daemonset nvidia-driver-daemonset -n gpu-operator

# Inference service issues
kubectl logs -l app=llama-inference --tail=100
kubectl describe deployment llama-inference
kubectl exec -it $(kubectl get pods -l app=llama-inference -o jsonpath='{.items[0].metadata.name}') \
  -- wget -qO- http://localhost:8080/health

# Monitoring gaps
kubectl get svc -n monitoring
kubectl get prometheus -n monitoring
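
When inference pods sit in Pending, one common cause is that nodes are not advertising GPUs to the scheduler. A sketch that sums the allocatable `nvidia.com/gpu` count across nodes; the helper name `total_gpus` is hypothetical, and it assumes two-column input such as `kubectl get nodes --no-headers -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'` produces:

```shell
# Hypothetical helper: sums the second column (GPU count per node) from stdin.
# Non-numeric entries such as "<none>" (nodes exposing no GPU) count as zero.
total_gpus() {
  awk '$2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}
```

A total lower than the number of GPUs physically installed points at the GPU Operator checks above rather than at the inference deployment.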

Project Success Criteria

The cluster passes validation when all of the following hold:

  • All pods in Running state for 24+ hours
  • Inference latency P95 below 2 seconds for single requests
  • Slurm job execution with GPU allocation
  • DCGM metrics visible in Grafana
  • Model loading from MinIO repository
  • Zero failed requests during rolling update
  • Checkpoint save/restore functionality verified
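
The P95 criterion can be checked from raw request timings. A minimal sketch using the nearest-rank method, which is only one of several percentile definitions; the helper name `p95` is hypothetical, and it expects one latency in seconds per line:

```shell
# Hypothetical helper: nearest-rank 95th percentile of numbers read from stdin.
# Example input source (not run here):
#   curl -s -o /dev/null -w '%{time_total}\n' -X POST http://inference.local/completion ...
p95() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1             # no samples: fail rather than print
      k = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR)
      print v[k]
    }'
}
```

With only a handful of samples the nearest-rank P95 is simply the slowest request, so collect at least 20 timings before comparing the result against the 2 second target.
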
EXERCISE

Deploy the complete inference infrastructure from this chapter on a development cluster, successfully run 10 inference requests, verify metrics appear in Grafana, then deliberately fail one replica to demonstrate automatic failover recovery.
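
For the failover step, it helps to record the HTTP status of every request sent while a replica is being deleted. A sketch that tallies the outcome, assuming one status code per line as produced by `curl -s -o /dev/null -w '%{http_code}\n'`; the helper name `count_failed` is hypothetical:

```shell
# Hypothetical helper: counts responses outside the 2xx range from stdin.
# curl reports 000 when no HTTP response arrived, which also counts as failed.
count_failed() {
  awk '$1 !~ /^2[0-9][0-9]$/ { failed++ } END { print failed + 0 }'
}
```

If the remaining replicas absorb the traffic, the count should stay at zero, apart from any request that was in flight on the deleted pod.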
