RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Local AI Clusters

10. NVIDIA GPU Operator

Chapter 10 of 18 · 20 min
KEY INSIGHT

The GPU Operator centralizes GPU lifecycle management but introduces operator-specific failure modes: driver compilation timeouts on busy nodes, container runtime hook misconfigurations, and the temptation to run mixed driver versions. Pinning the driver version explicitly and pre-installing system drivers often beat automatic reconciliation.

The GPU Operator automates the lifecycle of NVIDIA software components on Kubernetes. It handles drivers, container runtime integration, the device plugin, and monitoring exporters through Operator Framework patterns.

Architecture Overview

The operator installs as a Helm chart from NVIDIA's NGC repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace

Once installed, the operator manages four critical components: nvidia-driver (kernel modules), nvidia-container-toolkit (container runtime integration), nvidia-device-plugin (GPU scheduling), and nvidia-dcgm-exporter (DCGM metrics).
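
A quick way to spot an unhealthy component is to compare desired and ready counts per DaemonSet. The helper below is a minimal sketch and not part of the operator: unready_daemonsets is a hypothetical name, and it assumes the default column order of kubectl get daemonsets --no-headers (NAME, DESIRED, CURRENT, READY, and so on).

```shell
#!/bin/sh
# Hypothetical helper: print the name of every DaemonSet whose READY count
# differs from its DESIRED count. Reads `kubectl get daemonsets --no-headers`
# output on stdin (assumed columns: NAME DESIRED CURRENT READY ...).
unready_daemonsets() {
  awk 'NF >= 4 && $2 != $4 { print $1 }'
}
```

Pipe kubectl get daemonsets -n gpu-operator --no-headers into unready_daemonsets; an empty result means every component reports all of its pods ready.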

Troubleshooting Driver Issues

Common driver failures manifest as nvidia-smi errors or container runtime incompatibilities:

# Check driver pod status
kubectl get pods -n gpu-operator -o wide

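# View driver logs for version mismatches
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=100

# Fallback: if GPU nodes are not detected, deploy Node Feature Discovery first.
# NFD only labels nodes; it does not install drivers. Confirm that this
# manifest path exists for your operator release before applying it.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/v24.9.0/examples/node-feature-discovery.yaml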
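
Since driver failures usually surface as nvidia-smi errors, it can help to map the error text to a likely cause before digging through logs. The sketch below is illustrative only: classify_smi_error is a hypothetical name, the matched strings are common nvidia-smi and container runtime messages, and the suggested causes are rules of thumb rather than a full diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi error text on stdin and print a
# likely cause. The categories are rules of thumb, not a complete diagnosis.
classify_smi_error() {
  msg=$(cat)
  case "$msg" in
    *"Driver/library version mismatch"*)
      echo "version-mismatch: kernel module and user-space libraries differ" ;;
    *"couldn't communicate with the NVIDIA driver"*)
      echo "driver-not-loaded: kernel module is missing or failed to build" ;;
    *"not found"*)
      echo "runtime-hook: nvidia-smi was not injected into the container" ;;
    *)
      echo "unknown: inspect the driver pod logs" ;;
  esac
}
```

For example, piping the output of a failing nvidia-smi 2>&1 into classify_smi_error gives a first hint about whether to look at the driver pod or at the container runtime configuration.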

On some hardware, the operator times out waiting for kernel module compilation. Disabling automatic driver installation and using pre-installed system drivers often resolves this:

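# Confirm the host driver is loaded before installing the operator
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Install without the driver container. driver.repository and driver.version
# only apply when driver.enabled=true, so they are omitted here.
helm install --wait --generate-name nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false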
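
When a host driver is already installed, recording its exact version makes pinning easier later. The sketch below reads the text of /proc/driver/nvidia/version and takes the first dotted number on the NVRM line rather than relying on a fixed field position, which can differ between driver builds. nvrm_version is a hypothetical helper name, and the parsing assumes the usual "NVRM version: ..." first line.

```shell
#!/bin/sh
# Hypothetical helper: extract the driver version from the contents of
# /proc/driver/nvidia/version (read on stdin). Takes the first dotted
# number on the "NVRM version:" line; prints nothing if none is found.
nvrm_version() {
  grep -m1 '^NVRM version:' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}
```

Run it as nvrm_version < /proc/driver/nvidia/version. If the operator manages the driver instead, the same value can be passed as --set driver.version=..., provided a driver image with that tag is published for your OS.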

Container Toolkit Configuration

The operator configures nvidia-container-toolkit to enable GPU access in containers. Verify the runtime configuration:

# Confirm the nvidia runtime is registered (run on the GPU node)
grep -A5 nvidia /etc/containerd/config.toml

# If containerd has not picked up the new entry, restart it
sudo systemctl restart containerd
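
The grep above is fine for eyeballing, but a scripted check is handy in node health probes. The following is a sketch under assumptions: has_nvidia_runtime is a hypothetical name, the default path is the one used in this chapter, and it assumes the runtime is declared as a TOML table whose header ends in runtimes.nvidia], as nvidia-container-toolkit writes it.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if a containerd config file declares
# a runtime table named "nvidia". Sub-tables such as runtimes.nvidia.options
# do not count on their own.
has_nvidia_runtime() {
  config=${1:-/etc/containerd/config.toml}
  grep -Eq '^[[:space:]]*\[.*runtimes\.nvidia\][[:space:]]*$' "$config"
}
```

On a GPU node, has_nvidia_runtime && echo registered prints "registered" only when the entry exists. Distributions that keep the containerd config elsewhere need the path passed as the first argument.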

Running nvidia-smi inside a test pod confirms the setup:

kubectl run cuda-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.1.0-base-ubuntu22.04 \
  -- nvidia-smi

A failure here, most often an error that nvidia-smi cannot be found inside the container, indicates that the runtime hooks are not intercepting container creation.

Resource Limits with Device Plugin

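The device plugin advertises GPUs to the scheduler as the extended resource nvidia.com/gpu. A container receives a GPU only if it declares that resource under resources.limits. The requests field may be omitted, since it defaults to the limit, but if present it must equal the limit: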

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: inference
    image: ghcr.io/gventreulingenai/llama-inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1

Without an explicit nvidia.com/gpu limit, the scheduler treats the pod as needing no GPU, so it may be placed on a node without one and then fail at runtime.
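
Before sizing GPU limits, it is worth checking how many GPUs the device plugin has actually advertised. The sketch below sums a two-column node listing; total_allocatable_gpus is a hypothetical helper name, and the input format is assumed to be NAME followed by a GPU count, with "<none>" for nodes that expose no GPUs.

```shell
#!/bin/sh
# Hypothetical helper: sum the second column of a "NAME GPUS" listing read
# on stdin. Non-numeric values such as "<none>" count as zero.
total_allocatable_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}
```

One way to produce that listing is kubectl get nodes --no-headers -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu', piped into total_allocatable_gpus. A total of 0 on a cluster with GPUs usually means the device plugin is not running or the driver failed to load.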

Upgrading the Operator

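Upgrading the operator itself rarely interrupts running workloads, because its components run as DaemonSets that are replaced independently of application pods. Driver upgrades are different: the kernel module cannot be swapped while GPU workloads are using it, so the node must be drained first, which the operator can handle itself, and some upgrades also need a reboot. Plan upgrades during maintenance windows: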

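kubectl drain gpu-worker-1 --ignore-daemonsets --delete-emptydir-data
# Recreate the driver pod on this node only; the DaemonSet controller replaces it
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName=gpu-worker-1
# Uncordon once the new driver pod reports Running
kubectl uncordon gpu-worker-1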
EXERCISE

Deploy the GPU Operator on a single-node cluster, run the nvidia-smi verification pod, then intentionally misconfigure the container runtime by removing the nvidia runtime entry from /etc/containerd/config.toml and observe the failure mode.
