NVIDIA GPU Operator — Local AI Clusters (Chapter 10)

The GPU Operator automates the lifecycle of the NVIDIA software stack on Kubernetes. Following the operator pattern, it deploys and continuously reconciles the driver, the container toolkit, the device plugin, and the monitoring exporters.

Architecture Overview

The operator is installed from NVIDIA's Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace

The chart deploys these critical components: nvidia-driver, nvidia-container-toolkit, nvidia-device-plugin, and nvidia-dcgm-exporter, which exports GPU telemetry from DCGM (Data Center GPU Manager) in Prometheus format.

Troubleshooting Driver Issues

Common driver failures manifest as nvidia-smi errors or container runtime incompatibilities:

# Check driver pod status
kubectl get pods -n gpu-operator -o wide

# View driver logs for version mismatches (driver pods carry the app=nvidia-driver-daemonset label)
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --all-containers --tail=100

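# Fallback: NodeFeatureDiscovery does not install drivers; it labels GPU nodes so the driver DaemonSet can target them. Deploy it first if those labels are missing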
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/v24.9.0/examples/node-feature-discovery.yaml
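
When many operand pods are listed, it can help to filter the output down to the ones that need attention. The following is a minimal sketch, not part of the operator tooling: the function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...).

```shell
# Read `kubectl get pods --no-headers` rows on stdin and print every pod
# whose STATUS column (field 3) is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 ": " $3 }'
}

# Usage against a live cluster (left commented so the sketch runs without one):
# kubectl get pods -n gpu-operator --no-headers | unhealthy_pods
```

Completed is treated as healthy because one-shot validation pods are expected to exit after they succeed.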

On some hardware, the operator times out while the driver container compiles kernel modules. If the nodes already have a working system driver, disabling the operator-managed driver often resolves this. With driver.enabled=false the chart skips the driver DaemonSet entirely, so driver.repository and driver.version have no effect and are omitted:

helm install --wait --generate-name nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false
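
When relying on pre-installed drivers, it is useful to record the exact host driver version, for example to compare it against the CUDA images in use. The sketch below reads it from /proc/driver/nvidia/version by scanning the NVRM line for a version-shaped field instead of relying on a fixed column position. The function name nvidia_driver_version is hypothetical.

```shell
# Print the NVIDIA driver version from a /proc/driver/nvidia/version-style file.
# Only the "NVRM version:" line is inspected, and the first field that looks
# like a version number (for example 535.104.05) is printed.
nvidia_driver_version() {
  awk '/^NVRM version:/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
       }' "${1:-/proc/driver/nvidia/version}"
}

# Usage on a GPU node (reads the default path when no argument is given):
# nvidia_driver_version
```

Run this on the GPU node itself, not on the workstation that runs helm, since /proc/driver/nvidia/version exists only where the kernel module is loaded.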

Container Toolkit Configuration

The operator configures nvidia-container-toolkit to enable GPU access in containers. Verify the runtime configuration:

# On the GPU node: confirm the nvidia runtime is registered
grep -A5 nvidia /etc/containerd/config.toml

# The toolkit pod normally restarts containerd itself; restart manually only if the entry is missing or stale
sudo systemctl restart containerd

Running nvidia-smi inside a test pod confirms the setup:

kubectl run cuda-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.1.0-base-ubuntu22.04 \
  -- nvidia-smi

Failure here usually means the NVIDIA runtime hook is not intercepting container creation, so the driver libraries and device nodes are never injected into the container.

Resource Limits with Device Plugin

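The device plugin advertises GPUs to the scheduler as the extended resource nvidia.com/gpu. A pod claims GPUs by listing that resource under limits; requests are optional, but if present they must equal the limits: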

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: inference
    image: ghcr.io/gventreulingenai/llama-inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1

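Without an nvidia.com/gpu limit, the scheduler treats the pod as a CPU-only workload: it may land on a node with no GPU, and the device plugin allocates no device to it, so GPU code typically fails at runtime.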
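
Before scheduling GPU pods, it is worth confirming that each node actually advertises nvidia.com/gpu. The sketch below filters a two-column node listing for nodes with no allocatable GPU. The function name nodes_without_gpu is hypothetical, and the filter assumes the NODE/GPUS layout produced by the custom-columns command shown in the usage comment.

```shell
# Read "NODE GPUS" rows on stdin (header on line 1) and print the nodes
# whose allocatable nvidia.com/gpu count is missing (<none>) or zero.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Usage against a live cluster (left commented so the sketch runs without one):
# kubectl get nodes \
#   -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
#   | nodes_without_gpu
```

A GPU node that appears in this output suggests the device plugin pod on that node is not running or has not registered with the kubelet.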

Upgrading the Operator

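Upgrading the operator itself rarely interrupts running workloads, because the operands, including the driver, run as DaemonSet pods that are separate from the operator's controller pod. Driver upgrades are more disruptive: the old kernel modules must be unloaded, which requires stopping every GPU workload on the node. The operator handles this by cordoning the node and evicting GPU pods before it restarts the driver pod; a reboot is normally not needed. Plan upgrades during maintenance windows. To step through a single node manually: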

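# Evict workloads from the node (drain and uncordon take the node name directly)
kubectl drain gpu-worker-1 --ignore-daemonsets --delete-emptydir-data
# Restart only the driver pod on that node; the DaemonSet recreates it
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset --field-selector spec.nodeName=gpu-worker-1
kubectl rollout status daemonset/nvidia-driver-daemonset -n gpu-operator
kubectl uncordon gpu-worker-1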