10. NVIDIA GPU Operator
The GPU Operator automates the lifecycle of NVIDIA software components on Kubernetes, managing drivers, the container toolkit, device plugins, and monitoring exporters through Operator Framework patterns.
Architecture Overview
The operator is installed as a Helm chart from NVIDIA's repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
The critical components it deploys are nvidia-driver, nvidia-container-toolkit, nvidia-device-plugin, and nvidia-dcgm-exporter, which exposes DCGM metrics for monitoring.
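As a quick check that the components listed above are actually running, the pod names in the operator namespace can be compared against the expected component names. The following is a minimal sketch, not part of the operator's tooling: missing_components is a hypothetical helper, and it assumes that each pod name contains the component name given above (the exact suffixes vary by release).

```shell
#!/bin/sh
# Sketch: print each expected GPU Operator component that has no pod.
# Reads pod names (one per line) on stdin, for example from:
#   kubectl get pods -n gpu-operator -o name
# Assumption: pod names contain the component names listed in the text.
missing_components() {
    pods=$(cat)
    for component in nvidia-driver nvidia-container-toolkit \
                     nvidia-device-plugin nvidia-dcgm-exporter; do
        # grep -q exits non-zero when no pod name contains the component
        if ! printf '%s\n' "$pods" | grep -q -- "$component"; then
            echo "$component"
        fi
    done
}
```

Piping the kubectl command from the comment into missing_components prints nothing when all four components have at least one pod.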
Troubleshooting Driver Issues
Common driver failures manifest as nvidia-smi errors or container runtime incompatibilities:
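# Check driver pod status
kubectl get pods -n gpu-operator -o wide
# View driver logs for version mismatches
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=100
# Node Feature Discovery does not install drivers; it labels GPU nodes so the
# driver pods are scheduled on them. Check for the NVIDIA PCI label (vendor 10de):
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
# Fallback if no node carries the label: deploy Node Feature Discovery first
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/v24.9.0/examples/node-feature-discovery.yaml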
On some hardware, the operator times out waiting for kernel module compilation. If the NVIDIA driver is already installed on the host, disabling the operator-managed driver often resolves this. With driver.enabled=false the operator uses the host driver as-is, so driver.repository and driver.version have no effect and are omitted:
helm install --wait --generate-name nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false
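Before relying on a pre-installed driver, it helps to confirm that one is actually loaded on the host and to note its version. The following is a minimal sketch under stated assumptions: host_driver_version is a hypothetical helper that reads /proc/driver/nvidia/version and prints the first version-like field of the NVRM line instead of a fixed column, on the assumption that the column position can differ between driver builds.

```shell
#!/bin/sh
# Sketch: print the driver version found on the NVRM line of
# /proc/driver/nvidia/version (or of a file with the same layout).
# Scans for the first field shaped like 535.104.05 rather than
# trusting a fixed column number.
host_driver_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
        }
    }' "${1:-/proc/driver/nvidia/version}"
}
```

Called with no argument, host_driver_version reads the live /proc file. An error or empty output suggests that no host driver is loaded, in which case disabling the operator-managed driver would leave the node without a working driver.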
Container Toolkit Configuration
The operator configures nvidia-container-toolkit to enable GPU access in containers. Verify the runtime configuration on the GPU node itself:
# Confirm the nvidia runtime is registered with containerd
grep -A5 nvidia /etc/containerd/config.toml
# The toolkit pod normally restarts containerd itself; restart manually only if the entry exists but has not taken effect
systemctl restart containerd
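The grep above needs a human to read its output. For scripted checks, a small test with an exit status is easier to use. The sketch below is illustrative only: has_nvidia_runtime is a hypothetical helper, and it assumes the runtime is declared as a TOML table header ending in ".runtimes.nvidia]" (the plugin path in front of it varies between containerd versions). The config path is an argument because the file location may differ on some installations.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) if a containerd config declares a runtime
# named "nvidia", fail otherwise.
# Assumption: the runtime is a TOML table whose header ends in
# ".runtimes.nvidia]"; nested tables such as "...nvidia.options]"
# are deliberately not counted.
has_nvidia_runtime() {
    config=${1:-/etc/containerd/config.toml}
    grep -Eq '^[[:space:]]*\[.*\.runtimes\.nvidia\][[:space:]]*$' "$config"
}
```

For example, `has_nvidia_runtime || echo "nvidia runtime missing"` reports a node where the toolkit has not written its configuration.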
Running nvidia-smi inside a test pod confirms the setup:
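kubectl run cuda-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.1.0-base-ubuntu22.04 \
  -- nvidia-smi
Failure here usually indicates that the NVIDIA runtime is not being invoked at container creation, so the driver libraries and device nodes are never injected into the container.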
Resource Limits with Device Plugin
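The device plugin advertises each node's GPUs to the scheduler as the extended resource nvidia.com/gpu. Pods claim GPUs under resources.limits; a matching requests entry is optional, but if present it must equal the limit: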
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: inference
    image: ghcr.io/gventreulingenai/llama-inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
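Without an explicit nvidia.com/gpu limit, the scheduler treats the pod as an ordinary workload: it can be placed on a node that has no GPU and is not allocated a device, so GPU code typically fails at runtime.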
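One way to confirm that the device plugin has registered GPUs before scheduling a workload is to read the node's allocatable resources. The sketch below assumes kubectl is already configured for the cluster; allocatable_gpus is an illustrative helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch: print how many GPUs a node advertises as allocatable.
# Prints 0 when the node does not expose nvidia.com/gpu at all and
# returns non-zero when kubectl itself fails (for example, unknown node).
allocatable_gpus() {
    # Dots inside the resource name are escaped for kubectl's JSONPath.
    count=$(kubectl get node "$1" \
        -o jsonpath='{.status.allocatable.nvidia\.com/gpu}') || return 1
    echo "${count:-0}"
}
```

For instance, `allocatable_gpus gpu-worker-1` (an example node name) printing 0 on a node that physically has a GPU suggests the device plugin pod on that node is not healthy.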
Upgrading the Operator
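Rolling upgrades of the operator rarely interrupt running workloads, since the driver pods run as a DaemonSet. However, a driver upgrade must unload and reload the kernel module, which requires every GPU workload on the node to stop first (a full reboot is normally not needed). The operator handles this by cordoning the node and evicting GPU pods before it restarts the driver pod. Plan upgrades during maintenance windows; the manual equivalent for a single node is: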
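kubectl drain gpu-worker-1 --ignore-daemonsets --delete-emptydir-data
# Restart only this node's driver pod (rollout restart would cycle every node)
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset --field-selector spec.nodeName=gpu-worker-1
kubectl rollout status daemonset/nvidia-driver-daemonset -n gpu-operator
kubectl uncordon gpu-worker-1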
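When several GPU nodes need the same treatment, the per-node sequence can be wrapped in a loop that handles one node at a time. This is a sketch under assumptions rather than the operator's own upgrade logic: rolling_driver_restart is a hypothetical helper, and the label app=nvidia-driver-daemonset, the DaemonSet name, and the 15-minute timeout are assumed values that should be checked against the cluster.

```shell
#!/bin/sh
# Sketch: restart the driver pod on each given node, one node at a time.
# Assumptions: driver pods live in the gpu-operator namespace, carry the
# label app=nvidia-driver-daemonset, and belong to a DaemonSet named
# nvidia-driver-daemonset.
rolling_driver_restart() {
    for node in "$@"; do
        echo "== $node"
        kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
        # Delete only this node's driver pod; the DaemonSet recreates it.
        kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset \
            --field-selector "spec.nodeName=$node" || return 1
        # Block until every driver pod in the DaemonSet is available again.
        kubectl rollout status daemonset/nvidia-driver-daemonset \
            -n gpu-operator --timeout=15m || return 1
        kubectl uncordon "$node" || return 1
    done
}
```

The function stops at the first failing step and deliberately leaves that node cordoned, so a broken driver rollout does not spread to the remaining nodes. A call such as `rolling_driver_restart gpu-worker-1 gpu-worker-2` processes the nodes in the order given.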
Deploy the GPU Operator on a single-node cluster, run the nvidia-smi verification pod, then intentionally misconfigure the container runtime by removing the nvidia runtime entry from /etc/containerd/config.toml and observe the failure mode.