What this does

This guide deploys machine learning inference workloads on a Kubernetes cluster with GPU nodes. It uses Kubernetes GPU scheduling to allocate NVIDIA GPUs to model-serving pods, stores model weights on a persistent volume, and configures horizontal pod autoscaling. The steps below scale on CPU utilization; scaling on request latency would additionally need a custom metrics adapter, which this guide does not cover. The result is a scalable inference deployment that can serve as the base for hosting multiple models and tenants.

Steps

Verify GPU availability. kubectl describe nodes | grep nvidia.com/gpu should show allocatable GPU counts on the GPU nodes.
Create a namespace. Run kubectl create namespace ai-inference.
Build and push the image. Build the inference container image with the model server and push it to the registry: docker build -t registry.example.com/inference-server:v1 . && docker push registry.example.com/inference-server:v1.
Create model storage. Define a PersistentVolumeClaim with 50Gi of storage and the ReadWriteOnce (RWO) access mode in the ai-inference namespace, then run a Job that downloads the model weights into the volume.
Write the deployment manifest. Specify nvidia.com/gpu: 1 in resources.limits, set a CPU request (the autoscaler's CPU target is a percentage of it), mount the model PVC at /models, set OMP_NUM_THREADS=4 to cap CPU threads, and define readiness and liveness probes on the health endpoint.
Apply and expose the deployment. Apply the manifest and expose it through a Service of type ClusterIP on port 8080.
Provide external access. A ClusterIP Service is reachable only from inside the cluster, so install an ingress controller or use kubectl port-forward.
Configure autoscaling. Run kubectl autoscale deployment inference-server --cpu-percent=70 --min=1 --max=10 -n ai-inference.
Deploy monitoring. Install a monitoring stack with Prometheus and Grafana for GPU utilization metrics; GPU metrics typically come from a separate exporter such as NVIDIA's DCGM exporter.
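
The storage, deployment, and Service steps above could be sketched as one manifest file. This is a minimal sketch under assumptions, not a definitive configuration. The image tag, port 8080, the 50Gi RWO claim, the /models mount, and OMP_NUM_THREADS=4 come from this guide. The names model-store and inference-server, the app label, the /health probe path, the probe timings, and the CPU and memory sizes are illustrative placeholders. The script only writes the file; once the values match your cluster, apply it with kubectl apply -f inference.yaml.

```shell
#!/bin/sh
# Write the PVC, Deployment, and Service described in the steps to one file.
# Names, label values, probe path, and request sizes are assumptions.
set -eu

cat > inference.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store            # hypothetical name
  namespace: ai-inference
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference-server
          image: registry.example.com/inference-server:v1
          ports:
            - containerPort: 8080
          env:
            - name: OMP_NUM_THREADS
              value: "4"
          resources:
            requests:
              cpu: "4"          # assumed; a CPU-based HPA needs a CPU request
              memory: 16Gi      # assumed; sized for the 7B example
            limits:
              nvidia.com/gpu: 1
              memory: 16Gi
          volumeMounts:
            - name: models
              mountPath: /models
              readOnly: true
          readinessProbe:
            httpGet:
              path: /health     # assumed path for the health endpoint
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 20
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-store
            readOnly: true
---
apiVersion: v1
kind: Service
metadata:
  name: inference-server
  namespace: ai-inference
spec:
  type: ClusterIP
  selector:
    app: inference-server
  ports:
    - port: 8080
      targetPort: 8080
EOF

echo "wrote inference.yaml"
```

With ReadWriteOnce, replicas scheduled onto a different node cannot mount the volume, so scaling beyond one node needs a storage class that offers ReadOnlyMany or ReadWriteMany.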

Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
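
One way to capture that evidence is a small wrapper that logs each command line together with its output and exit code. This is a sketch, not a required tool: the record helper, the log file name, and the particular commands recorded are all illustrative choices, and the deployment name inference-server is taken from the autoscale command in the steps.

```shell
#!/bin/sh
# Append each command and its output to a timestamped evidence file.
# The file name and the choice of commands are illustrative assumptions.
set -u

EVIDENCE_FILE="${EVIDENCE_FILE:-run-evidence-$(date +%Y%m%d-%H%M%S).log}"

record() {
  # Log the exact command line, then its combined stdout/stderr and exit code.
  printf '$ %s\n' "$*" >> "$EVIDENCE_FILE"
  "$@" >> "$EVIDENCE_FILE" 2>&1
  status=$?
  printf '[exit %s]\n\n' "$status" >> "$EVIDENCE_FILE"
  return 0
}

# Starting state: which binary is active and which versions are in play.
record command -v kubectl
record kubectl version --client
record kubectl config current-context

# Workload state after the smallest complete path has run.
record kubectl get deployment inference-server -n ai-inference -o wide --request-timeout=10s
record kubectl get pods -n ai-inference --request-timeout=10s

echo "evidence saved to $EVIDENCE_FILE"
```

A failing command does not stop the script; its exit code is written to the log so that a broken starting state is itself part of the evidence.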

Verification

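Check pod status. Run kubectl get pods -n ai-inference and confirm the inference pod shows Running with 1/1 ready.
Check GPU allocation. Run kubectl describe pod <pod-name> -n ai-inference | grep nvidia.com/gpu.
Send a test inference request. Run curl -X POST http://<service-ip>:8080/v1/inference -H "Content-Type: application/json" -d '{"prompt": "Hello"}' from a host that can reach the Service (inside the cluster, or against localhost through kubectl port-forward) and verify a 200 response with generated text.
Check resource usage. kubectl top pod -n ai-inference reports CPU and memory only. To confirm GPU memory utilization, run nvidia-smi inside the pod with kubectl exec if the image exposes it, or read the value from the monitoring stack.
Trigger autoscaling. Send concurrent POST requests with hey -n 1000 -c 50 -m POST -T "application/json" -d '{"prompt": "Hello"}' http://<service-ip>:8080/v1/inference and verify that new pods start with kubectl get pods -n ai-inference -w.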
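
The rollout check and the test request above can be combined into one smoke test. The sketch below assumes the Deployment and the Service are both named inference-server (the Service name is not stated in this guide) and reaches the ClusterIP Service through a temporary port-forward. The smoke_test function and the RUN_SMOKE switch are hypothetical names used for illustration.

```shell
#!/bin/sh
# Smoke test for the inference Service through a temporary port-forward.
# Assumes the Deployment and Service are both named inference-server.
set -u

NS="${NS:-ai-inference}"
LOCAL_PORT="${LOCAL_PORT:-8080}"

smoke_test() {
  # Block until the rollout has finished or the timeout expires.
  kubectl rollout status deployment/inference-server -n "$NS" --timeout=180s || return 1

  # ClusterIP is not routable from outside the cluster, so forward a local port.
  kubectl port-forward svc/inference-server "${LOCAL_PORT}:8080" -n "$NS" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2

  # -w prints only the HTTP status; the response body goes to response.json.
  code=$(curl -s -o response.json -w '%{http_code}' \
    -X POST "http://localhost:${LOCAL_PORT}/v1/inference" \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Hello"}')

  # Stop the port-forward whether or not the request succeeded.
  kill "$pf_pid" 2>/dev/null || true

  if [ "$code" = "200" ]; then
    echo "PASS: HTTP 200, body saved to response.json"
  else
    echo "FAIL: HTTP $code"
    return 1
  fi
}

# Run only when asked, so sourcing the file has no side effects.
if [ "${RUN_SMOKE:-0}" = "1" ]; then
  smoke_test
fi
```

Run it with RUN_SMOKE=1 sh smoke.sh, then inspect response.json to confirm the body contains generated text; the status code alone does not prove that.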

Common failures

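GPU not recognized by pod - Verify that the node advertises nvidia.com/gpu under Allocatable and that the NVIDIA device plugin pod is running (commonly in the kube-system namespace, though some installs use their own). The label nvidia.com/gpu.present=true is added by NVIDIA tooling such as the GPU Operator rather than by Kubernetes itself, so it may be absent on clusters that run only the device plugin.
Image pull error - Create registry credentials with kubectl create secret docker-registry in the ai-inference namespace and reference the secret under imagePullSecrets in the deployment's pod spec.
OOM kills on GPU pods - The kernel OOM killer acts on the container's host memory limit, not on GPU memory. Set memory limits appropriately; for a 7B model allow at least 16GB of RAM. Profile host memory with kubectl top pod and GPU memory with nvidia-smi.
PVC stuck in Pending - Verify that the storage class supports the requested size and access mode; check with kubectl describe pvc.
HPA not scaling - Ensure metrics-server is deployed with kubectl get deployment metrics-server -n kube-system, and that the container sets a CPU request, since the autoscaler computes utilization against it.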

Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
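
Most of the failures above can be narrowed down from a handful of read-only queries. The sketch below gathers them in one pass. The triage function, the RUN_TRIAGE switch, and the nvidia name filter for device plugin pods are assumptions for illustration; pod names vary by install method.

```shell
#!/bin/sh
# Collect the state each failure mode above depends on, in one pass.
# The device plugin name filter is an assumption; installs differ.
set -u

NS="${NS:-ai-inference}"

triage() {
  echo "== GPU capacity per node =="
  kubectl describe nodes | grep -i 'nvidia.com/gpu' || echo "no nvidia.com/gpu resources advertised"

  echo "== NVIDIA pods (all namespaces) =="
  kubectl get pods -A | grep -i 'nvidia' || echo "no NVIDIA pods found"

  echo "== PVC status =="
  kubectl get pvc -n "$NS"

  echo "== HPA and metrics-server =="
  kubectl get hpa -n "$NS"
  kubectl get deployment metrics-server -n kube-system || echo "metrics-server not found"

  echo "== Recent events (image pulls, OOM kills, scheduling) =="
  kubectl get events -n "$NS" --sort-by=.lastTimestamp | tail -n 20
}

# Run only when asked, so sourcing the file has no side effects.
if [ "${RUN_TRIAGE:-0}" = "1" ]; then
  triage
fi
```

Every command here only reads cluster state, so the script is safe to run repeatedly while changing one setting at a time.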

Related guides

setup-auto-scaling-llm-inference
build-multi-tenant-ai-serving
setup-model-versioning-rollback