How to deploy vLLM on Kubernetes with GPU node selection
Prerequisites: a Kubernetes cluster with GPU nodes, and kubectl configured to access it.
What this does
This guide deploys the vLLM inference server as a Kubernetes Deployment pinned to GPU nodes. The pod spec uses a nodeSelector to target GPU-labelled nodes, a toleration for GPU taints, and an explicit nvidia.com/gpu resource request so the scheduler only places the pod on a node with a free GPU. The request reserves a whole GPU but says nothing about its memory, so the GPU must still have enough VRAM for the chosen model. A ClusterIP Service exposes the API inside the cluster; the steps below do not create an Ingress, so external API access requires adding one separately.
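The nodeSelector and toleration only work if the node side matches them. As a minimal sketch, the helper below labels and taints a node so that it lines up with the manifest used in this guide. The function name prepare_gpu_node, the KUBECTL override variable, and the taint value "present" are illustrative assumptions, not part of the original setup; the toleration in this guide matches any value for the nvidia.com/gpu key.

```shell
# Allow the kubectl binary to be overridden, e.g. KUBECTL=echo to preview
# the commands without touching a cluster (hypothetical convenience).
KUBECTL="${KUBECTL:-kubectl}"

# prepare_gpu_node <node-name>
# Labels and taints a node so the Deployment's nodeSelector and toleration apply.
prepare_gpu_node() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: prepare_gpu_node <node-name>" >&2
    return 1
  fi
  # Label matched by the Deployment's nodeSelector (accelerator: nvidia).
  $KUBECTL label nodes "$node" accelerator=nvidia --overwrite
  # Taint tolerated by the Deployment; the value "present" is an arbitrary placeholder.
  $KUBECTL taint nodes "$node" nvidia.com/gpu=present:NoSchedule --overwrite
}
```

Running prepare_gpu_node with KUBECTL=echo prints the two commands for review; with the default, it applies them to the named node. Clusters managed by a GPU operator or a cloud provider often set their own labels and taints, in which case the manifest should be adapted to those instead.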
Steps
Verify GPU nodes are visible and labelled:

    kubectl get nodes -l accelerator=nvidia

Expected output: at least one node with STATUS Ready.

Check GPU capacity on the target node:

    kubectl describe node <gpu-node-name> | grep nvidia.com/gpu

Expected output: nvidia.com/gpu: 1 (or however many GPUs the node has).

Create a PVC (PersistentVolumeClaim) for model storage if not using hostPath, saved as vllm-pvc.yaml. For small-to-medium models, use a PVC backed by a fast SSD storage class:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: vllm-models
    spec:
      accessModes: [ReadOnlyMany]
      resources:
        requests:
          storage: 100Gi
      storageClassName: fast-ssd

ReadOnlyMany only works if the storage class supports it and the volume already contains the model weights; many block-storage classes offer ReadWriteOnce only.

Create the vLLM Deployment manifest vllm-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-inference
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm
      template:
        metadata:
          labels:
            app: vllm
        spec:
          # Schedule only onto nodes labelled accelerator=nvidia.
          nodeSelector:
            accelerator: nvidia
          # Allow scheduling onto nodes tainted for GPU workloads.
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: vllm
              image: vllm/vllm-openai:latest
              ports:
                - containerPort: 8000
              resources:
                # GPU requests and limits must be equal.
                limits:
                  nvidia.com/gpu: 1
                  memory: "32Gi"
                requests:
                  nvidia.com/gpu: 1
                  memory: "16Gi"
              volumeMounts:
                - name: models
                  mountPath: /models
              env:
                - name: NVIDIA_VISIBLE_DEVICES
                  value: "all"
              args:
                - "--model"
                - "/models/Meta-Llama-3-8B-Instruct"
                - "--max-model-len"
                - "8192"
                - "--gpu-memory-utilization"
                - "0.90"
          volumes:
            - name: models
              persistentVolumeClaim:
                claimName: vllm-models

Create a Service (vllm-service.yaml) to expose the deployment internally:

    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-service
    spec:
      selector:
        app: vllm
      ports:
        - port: 8000
          targetPort: 8000
      type: ClusterIP

Apply the manifests:

    kubectl apply -f vllm-pvc.yaml -f vllm-deployment.yaml -f vllm-service.yaml

Monitor the deployment rollout:

    kubectl get pods -l app=vllm -w

Expected output: pod transitions from Pending to ContainerCreating to Running. This can take 2-5 minutes depending on model size.

Verify the vLLM API is responsive:

    kubectl port-forward svc/vllm-service 8000:8000 &
    sleep 2
    curl -i http://localhost:8000/health

The short sleep gives the port-forward time to start listening before curl connects. Expected output: HTTP 200.
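Because the server can take a while to load the model after the pod reaches Running, a single health check may fail even though the deployment is fine. The following is a rough sketch of polling instead of checking once. The function names retry_until and check_vllm_health are hypothetical helpers introduced for illustration; /health and the localhost:8000 port-forward come from the step above.

```shell
# retry_until <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry_until() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only pause when another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# check_vllm_health [base-url]
# Succeeds only when /health answers with a 2xx status.
check_vllm_health() {
  # -f: non-zero exit on HTTP errors; -sS: quiet, but still print real errors.
  curl -fsS -o /dev/null "${1:-http://localhost:8000}/health"
}

# Example usage (assumes the port-forward from the step above is running):
#   kubectl rollout status deployment/vllm-inference --timeout=10m
#   retry_until 60 5 check_vllm_health && echo "vLLM is ready"
```

The attempt count and delay (60 tries, 5 seconds apart) are arbitrary starting points; larger models may need a longer window.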
Verification
kubectl logs deployment/vllm-inference | grep "Uvicorn running"
Expected output: a line containing Uvicorn running on http://0.0.0.0:8000 confirming the server is ready.
Common failures
- Pod stuck in Pending: the GPU node may not have enough free GPU resources. Check with
      kubectl describe node <gpu-node> | grep -A12 "Allocated resources"
  and ensure nvidia.com/gpu has available capacity. The GPU row is listed after the CPU and memory rows, so a short context window can cut it off.
- Model not found in container: confirm the PVC is populated with model weights and mounted correctly. Exec into the pod:
      kubectl exec -it deployment/vllm-inference -- ls /models/
- CUDA error on startup: the NVIDIA Device Plugin may not be running. Verify:
      kubectl get pods -n kube-system | grep nvidia
  If absent, install it:
      kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
- OOMKilled after serving requests: reduce --max-model-len or increase the memory limit. Check the pod's memory usage (requires metrics-server):
      kubectl top pod -l app=vllm
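For the Pending case, reading the GPU row by eye is error-prone on nodes with many resource types. As a sketch, the filter below pulls the requested GPU count out of kubectl describe node output. The function name gpu_requests_on_node is a hypothetical helper, and it assumes the human-readable "Allocated resources" table layout, which is not a stable interface and may differ between kubectl versions.

```shell
# gpu_requests_on_node
# Reads `kubectl describe node` output on stdin and prints the number of GPUs
# currently requested by pods on that node (0 if no GPU row is present).
gpu_requests_on_node() {
  awk '
    # Start looking only once the "Allocated resources" table begins, so the
    # "nvidia.com/gpu:" rows under Capacity/Allocatable are ignored.
    /^Allocated resources:/ { in_alloc = 1; next }
    in_alloc && $1 == "nvidia.com/gpu" { print $2; found = 1; exit }
    END { if (!found) print 0 }
  '
}

# Example usage:
#   kubectl describe node <gpu-node> | gpu_requests_on_node
```

Comparing this number with the node's nvidia.com/gpu capacity shows whether a free GPU remains for the vLLM pod.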