What this does

This guide deploys the vLLM inference server as a Kubernetes Deployment with a node selector, a toleration, and GPU resource requests. The Deployment uses a nodeSelector to target GPU-labelled nodes, a toleration for the GPU taint, and an explicit nvidia.com/gpu request so the scheduler places the pod only on a node with a free GPU. The request counts whole GPUs and does not check VRAM, so confirm separately that the GPU has enough memory for the model. A ClusterIP Service exposes the API inside the cluster; an Ingress can be placed in front of it for external access.

Steps

Verify GPU nodes are visible and labelled:
```
kubectl get nodes -l accelerator=nvidia
```
Expected output: at least one node with STATUS Ready.

Check GPU capacity on the target node:
```
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
```
Expected output: nvidia.com/gpu: 1 (or however many GPUs the node has).
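
To compare GPU capacity across every labelled node at once, a helper along these lines may be convenient. It is a sketch that assumes the accelerator=nvidia label used above; the function name gpu_nodes is made up for illustration.

```shell
# List each GPU-labelled node with its allocatable GPU count.
# gpu_nodes is a hypothetical helper name; the selector defaults to
# the label used in this guide and can be overridden as $1.
gpu_nodes() {
  selector="${1:-accelerator=nvidia}"
  # The dot inside the resource name nvidia.com/gpu must be escaped
  # in the custom-columns path.
  kubectl get nodes -l "$selector" \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Run gpu_nodes with no arguments, or pass a different selector such as gpu_nodes gpu=true. A GPUS value of <none> suggests the node is not advertising any GPUs to the scheduler.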

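Create a PersistentVolumeClaim (PVC) for model storage if you are not using hostPath, and save it as vllm-pvc.yaml. For small-to-medium models, use a PVC backed by a fast SSD storage class. The fast-ssd name below is an example; substitute a storage class that exists in your cluster (kubectl get storageclass lists them):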

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes: [ReadOnlyMany]
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
```

Create the vLLM Deployment manifest vllm-deployment.yaml:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        accelerator: nvidia
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "16Gi"
          volumeMounts:
            - name: models
              mountPath: /models
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
          args:
            - "--model"
            - "/models/Meta-Llama-3-8B-Instruct"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.90"
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: vllm-models
```

Create a Service to expose the Deployment inside the cluster and save it as vllm-service.yaml:

```
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
```
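
For access from outside the cluster, one option is to put an Ingress in front of vllm-service. The sketch below uses kubectl create ingress. It assumes an ingress controller is already installed and that its IngressClass is named nginx. The hostname vllm.example.com and the function name create_vllm_ingress are placeholders, not part of the original setup.

```shell
# Sketch: route a hostname to vllm-service on port 8000.
# Assumptions (not from this guide): an ingress controller is installed,
# its IngressClass is "nginx", and vllm.example.com is a placeholder host.
create_vllm_ingress() {
  host="${1:-vllm.example.com}"
  # A path ending in /* creates a Prefix rule covering every API route.
  kubectl create ingress vllm-ingress \
    --class=nginx \
    --rule="${host}/*=vllm-service:8000"
}
```

Call create_vllm_ingress with your own hostname as the first argument. Run it after the Service exists, and consider adding TLS and authentication before exposing an inference endpoint publicly.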

Apply the manifests:

```
kubectl apply -f vllm-pvc.yaml -f vllm-deployment.yaml -f vllm-service.yaml
```

Monitor the deployment rollout:
```
kubectl get pods -l app=vllm -w
```
Expected output: pod transitions from Pending to ContainerCreating to Running. This can take 2-5 minutes depending on model size.
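
In a script, watching pods interactively is awkward; blocking on the rollout is usually easier. The following is a minimal sketch using kubectl rollout status. The function name wait_for_vllm and the 10-minute default are assumptions chosen for illustration.

```shell
# Block until the vllm-inference Deployment reports a complete rollout,
# or exit non-zero once the time limit passes.
# wait_for_vllm is a hypothetical helper; the default limit is an assumption.
wait_for_vllm() {
  limit="${1:-10m}"
  kubectl rollout status deployment/vllm-inference --timeout="$limit"
}
```

Because the manifest above defines no readiness probe, the rollout can report success as soon as the container starts, before the model has finished loading. Treat it as a scheduling check and still confirm the API responds.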

Verify the vLLM API is responsive:

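```
kubectl port-forward svc/vllm-service 8000:8000 &
sleep 2  # give the port-forward a moment to start listening
curl -i http://localhost:8000/health
```

Expected output: a response beginning with HTTP/1.1 200 OK. The response body is empty, so without -i curl prints nothing on success.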
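
A health check only shows that the server process is up. To confirm the model actually generates text, a small request to the OpenAI-compatible completions endpoint can be sent through the same port-forward. This sketch assumes the server registers the model under the --model path from the Deployment; the function name vllm_complete and the max_tokens value are illustrative.

```shell
# Send a short completion request to the port-forwarded vLLM server.
# Assumption: the model name equals the --model path in the Deployment.
# vllm_complete is a hypothetical helper. The prompt is pasted into the
# JSON as-is, so keep it free of double quotes and backslashes.
vllm_complete() {
  prompt="${1:-Hello}"
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"/models/Meta-Llama-3-8B-Instruct\", \"prompt\": \"${prompt}\", \"max_tokens\": 16}"
}
```

For example, vllm_complete "The capital of France is" should return a JSON document containing a choices array. If the server rejects the model name, curl http://localhost:8000/v1/models lists the names it accepts.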

Verification

```
kubectl logs deployment/vllm-inference | grep "Uvicorn running"
```

Expected output: a line containing Uvicorn running on http://0.0.0.0:8000 confirming the server is ready.

Common failures

Pod stuck in Pending: the GPU node may not have a free GPU, or the node label or taint may not match the manifest. Check with kubectl describe node <gpu-node> | grep -A5 "Allocated resources" and ensure nvidia.com/gpu has available capacity. The Events section of kubectl describe pod -l app=vllm states why scheduling failed.

Model not found in container: confirm the PVC is populated with model weights and mounted correctly. Exec into the pod: kubectl exec -it deployment/vllm-inference -- ls /models/.

CUDA error on startup: the NVIDIA Device Plugin may not be running. Verify: kubectl get pods -n kube-system | grep nvidia. If absent, install it: kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml.

OOMKilled after serving requests: increase the container memory limit (32Gi in the manifest above) or reduce --max-model-len. Check the pod's memory usage with kubectl top pod -l app=vllm, which requires metrics-server to be installed in the cluster.
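
When it is not yet clear which of the failures above applies, gathering pod state, recent events, and the previous container's logs in one pass can narrow it down. The helper below is a sketch; the name vllm_triage and the line counts are arbitrary choices.

```shell
# Print a quick diagnostic summary for the vLLM Deployment.
# vllm_triage is a hypothetical helper name; line counts are arbitrary.
vllm_triage() {
  echo "== pods =="
  kubectl get pods -l app=vllm -o wide
  echo "== recent events =="
  # Events are sorted oldest first, so tail shows the newest entries.
  kubectl get events --sort-by=.lastTimestamp | tail -n 20
  echo "== previous container logs =="
  # --previous fails if the container has never restarted.
  kubectl logs deployment/vllm-inference --previous --tail=50 \
    || echo "(no previous container logs available)"
}
```

The previous-container logs are the most useful part after an OOMKilled or CUDA crash, because the current container's logs start fresh after each restart.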

How to deploy vLLM on Kubernetes with GPU node selection