HOW-TO · OPS

How to deploy vLLM on Kubernetes with GPU node selection

advanced·35 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Kubernetes · vLLM (vllm/vllm-openai image)
PREREQUISITES

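A Kubernetes cluster with at least one NVIDIA GPU node, the NVIDIA device plugin installed so that nodes advertise nvidia.com/gpu, and kubectl configured against that cluster.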
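
A short preflight check can confirm these prerequisites before you start. This is a sketch, not part of the guide's required steps: the function name preflight is hypothetical, and it only checks that kubectl can reach the cluster and that your account may create the three resource kinds used below.

```shell
# Preflight sketch: fail early if kubectl cannot reach the cluster
# or the current account cannot create the resources this guide uses.
# "preflight" is a hypothetical helper name, not a kubectl feature.
preflight() {
  kubectl cluster-info >/dev/null || { echo "cannot reach cluster" >&2; return 1; }
  for resource in deployments services persistentvolumeclaims; do
    # "kubectl auth can-i" prints "yes" or "no" for the current context.
    if [ "$(kubectl auth can-i create "$resource")" != "yes" ]; then
      echo "missing permission: create $resource" >&2
      return 1
    fi
  done
  echo "preflight ok"
}
```

Run preflight in the shell you will use for the remaining steps; any message other than "preflight ok" points at a kubeconfig or RBAC problem to fix first.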

What this does

This guide deploys the vLLM inference server as a Kubernetes Deployment pinned to GPU nodes. The pod spec uses a nodeSelector to target GPU-labelled nodes, a toleration for the nvidia.com/gpu taint, and an explicit nvidia.com/gpu resource request so the scheduler only places the pod on a node with a free GPU. The request counts whole GPUs, not VRAM, so confirm separately that the card has enough memory for the model. A ClusterIP Service exposes the API inside the cluster; the steps below reach it with kubectl port-forward and do not configure an Ingress.

Steps

  1. Verify GPU nodes are visible and labelled:

    kubectl get nodes -l accelerator=nvidia
    

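    Expected output: at least one node with STATUS Ready. If the list is empty, label a GPU node first with kubectl label node <gpu-node-name> accelerator=nvidia. The accelerator=nvidia label is this guide's convention; Kubernetes does not apply it automatically.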

  2. Check GPU capacity on the target node:

    kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
    

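    Expected output: nvidia.com/gpu: 1 (or however many GPUs the node has), listed under both Capacity and Allocatable. A missing entry means the NVIDIA device plugin is not advertising GPUs on that node.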

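  3. Create a PVC (PersistentVolumeClaim) for model storage if not using hostPath, and save it as vllm-pvc.yaml. Use a storage class backed by fast SSD so weights load quickly. The name fast-ssd below is a placeholder for a class that exists in your cluster (list them with kubectl get storageclass), and ReadOnlyMany assumes the volume is already populated with model weights: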

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: vllm-models
    spec:
      accessModes: [ReadOnlyMany]
      resources:
        requests:
          storage: 100Gi
      storageClassName: fast-ssd
    
  4. Create the vLLM Deployment manifest vllm-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-inference
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm
      template:
        metadata:
          labels:
            app: vllm
        spec:
          nodeSelector:
            accelerator: nvidia
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: vllm
              image: vllm/vllm-openai:latest
              ports:
                - containerPort: 8000
              resources:
                limits:
                  nvidia.com/gpu: 1
                  memory: "32Gi"
                requests:
                  nvidia.com/gpu: 1
                  memory: "16Gi"
              volumeMounts:
                - name: models
                  mountPath: /models
              env:
                - name: NVIDIA_VISIBLE_DEVICES
                  value: "all"
              args:
                - "--model"
                - "/models/Meta-Llama-3-8B-Instruct"
                - "--max-model-len"
                - "8192"
                - "--gpu-memory-utilization"
                - "0.90"
          volumes:
            - name: models
              persistentVolumeClaim:
                claimName: vllm-models
    
  5. Create a Service to expose the deployment inside the cluster, saved as vllm-service.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-service
    spec:
      selector:
        app: vllm
      ports:
        - port: 8000
          targetPort: 8000
      type: ClusterIP
    
  6. Apply the manifests:

    kubectl apply -f vllm-pvc.yaml -f vllm-deployment.yaml -f vllm-service.yaml
    
  7. Monitor the deployment rollout:

    kubectl get pods -l app=vllm -w
    

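    Expected output: the pod transitions from Pending to ContainerCreating to Running. This can take 2-5 minutes depending on model size, and longer on a first run while the node pulls the image. Running only means the container has started; the model may still be loading.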

  8. Verify the vLLM API is responsive:

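    kubectl port-forward svc/vllm-service 8000:8000 &
    sleep 2
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
    

    Expected output: 200. The sleep gives the port-forward time to bind before curl runs. Stop the port-forward with kill %1 (or the job number shown by jobs) once you are done testing.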
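
A healthy /health response does not prove that the model answers requests, so a request against the OpenAI-compatible API is a useful follow-up. The snippet below is a minimal sketch, assuming the port-forward from step 8 is still running on localhost:8000. The helper names completion_body and smoke_test are hypothetical. It also assumes vLLM serves the model under the value passed to --model, which holds when no separate served model name is configured.

```shell
# Assumes the port-forward from step 8 is still active.
VLLM_URL="${VLLM_URL:-http://localhost:8000}"
MODEL="/models/Meta-Llama-3-8B-Instruct"   # same value as the --model argument

# Build the JSON body for an OpenAI-style completion request.
# Arguments: model name, prompt, max tokens.
# Note: the prompt is inserted as-is, so it must not contain double quotes.
completion_body() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$1" "$2" "$3"
}

# List the served models, then request one short completion.
smoke_test() {
  curl -s "$VLLM_URL/v1/models"
  echo
  curl -s "$VLLM_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(completion_body "$MODEL" "Hello" 16)"
  echo
}
```

Paste the snippet into your shell and run smoke_test. The first response should list the model path, and the second should contain generated text in a choices array.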

Verification

    kubectl logs deployment/vllm-inference | grep "Uvicorn running"

Expected output: a line containing Uvicorn running on http://0.0.0.0:8000 confirming the server is ready.
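
For scripted checks, polling the /health endpoint is an alternative to reading logs. The following is a sketch under assumptions: wait_for_health is a hypothetical helper, and the example URL assumes the port-forward from step 8 is active.

```shell
# Poll a URL until it returns HTTP 200 or the attempts run out.
# Usage: wait_for_health URL [attempts] [delay_seconds]
wait_for_health() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w prints only the status code; connection failures are tolerated.
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}
```

For example, wait_for_health http://localhost:8000/health 30 10 waits up to roughly five minutes, which matches the startup window mentioned in step 7.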

Common failures

  • Pod stuck in Pending: the GPU node may have no free GPU, or the nodeSelector and toleration may not match any node. Check kubectl describe pod -l app=vllm for FailedScheduling events, then run kubectl describe node <gpu-node> | grep -A12 "Allocated resources" and confirm nvidia.com/gpu has unallocated capacity.
  • Model not found in container: confirm the PVC is populated with model weights and mounted correctly. List the mount from inside the pod: kubectl exec -it deployment/vllm-inference -- ls /models/. The listing must include Meta-Llama-3-8B-Instruct, the directory passed to --model.
  • CUDA error on startup: the NVIDIA device plugin may not be running. Verify: kubectl get pods -n kube-system | grep nvidia. If it is absent, install it with kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
  • OOMKilled after serving requests: the container exceeded its 32Gi memory limit. Increase the limit or reduce --max-model-len. Check the pod's memory usage with kubectl top pod -l app=vllm, which requires metrics-server in the cluster.
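
The checks above can be collected into a few shell helpers. This is a sketch under assumptions: the function names are hypothetical, and the selectors reuse the accelerator=nvidia and app=vllm labels from the manifests in this guide.

```shell
# Show allocatable GPUs per labelled node. An empty or <none> GPU column
# suggests the device plugin is not advertising GPUs on that node.
gpu_allocatable() {
  kubectl get nodes -l accelerator=nvidia \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Print only the Events section of the vLLM pod description,
# where FailedScheduling messages appear for Pending pods.
pending_reason() {
  kubectl describe pod -l app=vllm | sed -n '/^Events:/,$p'
}

# Print why the first container of the first vLLM pod last terminated,
# for example OOMKilled. Prints nothing if it has never terminated.
last_exit_reason() {
  kubectl get pod -l app=vllm \
    -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'
}
```

Run gpu_allocatable first when a pod is Pending, pending_reason to read the scheduler's explanation, and last_exit_reason after a restart to distinguish an OOM kill from an application error.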
