RUNLOCALAI v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated

How to manage AI model weights with Kubernetes Persistent Volumes

advanced · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

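A Kubernetes cluster with at least one StorageClass, and access to the model files. The example model, meta-llama/Meta-Llama-3-8B-Instruct, is gated on Hugging Face, so downloading it requires an access token.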

What this does

This guide shows how to manage AI model weight files on Kubernetes with Persistent Volumes (PVs) and Persistent Volume Claims (PVCs). It covers three strategies: ReadOnlyMany (one volume shared read-only across pods for inference scaling), ReadWriteOnce (a single-node volume for training or fine-tuning), and init-container-based or Job-based weight downloading (pull weights once, reuse across replicas). Managing PVs this way avoids the cost and latency of re-downloading multi-gigabyte model files on every pod restart.

Steps

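  1. Identify the available StorageClasses and their provisioners. A StorageClass does not list access modes; those depend on the provisioner (CSI driver), so confirm them in the driver's documentation:

    kubectl get storageclass
    kubectl describe storageclass <name> | grep -i "provisioner\|reclaim\|binding"
    

    Expected output: a list of StorageClasses, followed by the provisioner, reclaim policy, and volume binding mode of the class you described.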
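
When a cluster has many StorageClasses, the same fields can be pulled out of the JSON form of the listing. The following is a minimal sketch, not part of the original guide: the function name and the sample data (including the `example.com/...` provisioner names) are hypothetical, and the defaults applied for omitted fields follow the Kubernetes API defaults (`Delete` and `Immediate`).

```python
import json

def summarize_storage_classes(sc_list: dict) -> list[dict]:
    """Extract the fields that matter for model storage from
    the output of `kubectl get storageclass -o json`."""
    rows = []
    for item in sc_list.get("items", []):
        rows.append({
            "name": item["metadata"]["name"],
            "provisioner": item.get("provisioner", ""),
            # Kubernetes defaults: reclaimPolicy=Delete, volumeBindingMode=Immediate
            "reclaimPolicy": item.get("reclaimPolicy", "Delete"),
            "volumeBindingMode": item.get("volumeBindingMode", "Immediate"),
            "allowVolumeExpansion": item.get("allowVolumeExpansion", False),
        })
    return rows

# Hypothetical sample standing in for real kubectl output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "premium-rwx"}, "provisioner": "example.com/nfs",
   "reclaimPolicy": "Retain", "volumeBindingMode": "Immediate",
   "allowVolumeExpansion": true},
  {"metadata": {"name": "standard"}, "provisioner": "example.com/block"}
]}
""")

for row in summarize_storage_classes(SAMPLE):
    print(row)
```

To run it against a real cluster, replace `SAMPLE` with `json.load(sys.stdin)` and pipe in `kubectl get storageclass -o json`. The access modes themselves still have to come from the provisioner's documentation.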

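  2. Download model weights into a PV with a one-off Job. The Job below pulls weights from Hugging Face (adapt the script for S3). Its pod cannot start until the PVC from step 3 exists, so apply the PVC first:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: download-weights
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: downloader
              image: python:3.11-slim
              env:
                # Meta-Llama-3 is a gated repo, so the download needs a Hugging Face token.
                # "hf-token" is a placeholder Secret name; create it in the Job's namespace.
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token
                      key: token
              command: ["/bin/sh", "-c"]
              args:
                - |
                  pip install huggingface_hub boto3 &&
                  python -c "
                  from huggingface_hub import snapshot_download;
                  snapshot_download('meta-llama/Meta-Llama-3-8B-Instruct', local_dir='/data/models/llama-3-8b')
                  "
              volumeMounts:
                - name: models
                  mountPath: /data/models   # weights land in /data/models/llama-3-8b
          volumes:
            - name: models
              persistentVolumeClaim:
                claimName: model-weights-pvc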
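
Because the Job uses restartPolicy: OnFailure, the download script may run more than once. A minimal sketch of a script that skips the download when weights are already on the volume is shown below. The helper names are hypothetical, and the presence check (a config.json plus at least one .safetensors file) is a simplifying assumption that does not detect a partially downloaded model.

```python
import os
import tempfile
from pathlib import Path

def weights_present(model_dir: Path) -> bool:
    """True if model_dir holds a config.json and at least one .safetensors file."""
    return (model_dir / "config.json").is_file() and any(model_dir.glob("*.safetensors"))

def ensure_weights(repo_id: str, model_dir: Path) -> bool:
    """Download repo_id into model_dir unless weights are already there.

    Returns True if a download ran, False if it was skipped.
    """
    if weights_present(model_dir):
        return False
    # Imported lazily so the skip path works without the package installed.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id, local_dir=str(model_dir), token=os.environ.get("HF_TOKEN"))
    return True

# Demonstrate the skip path on a throwaway directory (no network access needed).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "llama-3-8b"
    demo_dir.mkdir()
    (demo_dir / "config.json").write_text("{}")
    (demo_dir / "model-00001-of-00004.safetensors").write_bytes(b"")
    print(ensure_weights("meta-llama/Meta-Llama-3-8B-Instruct", demo_dir))  # prints False
```

In the Job, this script would replace the inline python -c call, with model_dir set to /data/models/llama-3-8b.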
    
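  3. Create the PVC that both the download Job and the inference pods will share. The Job has to write to the volume, so request ReadWriteMany and enforce read-only access at mount time in the inference pods:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-weights-pvc
    spec:
      accessModes: [ReadWriteMany]   # the download Job writes; inference pods mount readOnly
      resources:
        requests:
          storage: 100Gi
      storageClassName: premium-rwx   # example name; use a class whose provisioner supports RWX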
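
The storage request can be sanity-checked with simple arithmetic: weight size is roughly parameter count times bytes per parameter. The sketch below assumes approximate parameter counts for the two example models and a 1.5x headroom factor for tokenizer files, temporary download data, and future additions; both are illustrative assumptions, not figures from this guide.

```python
import math

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def weights_gib(params: float, dtype: str = "bf16") -> float:
    """Approximate on-disk size of the weights in GiB."""
    return params * BYTES_PER_PARAM[dtype] / 2**30

def pvc_request_gi(models: dict[str, float], dtype: str = "bf16", headroom: float = 1.5) -> int:
    """Total size of all models times a headroom factor, rounded up to whole Gi."""
    total = sum(weights_gib(p, dtype) for p in models.values())
    return math.ceil(total * headroom)

# Approximate parameter counts (assumptions for illustration).
models = {"llama-3-8b": 8.0e9, "mistral-7b": 7.2e9}

print(f"llama-3-8b in bf16: {weights_gib(8.0e9):.1f} GiB")
print(f"suggested request: {pvc_request_gi(models)}Gi")
```

With these inputs the estimate comes out well under the 100Gi requested above, which leaves room for more models on the same claim.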
    
  4. For inference deployments, mount the PVC as read-only:

    volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-weights-pvc
          readOnly: true
    containers:
      - name: vllm
        volumeMounts:
          - name: models
            mountPath: /models
            readOnly: true
        args: ["--model", "/models/llama-3-8b"]
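
A read-only mount is easy to forget when several containers share the volume. As a rough sketch, the check below walks a pod spec (already parsed into a Python dict, for example from YAML or `kubectl get -o json`) and reports PVC-backed mounts that lack readOnly: true. The function name and the sample spec are hypothetical; the sample mirrors the fragment above.

```python
def writable_pvc_mounts(pod_spec: dict) -> list[tuple[str, str]]:
    """Return (container, volume) pairs where a PVC-backed volume
    is mounted without readOnly: true."""
    pvc_volumes = {
        v["name"] for v in pod_spec.get("volumes", []) if "persistentVolumeClaim" in v
    }
    findings = []
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if mount["name"] in pvc_volumes and not mount.get("readOnly", False):
                findings.append((container["name"], mount["name"]))
    return findings

# Sample spec mirroring the inference fragment in this step.
pod_spec = {
    "volumes": [
        {"name": "models",
         "persistentVolumeClaim": {"claimName": "model-weights-pvc", "readOnly": True}},
    ],
    "containers": [
        {"name": "vllm",
         "volumeMounts": [{"name": "models", "mountPath": "/models", "readOnly": True}],
         "args": ["--model", "/models/llama-3-8b"]},
    ],
}

print(writable_pvc_mounts(pod_spec))  # prints []
```

An empty list means every PVC mount in the spec is read-only.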
    
  5. For multi-model serving, organize the PVC directory structure:

    /data/models/
      llama-3-8b/
      mistral-7b/
      embeddings/
        bge-large/
    

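    The download Job sees this tree under /data/models. Inference pods see the same directories under their own mount point (/models in step 4) and reference each model by its path relative to that mount point, for example /models/embeddings/bge-large.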
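
To see which models a pod can actually reach, the layout can be discovered at runtime. The sketch below assumes each model directory contains a config.json, as Hugging Face format checkpoints do; the function name is hypothetical and the demo builds the example tree in a temporary directory instead of a real mount.

```python
import tempfile
from pathlib import Path

def list_models(mount_point: Path) -> list[str]:
    """Return model paths relative to mount_point.

    A directory counts as a model if it contains a config.json.
    """
    return sorted(
        cfg.parent.relative_to(mount_point).as_posix()
        for cfg in mount_point.rglob("config.json")
    )

# Recreate the example layout in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("llama-3-8b", "mistral-7b", "embeddings/bge-large"):
        (root / rel).mkdir(parents=True)
        (root / rel / "config.json").write_text("{}")
    print(list_models(root))
    # prints ['embeddings/bge-large', 'llama-3-8b', 'mistral-7b']
```

Inside an inference pod, the call would be list_models(Path("/models")).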

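  6. Apply a retention policy. PersistentVolumes are cluster-scoped, but PVCs are namespaced, and a pod can only mount a PVC from its own namespace. Keep the PVC and the download Job in a dedicated namespace (model-storage) and run the inference Deployment in that namespace too. For the weights to outlive the claim, use a StorageClass whose reclaimPolicy is Retain: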

    kubectl create namespace model-storage
    kubectl apply -f pvc.yaml -n model-storage
    kubectl apply -f download-job.yaml -n model-storage
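
One way to make retention concrete on an already bound volume is to switch the PersistentVolume's reclaim policy to Retain with kubectl patch, so deleting the PVC does not delete the weights. The sketch below only builds the command; the function name and the PV name are hypothetical placeholders.

```python
import json
import shlex

def retain_patch_command(pv_name: str) -> list[str]:
    """Build the kubectl command that sets a PV's reclaim policy to Retain."""
    patch = {"spec": {"persistentVolumeReclaimPolicy": "Retain"}}
    return ["kubectl", "patch", "pv", pv_name, "-p", json.dumps(patch)]

# "pv-example" is a placeholder; use the volume bound to model-weights-pvc.
print(shlex.join(retain_patch_command("pv-example")))
```

The name of the bound volume can be read with kubectl get pvc model-weights-pvc -n model-storage -o jsonpath='{.spec.volumeName}'.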
    
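  7. Validate that the download Job completed and the model files are present. The exec command assumes the vllm Deployment runs in the same namespace as the PVC:

    kubectl wait --for=condition=complete job/download-weights -n model-storage --timeout=60s
    kubectl logs job/download-weights -n model-storage | tail -5
    kubectl exec deployment/vllm -n model-storage -- ls /models/llama-3-8b/ | head
    

    Expected output: a "condition met" line for the Job, then model files such as config.json, tokenizer.json, and .safetensors files.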

Verification

kubectl exec deployment/vllm -n model-storage -- ls /models/llama-3-8b/ | grep -c '\.safetensors$'

Expected output: the number of safetensors shard files. A count of 0 means the weights are missing.
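
Counting files confirms that something was downloaded, not that every shard arrived. Sharded Hugging Face checkpoints ship a model.safetensors.index.json whose weight_map names each shard file, so a stricter check can compare that list against the directory. This is a minimal sketch with a hypothetical function name; it does not verify checksums.

```python
import json
import tempfile
from pathlib import Path

def missing_shards(model_dir: Path) -> list[str]:
    """Return shard files named in the safetensors index that are absent on disk."""
    index = model_dir / "model.safetensors.index.json"
    if not index.is_file():
        # Single-file checkpoints have no index; require at least one weight file.
        return [] if any(model_dir.glob("*.safetensors")) else ["*.safetensors"]
    weight_map = json.loads(index.read_text())["weight_map"]
    expected = sorted(set(weight_map.values()))
    return [name for name in expected if not (model_dir / name).is_file()]

# Demo: an index that references two shards, only one of which exists.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "model.safetensors.index.json").write_text(json.dumps({
        "weight_map": {
            "layer.0.weight": "model-00001-of-00002.safetensors",
            "layer.1.weight": "model-00002-of-00002.safetensors",
        }
    }))
    (d / "model-00001-of-00002.safetensors").write_bytes(b"")
    print(missing_shards(d))  # prints ['model-00002-of-00002.safetensors']
```

An empty list means every shard the index references is present.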

Common failures

  • Weight download Job is OOMKilled: the downloader container exceeded its memory limit. Set resources.requests.memory and resources.limits.memory to 4Gi, and raise both when downloading larger models.
  • PVC stuck in Pending: the StorageClass may not support the access mode the PVC requests, or its volume binding mode is WaitForFirstConsumer and no pod uses the claim yet. Check the events with kubectl describe pvc model-weights-pvc -n model-storage. If the access mode is the problem, fall back to ReadWriteOnce, which restricts the volume to pods on a single node; pods on other nodes need their own copy.
  • Model files missing or corrupted after node failure: PVs backed by network block storage (EBS, Azure Disk) are zonal and survive node failures, so the data outlives the node. For node-local storage, use replication or re-download logic in an init container.
  • Multiple models exceed PVC capacity: use separate PVCs per model family and reference them as separate volume mounts in the Deployment spec.

Related guides

  • Deploy vLLM on Kubernetes with GPU node selection
  • Create a Kubernetes operator for managing AI model deployments
  • Pod disruption budgets for AI services during upgrades