What this does

This guide shows how to manage AI model weight files on Kubernetes using PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). It covers three strategies: ReadOnlyMany (one volume shared across pods for inference scaling), ReadWriteOnce (a volume mounted by a single node, for training or fine-tuning), and one-time weight downloading with a Job or init container (pull weights once, reuse across replicas). Proper PV management avoids the cost and latency of re-downloading multi-gigabyte model files on every pod restart.

Steps

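Identify available StorageClasses, their provisioners, and their reclaim policies. Access modes are not a StorageClass field; they depend on the provisioner (CSI driver), so check the driver's documentation for ReadWriteMany or ReadOnlyMany support:
```
kubectl get storageclass
kubectl describe storageclass <name> | grep -i "provisioner\|reclaim\|volumebindingmode"
```
Expected output: a list of StorageClasses, followed by the provisioner, reclaim policy, and volume binding mode of the class you described.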

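Download model weights into a PV using a one-time Job. The Job below downloads weights from Hugging Face (an S3 source would need boto3 or the AWS CLI instead). It mounts model-weights-pvc, so that PVC (defined in the next step) must exist before the Job's pod can start. The model repository is gated, so the Job reads an access token from a Secret:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: download-weights
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: downloader
          image: python:3.11-slim
          command: ["/bin/sh", "-c"]
          args:
            - |
              pip install --no-cache-dir huggingface_hub &&
              python -c "
              from huggingface_hub import snapshot_download
              snapshot_download('meta-llama/Meta-Llama-3-8B-Instruct', local_dir='/data/models/llama-3-8b')
              "
          env:
            # huggingface_hub reads the access token from HF_TOKEN.
            # The Secret name and key below are placeholders; create your own.
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: models
              mountPath: /data/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-weights-pvc
```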
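
The "pull weights once, reuse across replicas" idea can also be expressed as an idempotent download step, which is useful when the same logic runs in an init container on every pod start. The following is a minimal sketch, not the guide's own implementation: the marker file name `.download-complete` and the helper names `ensure_weights` and `hf_download` are assumptions introduced for illustration.

```python
from pathlib import Path
from typing import Callable

MARKER = ".download-complete"  # hypothetical marker file name


def ensure_weights(target_dir: str, download: Callable[[str], None]) -> bool:
    """Download into target_dir unless a previous run already finished.

    Returns True if a download ran, False if it was skipped.
    """
    target = Path(target_dir)
    marker = target / MARKER
    if marker.exists():
        return False  # a previous run completed; reuse the files on the PV
    target.mkdir(parents=True, exist_ok=True)
    download(str(target))
    # Written only after download() returns without raising, so an
    # interrupted download is retried on the next run.
    marker.write_text("ok\n")
    return True


def hf_download(local_dir: str) -> None:
    """Fetch the model used in this guide from Hugging Face."""
    # Imported lazily so the skip path works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct", local_dir=local_dir)
```

A container would call `ensure_weights("/data/models/llama-3-8b", hf_download)`. The marker is a simple convention; it does not protect against two pods downloading concurrently into the same directory.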

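Create the PVC that both the download Job and inference pods will share. Request ReadWriteMany so the Job can write; inference pods then mount the claim with readOnly: true. With many drivers, a claim that requests only ReadOnlyMany is mounted read-only and cannot be populated by the Job:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights-pvc
spec:
  # ReadWriteMany: one writer (the download Job) plus many read-only consumers.
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 100Gi
  storageClassName: premium-rwx
```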

For inference deployments, mount the PVC as read-only. The fragment below belongs in the Deployment's pod template spec:

```
volumes:
  - name: models
    persistentVolumeClaim:
      claimName: model-weights-pvc
      readOnly: true
containers:
  - name: vllm
    volumeMounts:
      - name: models
        mountPath: /models
        readOnly: true
    args: ["--model", "/models/llama-3-8b"]
```

For multi-model serving, organize the PVC directory structure:
```
/data/models/
  llama-3-8b/
  mistral-7b/
  embeddings/
    bge-large/
```
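The tree shows the volume as the download Job sees it (mounted at /data/models). Inference pods reference models by path relative to their own mount point, so with the volume mounted at /models the same directories appear as /models/llama-3-8b and /models/embeddings/bge-large.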
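
To illustrate how a serving process might map relative model names onto this layout, here is a small sketch. The helper names `list_models` and `model_path` are hypothetical, and treating any directory that contains a config.json as a model is an assumption that fits Hugging Face style checkpoints.

```python
from pathlib import Path


def list_models(mount_point: str) -> list[str]:
    """Return model directories under mount_point, relative to it."""
    root = Path(mount_point)
    found = []
    stack = [root]
    while stack:
        directory = stack.pop()
        if (directory / "config.json").is_file():
            found.append(directory.relative_to(root).as_posix())
            continue  # do not descend into a model directory
        stack.extend(child for child in directory.iterdir() if child.is_dir())
    return sorted(found)


def model_path(mount_point: str, name: str) -> str:
    """Resolve a relative name such as 'embeddings/bge-large' to a full path."""
    root = Path(mount_point).resolve()
    path = (root / name).resolve()
    if root not in path.parents:
        raise ValueError(f"{name!r} is not inside the mount point")
    if not path.is_dir():
        raise FileNotFoundError(str(path))
    return str(path)
```

With the volume mounted at /models, `model_path("/models", "embeddings/bge-large")` yields the directory to pass to the server, and names that climb out of the mount point are rejected.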

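Apply a retention policy. Keep the model PVC and the download Job in a dedicated namespace (model-storage) so their lifecycle is independent of application namespaces. PVCs are namespaced, and a pod can only mount a claim from its own namespace, so the inference Deployment must also run in model-storage. To keep the weights when a claim is deleted, use a StorageClass (or patch the bound PV) with the reclaim policy set to Retain:

```
kubectl create namespace model-storage
kubectl apply -f pvc.yaml -n model-storage
kubectl apply -f download-job.yaml -n model-storage
```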

Validate that the download Job completed successfully and that the model files are present:
```
kubectl logs job/download-weights -n model-storage | tail -5
kubectl exec deployment/vllm -n model-storage -- ls /models/llama-3-8b/ | head
```
Expected output: model files such as config.json, tokenizer.json, and .safetensors files.

Verification

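```
kubectl exec deployment/vllm -n model-storage -- ls /models/llama-3-8b/ | grep -c '\.safetensors$'
```

Expected output: the number of .safetensors shard files, which must be greater than 0. The pattern is anchored so that model.safetensors.index.json is not counted.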
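
A file count only proves that some shards exist. A stricter check compares the shard list in the checkpoint index against the files on disk. The sketch below assumes the Hugging Face sharded-checkpoint layout, where model.safetensors.index.json holds a `weight_map` from tensor names to shard file names, and an unsharded checkpoint is a single model.safetensors file; the function name `missing_shards` is hypothetical.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """Return shard files that the index names but that are absent or empty."""
    root = Path(model_dir)
    index = root / "model.safetensors.index.json"
    if index.is_file():
        weight_map = json.loads(index.read_text())["weight_map"]
        expected = sorted(set(weight_map.values()))
    else:
        # Assumption: an unsharded checkpoint ships one model.safetensors file.
        expected = ["model.safetensors"]
    return [
        name
        for name in expected
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]
```

An empty list from `missing_shards("/models/llama-3-8b")` means every shard named in the index is present and non-empty. It does not verify checksums.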

Common failures

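Weight download job fails with OOM: the downloader container was killed for exceeding its memory limit. Set resources.requests.memory and resources.limits.memory to 4Gi on the downloader container, and increase them if large model downloads are still killed.
PVC stuck in Pending: the StorageClass may not support the requested access mode, or it uses volumeBindingMode: WaitForFirstConsumer and no pod has mounted the claim yet. Check the events with kubectl describe pvc model-weights-pvc -n model-storage. If shared access is unavailable, fall back to ReadWriteOnce, which allows mounting from a single node: pods on that node share the volume, and pods on other nodes need separate copies.
Model files corrupted after node failure: PVs backed by network block storage (EBS, Azure Disk) are zonal and survive node failures, although replacement pods must be scheduled in the same zone. Node-local storage is lost with the node, so use replication or re-download logic in an init container.
Multiple models exceed PVC capacity: use separate PVCs per model family and reference them as separate volume mounts in the Deployment spec.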
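
Whether a set of models fits in one PVC can be estimated before downloading. This is a rough sketch under stated assumptions: weights are stored at 2 bytes per parameter (fp16 or bf16), the parameter counts are nominal figures supplied by the caller, and the 20% headroom for tokenizer files and temporary data is an arbitrary illustrative choice. The function names are hypothetical.

```python
def weights_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate size of a checkpoint's weights in GiB."""
    return params_billions * 1e9 * bytes_per_param / 2**30


def fits(models: dict[str, float], capacity_gib: float, headroom: float = 0.2) -> bool:
    """Check whether all models, plus headroom, fit in capacity_gib.

    models maps a model name to its parameter count in billions.
    """
    total = sum(weights_gib(params) for params in models.values())
    return total * (1 + headroom) <= capacity_gib
```

For example, `weights_gib(8)` is about 14.9, so a nominal 8B and a nominal 7B model together need roughly 28 GiB before headroom and fit comfortably in the 100Gi claim, while a single 70B model at 2 bytes per parameter (about 130 GiB) does not.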

How to manage AI model weights with Kubernetes Persistent Volumes
