How to manage AI model weights with Kubernetes Persistent Volumes
Prerequisites: a Kubernetes cluster with a StorageClass, and model files to store
What this does
This guide shows how to manage AI model weight files on Kubernetes using Persistent Volumes (PVs) and Persistent Volume Claims (PVCs). It covers three strategies: ReadOnlyMany (one volume shared across pods for inference scaling), ReadWriteOnce (single-pod training or fine-tuning), and init-container-based weight downloading (pull weights once, reuse across replicas). Proper PV management avoids the cost and latency of re-downloading multi-gigabyte model files on every pod restart.
Steps
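Identify the available StorageClasses, their provisioners, and their reclaim policies. A StorageClass object does not list access modes; those depend on the provisioner, so check its documentation for ReadOnlyMany or ReadWriteMany support:
    kubectl get storageclass
    kubectl describe storageclass <name> | grep -i "access\|reclaim\|provisioner"
Expected output: a list of StorageClasses with their provisioners.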
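
When a cluster has many StorageClasses, the same information can be pulled from the JSON output instead of grepping. The sketch below assumes the input has the shape of `kubectl get storageclass -o json`; the function name and the sample values (including the `example.com/nfs` provisioner) are made up for illustration.

```python
import json


def summarize_storage_classes(kubectl_json: str) -> list[dict]:
    """Return name, provisioner, reclaim policy and binding mode per StorageClass."""
    items = json.loads(kubectl_json).get("items", [])
    summary = []
    for sc in items:
        summary.append({
            "name": sc["metadata"]["name"],
            "provisioner": sc.get("provisioner", ""),
            # The API server normally fills these in; fall back to the
            # documented defaults if a field is absent.
            "reclaimPolicy": sc.get("reclaimPolicy", "Delete"),
            "volumeBindingMode": sc.get("volumeBindingMode", "Immediate"),
        })
    return summary


# Sample input shaped like `kubectl get storageclass -o json` (values are made up).
sample = json.dumps({
    "items": [
        {
            "metadata": {"name": "premium-rwx"},
            "provisioner": "example.com/nfs",
            "reclaimPolicy": "Retain",
        },
    ]
})

for row in summarize_storage_classes(sample):
    print(row)
```

To run it against a real cluster, replace `sample` with the standard output of `kubectl get storageclass -o json`, for example captured through `subprocess.run(..., capture_output=True, text=True)`.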
Download model weights into a PV using an init job. Create a Kubernetes Job that runs once to download weights from Hugging Face or S3:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: download-weights
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: downloader
              image: python:3.11-slim
              command: ["/bin/sh", "-c"]
              args:
                - |
                  pip install huggingface_hub boto3 &&
                  python -c "
                  from huggingface_hub import snapshot_download
                  snapshot_download('meta-llama/Meta-Llama-3-8B-Instruct', local_dir='/data/models/llama-3-8b')
                  "
              env:
                # Assumption: gated repositories such as meta-llama need a token;
                # hf-token is a hypothetical Secret that holds it.
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token
                      key: token
              volumeMounts:
                - name: models
                  mountPath: /data/models
          volumes:
            - name: models
              persistentVolumeClaim:
                claimName: model-weights-pvc
Create the PVC that both the init job and inference pods will share. Apply it before the Job so the claim exists when the Job's pod is scheduled:
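    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-weights-pvc
    spec:
      # If the download Job cannot write through a ReadOnlyMany claim on your
      # provisioner, request ReadWriteMany instead and keep readOnly: true on
      # the inference mounts.
      accessModes: [ReadOnlyMany]
      resources:
        requests:
          storage: 100Gi
      storageClassName: premium-rwx
For inference deployments, mount the PVC as read-only: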
    volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-weights-pvc
          readOnly: true
    containers:
      - name: vllm
        volumeMounts:
          - name: models
            mountPath: /models
            readOnly: true
        args: ["--model", "/models/llama-3-8b"]
For multi-model serving, organize the PVC directory structure:
    /data/models/
      llama-3-8b/
      mistral-7b/
      embeddings/
        bge-large/
Inference pods reference models by path relative to the mount point, for example /models/embeddings/bge-large when the volume is mounted at /models.
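
The relative-path convention can be sketched as a small helper that discovers model directories under a mount point. This is an illustration, not part of any serving framework: `list_models` and `model_path` are hypothetical names, and the sketch assumes a directory counts as a model when it contains a `config.json`, ignoring nested `config.json` files inside a model that was already found.

```python
from pathlib import Path


def list_models(mount_point: str) -> list[str]:
    """Return model directories under mount_point as relative POSIX paths."""
    root = Path(mount_point)
    # Visit shallow directories first so a parent model is seen before its children.
    candidates = sorted(
        (cfg.parent.relative_to(root) for cfg in root.rglob("config.json")),
        key=lambda p: len(p.parts),
    )
    models: list[Path] = []
    for cand in candidates:
        # Skip config.json files nested inside an already-found model directory.
        if any(m in cand.parents for m in models):
            continue
        models.append(cand)
    return sorted(m.as_posix() for m in models)


def model_path(mount_point: str, name: str) -> str:
    """Build the absolute path a container would pass to --model."""
    return (Path(mount_point) / name).as_posix()
```

With the layout shown earlier mounted at `/models`, `model_path("/models", "llama-3-8b")` yields `/models/llama-3-8b`, the value passed to `--model` in the inference container.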
Apply a retention policy. Keep the model PVC in a dedicated namespace (model-storage) so that its lifecycle is independent of individual inference workloads. A PVC can only be mounted by pods in its own namespace, so the download Job and the inference Deployment must run in model-storage as well:
    kubectl create namespace model-storage
    kubectl apply -f pvc.yaml -n model-storage
    kubectl apply -f download-job.yaml -n model-storage
Validate that the download job completed successfully and the model files are present:
    kubectl logs job/download-weights -n model-storage | tail -5
    kubectl exec deployment/vllm -n model-storage -- ls /models/llama-3-8b/ | head
Expected output: model files like config.json, tokenizer.json, and .safetensors files.
Verification
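    kubectl exec deployment/vllm -n model-storage -- ls /models/llama-3-8b/ | grep -c '\.safetensors$'
Expected output: the number of safetensors shard files (at least 1). Anchoring the pattern with $ keeps model.safetensors.index.json out of the count.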
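
A file count alone does not show whether every shard arrived. The sketch below is a stricter check that could run inside the pod. It assumes the Hugging Face sharded layout, where `model.safetensors.index.json` holds a `weight_map` from tensor names to shard file names; `check_model_dir` is a hypothetical helper name.

```python
import json
from pathlib import Path


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the directory looks complete."""
    root = Path(model_dir)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")

    shard_paths = sorted(root.glob("*.safetensors"))
    shard_names = {p.name for p in shard_paths}
    if not shard_paths:
        problems.append("no .safetensors files found")

    index = root / "model.safetensors.index.json"
    if index.is_file():
        # Sharded checkpoints map each tensor name to the shard that stores it.
        expected = set(json.loads(index.read_text())["weight_map"].values())
        for name in sorted(expected - shard_names):
            problems.append(f"missing shard {name}")

    for p in shard_paths:
        # A zero-byte shard usually means the download was interrupted.
        if p.stat().st_size == 0:
            problems.append(f"empty file {p.name}")
    return problems
```

An empty return value means the basic files are in place. It does not verify checksums, so it catches missing or truncated-to-zero shards but not silent corruption.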
Common failures
- Weight download job fails with OOM: the downloader container needs enough memory to hold temporary files. Set resources.requests.memory: 4Gi (and raise any memory limit to match), then increase it if downloading large models.
- PVC stuck in Pending: the StorageClass may not support the requested access mode, such as ReadOnlyMany. Check the events with kubectl describe pvc model-weights-pvc -n model-storage. A claim on a StorageClass with volumeBindingMode: WaitForFirstConsumer also stays Pending until a pod that uses it is scheduled. If the access mode is the problem, fall back to ReadWriteOnce (only pods on one node can mount the volume; others use separate copies).
- Model files corrupted after node failure: PVs backed by block storage (EBS, Azure Disk) are zonal and survive node failures. For node-local storage, use replication or re-download logic in an init container.
- Multiple models exceed PVC capacity: use separate PVCs per model family and reference them as separate volume mounts in the Deployment spec.
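
The re-download logic and the capacity problem mentioned in the list above can be combined in one guard, sketched here under stated assumptions. `ensure_weights` and the `.download-complete` marker file are hypothetical names, and the actual download is passed in as a callable so the guard itself stays independent of Hugging Face or S3.

```python
import shutil
from pathlib import Path
from typing import Callable

MARKER = ".download-complete"  # hypothetical marker file name


def ensure_weights(
    target_dir: str,
    download: Callable[[str], None],
    required_bytes: int = 0,
) -> bool:
    """Download into target_dir unless an earlier run finished.

    Returns True if a download ran, False if the weights were already present.
    """
    target = Path(target_dir)
    marker = target / MARKER
    if marker.exists():
        return False  # a previous run completed, nothing to do

    target.mkdir(parents=True, exist_ok=True)
    free = shutil.disk_usage(target).free
    if free < required_bytes:
        raise RuntimeError(
            f"need {required_bytes} bytes, only {free} free on the volume"
        )

    download(str(target))
    # The marker is written only after download() returns, so an
    # interrupted or failed run is retried on the next start.
    marker.write_text("ok\n")
    return True
```

In the download Job from the steps above, the callable could wrap the existing call, for example `lambda d: snapshot_download('meta-llama/Meta-Llama-3-8B-Instruct', local_dir=d)`, with `required_bytes` set to a rough estimate of the model size.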