12. Model Repository
A model repository provides centralized storage, versioning, and distribution for AI models across cluster nodes, eliminating redundant downloads and enabling rapid inference deployment.
Architecture: Storing on a Filesystem vs Object Storage
Small clusters often use shared filesystem storage (NFS, CephFS) for simplicity. Object storage (MinIO, S3) scales better for large model archives and supports object versioning when it is enabled on a bucket:
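# Add the MinIO chart repository (skip if it is already configured), then
# deploy MinIO for S3-compatible object storage
helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --namespace model-storage \
  --create-namespace \
  --set mode=standalone \
  --set resources.requests.memory=4Gi \
  --set persistence.size=500Gi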
Inside the cluster, MinIO is reachable at its service endpoint (minio.model-storage:9000 in this setup):
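# Retrieve MinIO credentials. Each value is base64-encoded under its own key;
# recent chart releases use rootUser/rootPassword, older ones accesskey/secretkey.
kubectl get secret minio -n model-storage -o jsonpath='{.data.rootUser}' | base64 -d
kubectl get secret minio -n model-storage -o jsonpath='{.data.rootPassword}' | base64 -d
# Configure mc client. --command overrides the image entrypoint (already mc);
# the alias lasts only as long as this temporary pod.
kubectl run mc-client --rm -it --image=minio/mc --restart=Never --command -- \
  mc alias set localai http://minio.model-storage:9000 \
  "$(kubectl get secret minio -n model-storage -o jsonpath='{.data.rootUser}' | base64 -d)" \
  "$(kubectl get secret minio -n model-storage -o jsonpath='{.data.rootPassword}' | base64 -d)"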
Storing and Retrieving Models
Upload models over the S3 protocol with any S3-compatible client. The commands here use mc:
# Download a GGUF model and upload to MinIO
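curl -L -o /tmp/llama-3-8b.Q4_K_M.gguf \
  "https://huggingface.co/NousResearch/Meta-Llama-3-8B-GGUF/resolve/main/llama-3-8b.Q4_K_M.gguf"
# Replace minioadmin/minioadmin with the credentials stored in the MinIO Secret
mc alias set localai http://minio.model-storage:9000 minioadmin minioadmin
# Create the bucket if it does not exist yet, then upload the model
mc mb --ignore-existing localai/models
mc cp /tmp/llama-3-8b.Q4_K_M.gguf localai/models/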
Inference servers then pull models directly from the object store, avoiding per-node downloads.
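As a rough sketch of that pull step, a startup script could copy the model into a node-local cache only when it is missing. The function names and the MODEL_CACHE_DIR variable are hypothetical, and the sketch assumes mc is installed and the localai alias is already configured.

```shell
# Hypothetical helpers: fetch a model from the object store into a local cache.
# MODEL_CACHE_DIR and both function names are illustrative, not part of mc.
MODEL_CACHE_DIR="${MODEL_CACHE_DIR:-/var/cache/models}"

# Map an object path such as localai/models/foo.gguf to its cache location.
model_cache_path() {
  printf '%s/%s\n' "$MODEL_CACHE_DIR" "$(basename "$1")"
}

# Download the object only if it is not cached yet; print the local path.
fetch_model() {
  src="$1"
  dest="$(model_cache_path "$src")"
  if [ ! -s "$dest" ]; then
    mkdir -p "$MODEL_CACHE_DIR" || return 1
    # Copy to a temporary name first so an interrupted transfer is never
    # mistaken for a complete model file.
    mc cp "$src" "$dest.partial" >&2 || return 1
    mv "$dest.partial" "$dest" || return 1
  fi
  printf '%s\n' "$dest"
}
```

An inference entrypoint could then call MODEL_PATH="$(fetch_model localai/models/llama-3-8b.Q4_K_M.gguf)" and pass $MODEL_PATH to the server. A second start on the same node reuses the cached file.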
Model Registry with HuggingFace Hub Integration
For teams that already pull models from the HuggingFace Hub, the Hub itself can act as the registry. Store the access token in a Kubernetes Secret and inject it into inference pods:
# Set HF token as Kubernetes secret (quoted so the token is passed as one argument)
kubectl create secret generic hf-credentials \
  --from-literal=HF_TOKEN="$(cat ~/.cache/huggingface/token)"
# Use in inference deployment
apiVersion: v1
kind: Pod
metadata:
  name: inference-hf
spec:
  containers:
    - name: inference
      # Confirm the image path and tag in the registry before use
      image: ghcr.io/ggerganov/llama.cpp:latest
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-credentials
              key: HF_TOKEN
        - name: MODEL_ID
          value: "meta-llama/Meta-Llama-3-8B"
This pattern centralizes authentication and enables model access auditing.
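Inside the container, the injected token can authenticate downloads from the Hub. The sketch below is one possible way to do that with curl; the function names are hypothetical, and it assumes the Hub's resolve URL layout (the same one used in the earlier download command) and bearer-token authentication.

```shell
# Build the download URL for a file in a Hub repository.
# Arguments: repo id, file name, optional revision (defaults to main).
hf_resolve_url() {
  printf 'https://huggingface.co/%s/resolve/%s/%s\n' "$1" "${3:-main}" "$2"
}

# Download a file, sending HF_TOKEN as a bearer credential.
# Arguments: repo id, file name, output path.
hf_download() {
  if [ -z "${HF_TOKEN:-}" ]; then
    # Fail early so a missing or misnamed Secret is easy to spot
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  curl -fL -H "Authorization: Bearer ${HF_TOKEN}" -o "$3" \
    "$(hf_resolve_url "$1" "$2")"
}
```

For example, hf_download "$MODEL_ID" config.json /tmp/config.json would fetch one file from the repository named in the pod's MODEL_ID variable, where config.json is only an illustrative file name.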
Version Conventions
Coordinate model versions with semantic tagging or date-based versioning:
models/
  llama-3-8b/
    tags/
      latest -> v2
      v1 -> 2024-01-15/llama-3-8b.Q4_K_M.gguf
      v2 -> 2024-03-20/llama-3-8b.Q5_K_M.gguf
    2024-01-15/llama-3-8b.Q4_K_M.gguf
    2024-03-20/llama-3-8b.Q5_K_M.gguf
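Inference configs reference a tag rather than a dated file. Repointing the tag rolls a new version out (or back) without editing every config, and two tags can be served side by side for A/B testing.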
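Object stores have no symbolic links, so one way to realize the tag entries in the layout is a small text manifest stored next to the models. The sketch below assumes a hypothetical manifest format of one "name target" pair per line and a hypothetical function name. It follows tag-to-tag references until it reaches a concrete path.

```shell
# Resolve a tag through a plain-text manifest of "name target" lines.
# The manifest format and function name are assumptions made for this sketch.
resolve_model_tag() {
  manifest="$1"
  name="$2"
  hops=0
  while [ "$hops" -lt 8 ]; do
    # Look up the current name; take the first matching entry
    target="$(awk -v n="$name" '$1 == n { print $2; exit }' "$manifest")"
    if [ -z "$target" ]; then
      # No entry: an unknown tag on the first lookup, otherwise a concrete path
      if [ "$hops" -eq 0 ]; then return 1; fi
      printf '%s\n' "$name"
      return 0
    fi
    name="$target"
    hops=$((hops + 1))
  done
  return 1  # too many hops, most likely a cycle between tags
}
```

With a manifest that mirrors the layout (latest pointing to v2, v2 pointing to 2024-03-20/llama-3-8b.Q5_K_M.gguf), resolving latest yields the dated path, which can then be appended to localai/models/llama-3-8b/ for the download.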
As an exercise, deploy MinIO, upload a small model artifact (even a text file will do), configure a Kubernetes pod to read from MinIO using credentials supplied through a Secret, and verify that the pod can read the model data.
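One possible starting point for the pod is sketched below as a shell function that prints a manifest. It is a sketch under stated assumptions, not the only way to complete the exercise: the pod name model-reader and the function name are hypothetical, the Secret is assumed to be named minio with keys rootUser and rootPassword, and the pod lists the bucket with mc instead of mounting it. mc reads the MC_HOST_localai variable as the definition of the localai alias, and Kubernetes expands $(VAR) references to variables defined earlier in the same env list.

```shell
# Print a Pod manifest that lists the models bucket using Secret-backed credentials.
# The heredoc delimiter is quoted so the shell leaves $(...) untouched for Kubernetes.
render_reader_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: model-reader
  namespace: model-storage
spec:
  restartPolicy: Never
  containers:
    - name: mc
      image: minio/mc
      # The image entrypoint is mc, so these are its arguments
      args: ["ls", "localai/models"]
      env:
        - name: MINIO_USER
          valueFrom:
            secretKeyRef:
              name: minio
              key: rootUser
        - name: MINIO_PASSWORD
          valueFrom:
            secretKeyRef:
              name: minio
              key: rootPassword
        - name: MC_HOST_localai
          value: "http://$(MINIO_USER):$(MINIO_PASSWORD)@minio.model-storage:9000"
EOF
}
```

Piping the output to kubectl apply -f - creates the pod, and kubectl logs model-reader -n model-storage should then show the uploaded artifact if the credentials and bucket name are correct.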