12. Model Repository

Chapter 12 of 18 · 20 min

A model repository provides centralized storage, versioning, and distribution for AI models across cluster nodes, eliminating redundant downloads and enabling rapid inference deployment.

Architecture: Storing on a Filesystem vs Object Storage

Small clusters often use shared filesystem storage (NFS, CephFS) for simplicity. Object storage (MinIO, S3) scales better for large model archives and supports object versioning once it is enabled on a bucket:

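# Deploy MinIO for S3-compatible object storage
# (assumes the MinIO community chart repository at charts.min.io)
helm repo add minio https://charts.min.io/
helm repo update
helm install minio minio/minio \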
  --namespace model-storage \
  --create-namespace \
  --set mode=standalone \
  --set resources.requests.memory=4Gi \
  --set persistence.size=500Gi

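Inside the cluster, MinIO is reachable at the service endpoint http://minio.model-storage:9000. Retrieve the credentials and register that endpoint with the mc client: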

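# Retrieve MinIO credentials (each value under .data is base64-encoded separately)
kubectl get secret minio -n model-storage \
  -o go-template='{{range $k, $v := .data}}{{$k}}={{$v | base64decode}}{{"\n"}}{{end}}'

# Configure the mc client in a temporary pod. The minio/mc image runs mc as its
# entrypoint, so override it with a shell; the alias lasts only for this session.
# Key names rootUser/rootPassword match the current chart; older releases use accesskey/secretkey.
ACCESS_KEY="$(kubectl get secret minio -n model-storage -o jsonpath='{.data.rootUser}' | base64 -d)"
SECRET_KEY="$(kubectl get secret minio -n model-storage -o jsonpath='{.data.rootPassword}' | base64 -d)"
kubectl run mc-client --rm -it --image=minio/mc --restart=Never \
  --env="ACCESS_KEY=$ACCESS_KEY" --env="SECRET_KEY=$SECRET_KEY" --command -- \
  sh -c 'mc alias set localai http://minio.model-storage:9000 "$ACCESS_KEY" "$SECRET_KEY" && exec sh'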

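Each value under .data in a Kubernetes Secret is base64-encoded on its own, so fields are decoded one at a time rather than as a whole map. A minimal sketch of that step, using a hypothetical helper named decode_secret_value and a sample value in place of a live cluster:

```shell
# decode_secret_value ENCODED: print the plain-text form of one base64-encoded
# Secret field. The helper name is illustrative, not part of kubectl.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# Sample value standing in for the output of
#   kubectl get secret minio -n model-storage -o jsonpath='{.data.<key>}'
sample_encoded="$(printf '%s' 'minioadmin' | base64)"
decode_secret_value "$sample_encoded"
```
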
Storing and Retrieving Models

Store models using the S3 protocol with the model registry tool of choice:

# Download a GGUF model and upload to MinIO
curl -L -o /tmp/llama-3-8b.Q4_K_M.gguf \
  "https://huggingface.co/NousResearch/Meta-Llama-3-8B-GGUF/resolve/main/llama-3-8b.Q4_K_M.gguf"

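# Point mc at MinIO (run this where the service name resolves, e.g. inside the cluster).
# minioadmin/minioadmin are MinIO's built-in defaults; substitute the values
# from the minio secret if the chart generated its own credentials.
mc alias set localai http://minio.model-storage:9000 minioadmin minioadmin

# Create the bucket if it does not exist, then upload the model
mc mb --ignore-existing localai/models
mc cp /tmp/llama-3-8b.Q4_K_M.gguf localai/models/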
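
Large model files can be truncated in transit, so it may help to record a checksum next to the artifact before uploading it. The sketch below assumes GNU sha256sum is available; the helper names write_checksum and verify_checksum and the .sha256 sidecar convention are illustrative choices, not something MinIO requires.

```shell
# write_checksum FILE: write FILE.sha256 containing the SHA-256 digest of FILE.
write_checksum() {
  file=$1
  dir=$(dirname "$file")
  base=$(basename "$file")
  # Run inside the file's directory so the sidecar stores a relative name
  # and can be verified wherever the pair is copied.
  ( cd "$dir" && sha256sum "$base" > "$base.sha256" )
}

# verify_checksum FILE: succeed only if FILE matches its .sha256 sidecar.
verify_checksum() {
  file=$1
  ( cd "$(dirname "$file")" && sha256sum -c "$(basename "$file").sha256" > /dev/null )
}
```

Both the model and its sidecar can then be uploaded with mc cp, and a consumer can run verify_checksum after downloading the pair.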

Inference servers then pull models from the in-cluster object store instead of re-downloading them from the internet on every node.
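
One way an inference pod might pull a model at startup is to check a local cache first and download only on a miss. The following is a sketch under stated assumptions: fetch_model and the MODEL_FETCH_CMD variable are hypothetical names, and the default command assumes an mc alias named localai has already been configured in the container.

```shell
# fetch_model SRC DEST: copy SRC to DEST unless DEST is already cached.
# MODEL_FETCH_CMD (hypothetical) names the copy command; it defaults to "mc cp".
fetch_model() {
  src=$1
  dest=$2
  if [ -s "$dest" ]; then
    echo "cache hit: $dest"
    return 0
  fi
  mkdir -p "$(dirname "$dest")"
  # Download to a temporary name first so an interrupted transfer never
  # leaves a partial file at the final path.
  if ${MODEL_FETCH_CMD:-mc cp} "$src" "$dest.partial"; then
    mv "$dest.partial" "$dest"
    echo "downloaded: $dest"
  else
    rm -f "$dest.partial"
    return 1
  fi
}
```

A startup script could call it as fetch_model localai/models/llama-3-8b.Q4_K_M.gguf /models/llama-3-8b.Q4_K_M.gguf before launching the server, where /models is an assumed local cache directory.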

Model Registry with HuggingFace Hub Integration

For teams using HuggingFace models, the HuggingFace Hub provides natural integration:

# Set HF token as Kubernetes secret
kubectl create secret generic hf-credentials \
  --from-literal=HF_TOKEN="$(cat ~/.cache/huggingface/token)"

# Use in inference deployment
apiVersion: v1
kind: Pod
metadata:
  name: inference-hf
spec:
  containers:
  - name: inference
    image: ghcr.io/ggml-org/llama.cpp:server  # assumed llama.cpp server image; substitute the image your cluster uses
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-credentials
          key: HF_TOKEN
    - name: MODEL_ID
      value: "meta-llama/Meta-Llama-3-8B"

This pattern centralizes authentication and enables model access auditing.
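
Because the token arrives as an environment variable, a container entrypoint can fail fast when the Secret was not injected. A minimal sketch, where require_env is a hypothetical helper and HF_TOKEN and MODEL_ID are the variables from the pod spec above:

```shell
# require_env NAME...: return non-zero and report each variable that is unset or empty.
# The values themselves are never printed, so the token does not leak into logs.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup via eval; names are assumed to be trusted identifiers.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

An entrypoint script might run require_env HF_TOKEN MODEL_ID || exit 1 before starting the inference server.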

Version Conventions

Coordinate model versions with semantic tagging or date-based versioning:

models/
  llama-3-8b/
    tags/
      latest -> v2
      v1 -> 2024-01-15/llama-3-8b.Q4_K_M.gguf
      v2 -> 2024-03-20/llama-3-8b.Q5_K_M.gguf
    2024-01-15/llama-3-8b.Q4_K_M.gguf
    2024-03-20/llama-3-8b.Q5_K_M.gguf

Inference configs specify the tagged path, enabling A/B testing and gradual rollouts.
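
Object stores have no symbolic links, so the arrows in the layout above have to be represented some other way. One possibility, assumed here purely for illustration, is to store each tag as a one-line text file whose content is either another tag name or a path relative to the model directory. The helper name resolve_tag is hypothetical, and the sketch works on a local copy of the layout:

```shell
# resolve_tag MODEL_DIR TAG: print the versioned path a tag points to.
# Assumes each tag is a one-line text file under MODEL_DIR/tags/ containing
# either another tag name or a path relative to MODEL_DIR.
resolve_tag() {
  model_dir=$1
  target=$2
  hops=0
  # Follow tag-to-tag links (e.g. latest -> v2) until a non-tag target appears.
  while [ -f "$model_dir/tags/$target" ]; do
    hops=$((hops + 1))
    if [ "$hops" -gt 10 ]; then
      echo "tag loop detected at: $target" >&2
      return 1
    fi
    target=$(cat "$model_dir/tags/$target")
  done
  if [ "$hops" -eq 0 ]; then
    echo "unknown tag: $2" >&2
    return 1
  fi
  echo "$target"
}
```

With this convention, promoting a new version means rewriting one small tag file, and a rollback means writing the previous value back.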

EXERCISE

Deploy MinIO, upload a small model artifact (even a text file will do), configure a Kubernetes pod to read from MinIO using a Secret for credentials, and verify that the pod can read the model data.