RUNLOCALAI · v38
HOW-TO · OPS

How to create a Kubernetes operator for managing AI model deployments

advanced · 1h · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Kubernetes cluster, kubebuilder or operator-sdk (the commands below use kubebuilder), plus Go, Docker, and kubectl on the workstation

What this does

This guide uses kubebuilder to build a Kubernetes operator that automates the lifecycle of AI model deployments. The operator watches a custom resource, ModelDeployment, and reconciles the cluster state to match it: when a new model is requested, the operator downloads weights, creates a PVC, deploys an inference server, and configures a Service. When the model is deprecated, the operator drains traffic and decommissions resources. This pattern reduces manual toil for MLOps teams managing dozens of models.
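
The reconcile idea described above (compare what exists with what is wanted, then do only the missing steps) can be sketched in plain Go with no cluster. This is a minimal illustration, not the operator's code: the observed struct, planActions, and the action names are hypothetical, and the step order simply follows the description above.

```go
package main

import "fmt"

// observed is a hypothetical snapshot of what already exists for one
// ModelDeployment. A real controller would fill it in from API reads.
type observed struct {
	WeightsDownloaded bool
	PVCExists         bool
	DeploymentExists  bool
	ServiceExists     bool
}

// planActions lists the steps still needed to reach the desired state.
// With deprecated=false it returns the missing creation steps; with
// deprecated=true it returns teardown steps for whatever still exists.
func planActions(o observed, deprecated bool) []string {
	actions := []string{}
	if deprecated {
		if o.ServiceExists {
			actions = append(actions, "drain-traffic", "delete-service")
		}
		if o.DeploymentExists {
			actions = append(actions, "delete-deployment")
		}
		if o.PVCExists {
			actions = append(actions, "delete-pvc")
		}
		return actions
	}
	if !o.WeightsDownloaded {
		actions = append(actions, "download-weights")
	}
	if !o.PVCExists {
		actions = append(actions, "create-pvc")
	}
	if !o.DeploymentExists {
		actions = append(actions, "create-deployment")
	}
	if !o.ServiceExists {
		actions = append(actions, "create-service")
	}
	return actions
}

func main() {
	// Weights and PVC are already in place, so only two steps remain.
	o := observed{WeightsDownloaded: true, PVCExists: true}
	fmt.Println(planActions(o, false)) // prints [create-deployment create-service]
}
```

Because the plan is computed from observed state each time, running it again after a partial failure only retries what is still missing, which is the property the reconcile loop relies on.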

Steps

  1. Scaffold the operator project:

    mkdir model-operator && cd model-operator
    kubebuilder init --domain example.com --repo github.com/example/model-operator
    
  2. Create the ModelDeployment API:

    kubebuilder create api --group ai --version v1alpha1 --kind ModelDeployment --resource --controller
    
  3. Define the ModelDeployment spec in api/v1alpha1/modeldeployment_types.go:

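    // ModelDeploymentSpec is the desired state, set by the user.
    type ModelDeploymentSpec struct {
        ModelName   string `json:"modelName"`
        ModelSource string `json:"modelSource"`           // HuggingFace repo or S3 URI
        GPUs        int    `json:"gpus,omitempty"`        // number of GPUs to request
        Replicas    int    `json:"replicas,omitempty"`    // number of inference server pods
        AutoScaling bool   `json:"autoScaling,omitempty"`
    }

    // ModelDeploymentStatus is the observed state, written by the controller.
    type ModelDeploymentStatus struct {
        Phase     string `json:"phase"`              // "" (pending), Downloading, Deploying, Running, Failed
        Endpoint  string `json:"endpoint,omitempty"` // address of the inference Service
        ReadyPods int    `json:"readyPods"`
    }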
    
  4. Run code generation:

    make generate && make manifests
    
  5. Implement the controller reconciliation logic in internal/controller/modeldeployment_controller.go. The reconcile loop handles four phases:

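    // Add "time" to this file's imports. ctrl.LoggerFrom is controller-runtime's
    // alias for log.FromContext; remove the scaffolded log import if it ends up unused.
    func (r *ModelDeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        logger := ctrl.LoggerFrom(ctx)

        var md aiv1alpha1.ModelDeployment
        if err := r.Get(ctx, req.NamespacedName, &md); err != nil {
            // The resource was deleted; nothing left to reconcile.
            return ctrl.Result{}, client.IgnoreNotFound(err)
        }

        switch md.Status.Phase {
        case "":
            // New resource: start by fetching the weights.
            md.Status.Phase = "Downloading"
        case "Downloading":
            if err := r.downloadWeights(ctx, &md); err != nil {
                logger.Error(err, "downloading weights failed, retrying")
                return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
            }
            md.Status.Phase = "Deploying"
        case "Deploying":
            if err := r.createDeployment(ctx, &md); err != nil {
                logger.Error(err, "creating Deployment failed, retrying")
                return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
            }
            md.Status.Phase = "Running"
        case "Running":
            // updateStatus is expected to refresh and persist ReadyPods and Endpoint.
            r.updateStatus(ctx, &md)
            return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
        }

        // Persist the phase change; returning the error lets controller-runtime retry.
        if err := r.Status().Update(ctx, &md); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }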
    
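  6. Implement the createDeployment method to generate a Kubernetes Deployment with GPU affinity and the appropriate vLLM args. Use controllerutil.CreateOrUpdate from controller-runtime so that repeated reconciles are idempotent, and set an owner reference on the Deployment with controllerutil.SetControllerReference so it is garbage-collected together with its ModelDeployment. The controller also needs RBAC permissions on Deployments in the apps group; kubebuilder generates these from +kubebuilder:rbac markers above Reconcile when you rerun make manifests.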

  7. Build and push the operator image:

    make docker-build docker-push IMG=registry.example.com/model-operator:v0.1.0
    
  8. Deploy the operator to the cluster:

    make deploy IMG=registry.example.com/model-operator:v0.1.0
    kubectl get pods -n model-operator-system
    
  9. Create a ModelDeployment custom resource to test:

    apiVersion: ai.example.com/v1alpha1
    kind: ModelDeployment
    metadata:
      name: llama-3-8b
    spec:
      modelName: "Llama-3-8B"
      modelSource: "huggingface://meta-llama/Meta-Llama-3-8B-Instruct"
      gpus: 1
      replicas: 1
    

    Save this as model.yaml, then apply it: kubectl apply -f model.yaml.

  10. Watch the operator reconcile:

    kubectl get modeldeployment llama-3-8b -w
    

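    Expected result: the phase starts empty (pending), then moves to Downloading, Deploying, and Running. By default kubectl get prints only NAME and AGE for a custom resource; to see the phase in this output, add a +kubebuilder:printcolumn marker for .status.phase to the ModelDeployment type and rerun make manifests, or query .status.phase directly as in the Verification section.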
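
Step 6 leaves createDeployment to the reader. As a starting point, the pure part of that method (turning a spec into container arguments and a GPU limit) might look like the sketch below. Assumptions: modelSpec is a trimmed-down local copy of ModelDeploymentSpec, all helper names are hypothetical, the huggingface:// prefix follows the sample resource in step 9, the flag names are those of vLLM's OpenAI-compatible server and should be checked against the vLLM version in use, and the nvidia.com/gpu resource name assumes the NVIDIA device plugin is installed.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// modelSpec is a hypothetical local copy of the spec fields used here.
type modelSpec struct {
	ModelSource string
	GPUs        int
	Replicas    int
}

// vllmArgs builds container arguments for the inference server.
// Tensor parallelism is only requested when more than one GPU is asked for.
func vllmArgs(s modelSpec, port int) []string {
	model := strings.TrimPrefix(s.ModelSource, "huggingface://")
	args := []string{"--model", model, "--port", strconv.Itoa(port)}
	if s.GPUs > 1 {
		args = append(args, "--tensor-parallel-size", strconv.Itoa(s.GPUs))
	}
	return args
}

// gpuLimits returns the resource limits to set on the container.
// An empty map means the pod requests no GPU at all.
func gpuLimits(s modelSpec) map[string]string {
	limits := map[string]string{}
	if s.GPUs > 0 {
		limits["nvidia.com/gpu"] = strconv.Itoa(s.GPUs)
	}
	return limits
}

// replicasOrDefault falls back to one replica when the field is omitted.
func replicasOrDefault(s modelSpec) int {
	if s.Replicas <= 0 {
		return 1
	}
	return s.Replicas
}

func main() {
	s := modelSpec{
		ModelSource: "huggingface://meta-llama/Meta-Llama-3-8B-Instruct",
		GPUs:        1,
		Replicas:    1,
	}
	fmt.Println(vllmArgs(s, 8000))
	fmt.Println(gpuLimits(s), replicasOrDefault(s))
}
```

In a real createDeployment these values would be copied into the container's args, resource limits, and the Deployment's replica count inside the mutate function passed to CreateOrUpdate. Keeping them in pure functions makes them easy to unit test.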

Verification

kubectl get modeldeployment llama-3-8b -o json | jq '.status.phase'

Expected output: "Running".

Common failures

  • Controller crash-loops on API version mismatch — ensure the CRD YAML is regenerated after schema changes: make manifests && make install.
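  • Model download fails silently: the weights downloader container needs network access to HuggingFace or S3. Check the operator logs. With the default kubebuilder name prefix the Deployment is called model-operator-controller-manager: kubectl logs -n model-operator-system deployment/model-operator-controller-manager.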
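  • Phase stuck at "Deploying": the generated Deployment's pods may fail scheduling because no GPU is available. Scheduling errors appear as events on the pods, so check them first, then the Deployment (both commands assume createDeployment labels its resources with modeldeployment=<name>): kubectl describe pod -l modeldeployment=llama-3-8b and kubectl describe deployment -l modeldeployment=llama-3-8b.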
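
One way to make the download failure above less silent is to validate modelSource before any download starts and surface a clear error. A minimal sketch, assuming the huggingface:// and s3:// schemes suggested by the spec comment and the sample resource. parseModelSource and the modelSource struct are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// modelSource is a hypothetical parsed form of the spec's modelSource field.
type modelSource struct {
	Kind string // "huggingface" or "s3"
	Path string // repo ID such as "org/name", or "bucket/key" for S3
}

// parseModelSource rejects anything that is not a huggingface:// or s3://
// URI with a non-empty path, so a typo fails fast with a readable message.
func parseModelSource(raw string) (modelSource, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return modelSource{}, fmt.Errorf("invalid modelSource %q: %w", raw, err)
	}
	// For "scheme://a/b", url.Parse puts "a" in Host and "/b" in Path.
	path := strings.Trim(u.Host+u.Path, "/")
	switch u.Scheme {
	case "huggingface", "s3":
		if path == "" {
			return modelSource{}, fmt.Errorf("modelSource %q has no repo or bucket path", raw)
		}
		return modelSource{Kind: u.Scheme, Path: path}, nil
	default:
		return modelSource{}, fmt.Errorf("unsupported modelSource scheme %q in %q", u.Scheme, raw)
	}
}

func main() {
	inputs := []string{
		"huggingface://meta-llama/Meta-Llama-3-8B-Instruct",
		"s3://my-bucket/models/llama",
		"meta-llama/Meta-Llama-3-8B-Instruct", // missing scheme
	}
	for _, in := range inputs {
		src, err := parseModelSource(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s %s\n", src.Kind, src.Path)
	}
}
```

The controller could call this at the start of the Downloading phase, log the error, and set the phase to Failed instead of requeueing forever on an input that can never succeed.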

Related guides

  • Manage AI model weights with Kubernetes Persistent Volumes
  • Horizontal pod autoscaling for AI inference services
  • Deploy vLLM on Kubernetes with GPU node selection