How to create a Kubernetes operator for managing AI model deployments
Prerequisites: a Kubernetes cluster, and kubebuilder or operator-sdk (this guide uses kubebuilder)
What this does
This guide builds a Kubernetes operator with kubebuilder that automates the lifecycle of AI model deployments. The operator watches a custom resource, ModelDeployment, and reconciles the cluster state: when a new model is requested, the operator downloads weights, creates a PVC, deploys an inference server, and configures a Service. When the model is deprecated, the operator drains traffic and decommissions resources. This pattern removes most of the manual toil for MLOps teams managing dozens of models.
Steps
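1. Scaffold the operator project. The domain is example.com so that the ai group created in the next step resolves to the API group ai.example.com used by the sample resource (kubebuilder joins group and domain as group.domain):

       mkdir model-operator && cd model-operator
       kubebuilder init --domain example.com --repo github.com/example/model-operator

2. Create the ModelDeployment API:

       kubebuilder create api --group ai --version v1alpha1 --kind ModelDeployment --resource --controller

3. Define the ModelDeployment spec in api/v1alpha1/modeldeployment_types.go:

       // ModelDeploymentSpec is the desired state of a model deployment.
       type ModelDeploymentSpec struct {
           ModelName   string `json:"modelName"`
           ModelSource string `json:"modelSource"` // Hugging Face repo or S3 URI
           GPUs        int    `json:"gpus,omitempty"`
           Replicas    int    `json:"replicas,omitempty"`
           AutoScaling bool   `json:"autoScaling,omitempty"`
       }

       // ModelDeploymentStatus is the state observed by the operator.
       type ModelDeploymentStatus struct {
           Phase     string `json:"phase"` // Pending, Downloading, Deploying, Running, Failed
           Endpoint  string `json:"endpoint,omitempty"`
           ReadyPods int    `json:"readyPods"`
       }

4. Run code generation:

       make generate && make manifests

5. Implement the controller reconciliation logic in internal/controller/modeldeployment_controller.go. The reconcile loop handles four phases:

       import (
           "context"
           "time"

           ctrl "sigs.k8s.io/controller-runtime"
           "sigs.k8s.io/controller-runtime/pkg/client"
           "sigs.k8s.io/controller-runtime/pkg/log"

           aiv1alpha1 "github.com/example/model-operator/api/v1alpha1"
       )

       func (r *ModelDeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
           logger := log.FromContext(ctx)

           var md aiv1alpha1.ModelDeployment
           if err := r.Get(ctx, req.NamespacedName, &md); err != nil {
               // A NotFound error means the resource was deleted; nothing to do.
               return ctrl.Result{}, client.IgnoreNotFound(err)
           }

           switch md.Status.Phase {
           case "":
               // New resource: start the lifecycle.
               md.Status.Phase = "Downloading"
           case "Downloading":
               if err := r.downloadWeights(ctx, &md); err != nil {
                   logger.Error(err, "downloading weights failed, retrying")
                   return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
               }
               md.Status.Phase = "Deploying"
           case "Deploying":
               if err := r.createDeployment(ctx, &md); err != nil {
                   logger.Error(err, "creating deployment failed, retrying")
                   return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
               }
               md.Status.Phase = "Running"
           case "Running":
               // updateStatus (left to implement) refreshes and persists ReadyPods and Endpoint.
               r.updateStatus(ctx, &md)
               return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
           default:
               // Unknown or terminal phase: re-check later without changing anything.
               return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
           }

           // Persist the phase transition. Returning the error lets
           // controller-runtime retry with backoff instead of dropping it.
           if err := r.Status().Update(ctx, &md); err != nil {
               return ctrl.Result{}, err
           }
           return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
       }

6. Implement the createDeployment method to generate a Kubernetes Deployment with GPU affinity and the appropriate vLLM args. Use controller-runtime's controllerutil.CreateOrUpdate so that repeated reconciles manage the resource idempotently.

7. Build and push the operator image:

       make docker-build docker-push IMG=registry.example.com/model-operator:v0.1.0

8. Deploy the operator to the cluster:

       make deploy IMG=registry.example.com/model-operator:v0.1.0
       kubectl get pods -n model-operator-system

9. Create a ModelDeployment custom resource to test. Save it as model.yaml:

       apiVersion: ai.example.com/v1alpha1
       kind: ModelDeployment
       metadata:
         name: llama-3-8b
       spec:
         modelName: "Llama-3-8B"
         modelSource: "huggingface://meta-llama/Meta-Llama-3-8B-Instruct"
         gpus: 1
         replicas: 1

   Apply it:

       kubectl apply -f model.yaml

10. Watch the operator reconcile:

        kubectl get modeldeployment llama-3-8b -w

    Expected output: the phase starts empty (Pending), then moves to Downloading, Deploying, and Running. The default table shows the phase only if the type declares a +kubebuilder:printcolumn marker for .status.phase; otherwise add -o yaml to the command.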
Verification
kubectl get modeldeployment llama-3-8b -o json | jq '.status.phase'
Expected output: "Running".
Common failures
- Controller crash-loops on API version mismatch: ensure the CRD YAML is regenerated and reinstalled after schema changes:

      make manifests && make install

- Model download fails silently: the weights downloader container needs network access to Hugging Face or S3. Check the operator logs (kubebuilder's default kustomize configuration prefixes the manager Deployment with the project name):

      kubectl logs -n model-operator-system deployment/model-operator-controller-manager

- Phase stuck at "Deploying": the generated Deployment may fail scheduling due to GPU unavailability. Check the Deployment status:

      kubectl describe deployment -l modeldeployment=llama-3-8b
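The last command selects by the label modeldeployment=llama-3-8b, so it only matches if createDeployment puts that label on the Deployment it generates. A minimal sketch of keeping the label and the selector in one place follows; `ownerLabelKey`, `labelsFor`, and `selectorFor` are hypothetical names, not part of kubebuilder's scaffolding.

```go
package main

import "fmt"

// ownerLabelKey is the label key the troubleshooting command selects on.
const ownerLabelKey = "modeldeployment"

// labelsFor returns the labels to set on every object generated for the
// ModelDeployment with the given name (hypothetical helper).
func labelsFor(mdName string) map[string]string {
	return map[string]string{ownerLabelKey: mdName}
}

// selectorFor returns the matching kubectl label selector string.
func selectorFor(mdName string) string {
	return ownerLabelKey + "=" + mdName
}

func main() {
	fmt.Println(labelsFor("llama-3-8b"))
	fmt.Println("kubectl describe deployment -l " + selectorFor("llama-3-8b"))
}
```

Using the same map for the Deployment's metadata labels and its pod template labels would also let the Service select the inference pods with the same key.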