What this does

Model versioning and rollback provide a safety net for AI model deployments. Each model version (quantized weights, LoRA adapters, or full fine-tunes) is stored as an immutable artifact with a unique version identifier. A deployment registry points to the active version for each environment. When a newly deployed version proves problematic (degraded output quality, increased latency, or errors), the rollback mechanism reverts the active pointer to the previous known-good version within seconds, without rebuilding containers or restarting infrastructure.

Steps

Define the version naming convention. Use semantic versioning with a model identifier, for example llama3-instruct-v2.1.0-q4km. The format is {model_name}-v{major}.{minor}.{patch}-{quantization}.
Store artifacts in a structured path. Use s3://models/{model_name}/v{major}.{minor}.{patch}/{filename}.gguf. For each new version, upload the model file and a metadata JSON: {"version": "2.1.0", "base_model": "llama3-instruct", "quantization": "q4km", "benchmark_score": 0.87, "created_at": "2026-05-29T10:00:00Z", "parent_version": "2.0.0"}.
Create a deployment registry. Use a simple JSON file or database table that maps environments to active versions: { "production": {"model": "llama3-instruct", "version": "2.0.0"}, "staging": {"model": "llama3-instruct", "version": "2.1.0"} }.
Implement a model loader on the inference server. The loader reads the active version from the registry and downloads or loads the corresponding artifact. For a Kubernetes deployment, store the active version in a ConfigMap: kubectl create configmap model-registry --from-literal=active_version=2.0.0. The inference pod reads this on startup and pulls the correct model.
Implement the deployment script. def deploy(model_name, version, env): download_model(model_name, version); validate_model(model_name, version); run_smoke_tests(model_name, version); update_registry(env, model_name, version). The validate_model step runs a set of benchmark prompts and compares metrics (perplexity, latency, accuracy on a test set) against the previous version. If any metric degrades by more than 5%, abort the deployment before update_registry runs.
Implement rollback. def rollback(env): history = get_deployment_history(env); previous = history[-2]; update_registry(env, previous.model_name, previous.version); reload_model_server(). Indexing history[-2] assumes the environment has at least two recorded deployments and that the history excludes entries already rolled back, so check both before reverting. The rollback must complete in under 30 seconds.
Maintain a deployment history table. CREATE TABLE deployments (id SERIAL, env TEXT, model_name TEXT, version TEXT, deployed_at TIMESTAMP, rolled_back BOOLEAN DEFAULT FALSE, metrics JSONB);
Store performance metrics with each deployment for trend analysis.
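
The deploy gate and rollback steps above can be sketched in memory as follows. This is a minimal illustration, not the guide's implementation: Deployment, Registry, and HIGHER_IS_BETTER are hypothetical names, the metric keys are made up, and the active version is derived from the history list rather than read from a separate registry file or the deployments table.

```python
from dataclasses import dataclass, field

MAX_DEGRADATION = 0.05  # abort if a metric is more than 5% worse than the active version

# Assumed metric directions: True means a higher value is better.
HIGHER_IS_BETTER = {"accuracy": True, "latency_ms": False, "perplexity": False}


@dataclass
class Deployment:
    model_name: str
    version: str
    metrics: dict              # e.g. {"accuracy": 0.87, "latency_ms": 120.0}
    rolled_back: bool = False  # mirrors the rolled_back column


def version_id(model_name: str, version: str, quantization: str) -> str:
    """Build {model_name}-v{major}.{minor}.{patch}-{quantization}."""
    return f"{model_name}-v{version}-{quantization}"


def degraded(old: float, new: float, higher_is_better: bool) -> bool:
    """True if `new` is more than MAX_DEGRADATION worse than `old`."""
    if old == 0:
        return False  # relative change is undefined against a zero baseline
    change = (new - old) / abs(old)
    return change < -MAX_DEGRADATION if higher_is_better else change > MAX_DEGRADATION


@dataclass
class Registry:
    history: dict = field(default_factory=dict)  # env -> list of Deployment

    def _live(self, env):
        # Deployments that have not been rolled back, oldest first.
        return [d for d in self.history.get(env, []) if not d.rolled_back]

    def active(self, env):
        live = self._live(env)
        return live[-1] if live else None

    def deploy(self, env, candidate):
        current = self.active(env)
        if current is not None:
            for name, better_high in HIGHER_IS_BETTER.items():
                if name in current.metrics and name in candidate.metrics:
                    if degraded(current.metrics[name], candidate.metrics[name], better_high):
                        raise ValueError(f"{name} degraded by more than 5%; deployment aborted")
        self.history.setdefault(env, []).append(candidate)

    def rollback(self, env):
        live = self._live(env)
        if len(live) < 2:
            raise RuntimeError(f"no previous version to roll back to in {env}")
        live[-1].rolled_back = True  # keep the record, only flag it
        return live[-2]


registry = Registry()
registry.deploy("production", Deployment("llama3-instruct", "2.0.0", {"accuracy": 0.85, "latency_ms": 100.0}))
registry.deploy("production", Deployment("llama3-instruct", "2.1.0", {"accuracy": 0.87, "latency_ms": 102.0}))
print(registry.rollback("production").version)  # 2.0.0
```

Flagging the reverted entry instead of deleting it keeps the audit trail intact and means a second rollback does not select the version that was just reverted.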

Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
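
One possible way to record run evidence is a small wrapper that executes a command and writes the details to a JSON file. The function name record_run, the output file name, and the field names are assumptions for illustration only.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def record_run(command, model_name=None, out_path="run_evidence.json"):
    """Run `command` (a list of arguments) and save what is needed to reproduce it."""
    result = subprocess.run(command, capture_output=True, text=True)
    evidence = {
        "command": command,
        "python_executable": sys.executable,        # active interpreter path
        "python_version": platform.python_version(),
        "model_name": model_name,
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(evidence, fh, indent=2)
    return evidence
```

Calling record_run([sys.executable, "-c", "print('ok')"], model_name="llama3-instruct-v2.0.0-q4km") would capture the interpreter, exit code, and output of that one command.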

Verification

Deploy v2.0.0 and verify the inference server loads it: check the model info endpoint returns the correct version. Deploy v2.1.0 and verify the server switches to the new version. Run rollback to v2.0.0 and verify the server reverts within 30 seconds by checking the model info endpoint again. Test the validation gate: create a deliberately bad model that fails benchmark thresholds and verify deployment is aborted. Check the deployment history table for correct records of all deploy and rollback events.
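
The 30-second rollback check can be automated by polling until the server reports the expected version. In this sketch, wait_for_version is a hypothetical helper and get_version stands in for whatever call reads the model info endpoint; the stub below replaces a real HTTP request.

```python
import time


def wait_for_version(get_version, expected, timeout_s=30.0, interval_s=0.5):
    """Poll get_version() until it returns `expected`; return elapsed seconds.

    Raises TimeoutError if the server has not switched within timeout_s.
    """
    start = time.monotonic()
    while True:
        if get_version() == expected:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"server did not report {expected} within {timeout_s}s")
        time.sleep(interval_s)


# Stub standing in for the model info endpoint: two stale answers, then the rollback target.
responses = iter(["2.1.0", "2.1.0", "2.0.0"])
elapsed = wait_for_version(lambda: next(responses), "2.0.0", timeout_s=30.0, interval_s=0.01)
print(f"rollback observed after {elapsed:.2f}s")
```

Start the timer immediately after issuing the rollback so the measured time covers the registry update and the model reload.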

Common failures

Rollback references deleted artifact - Never delete old model artifacts from storage; only mark them as superseded.
ConfigMap update not picked up by running pods - A ConfigMap value injected as an environment variable is read once at pod start, so the inference pod must either re-read a mounted ConfigMap file (on each health check or with a file watcher) or be restarted with kubectl rollout restart deployment/inference.
Concurrent deployments corrupt registry state - Use optimistic locking or a database transaction with SELECT ... FOR UPDATE when updating the active version.
Model download fails mid-deployment leaving server in broken state - Download to a staging path first (/models/staging/), then atomically move or symlink to the active path.
Performance metrics not comparable between quantizations - Normalize metrics by quantization level or compare only within the same quantization family.

Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
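
For the mid-deployment download failure listed above, the symlink variant of the atomic switch might look like the sketch below. It assumes a POSIX filesystem, a hypothetical layout with one directory per version, and an "active" symlink next to them; the activate function is illustrative, not part of any library.

```python
import os
import tempfile


def activate(models_dir, version_dirname):
    """Atomically point models_dir/active at models_dir/version_dirname."""
    target = os.path.join(models_dir, version_dirname)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)  # never switch to a missing artifact
    active = os.path.join(models_dir, "active")
    tmp_link = os.path.join(models_dir, f".active.tmp.{os.getpid()}")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)              # clear a leftover from a crashed run
    os.symlink(version_dirname, tmp_link)  # relative link inside models_dir
    os.replace(tmp_link, active)           # rename over the old link in one step


root = tempfile.mkdtemp()
for v in ("v2.0.0", "v2.1.0"):
    os.mkdir(os.path.join(root, v))
activate(root, "v2.1.0")
activate(root, "v2.0.0")  # rollback is the same operation
print(os.readlink(os.path.join(root, "active")))  # v2.0.0
```

Because the new link is created under a temporary name and then renamed over the old one, readers of the active path see either the old version or the new one, never a missing link.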

Related guides

deploy-production-pd-hyperspace
implement-ab-testing-model-responses
setup-auto-scaling-llm-inference