06. Model Versioning
Model versioning goes beyond the registry's version numbers. True versioning captures the complete lineage: which data, code, and configuration produced each model. This enables root cause analysis when models misbehave and ensures reproducibility when regulatory requirements demand it.
Git captures code version. DVC (Data Version Control) extends this to data. Together, they track the complete experiment context.
# Initialize DVC
dvc init
# Track a dataset
dvc add data/training-data.csv
git add data/training-data.csv.dvc .gitignore
git commit -m "Add training dataset v2.1"
# Create a reproducible training pipeline
dvc run -n train \
-d data/training-data.csv \
-d src/train.py \
-o models/v1/model.pkl \
python src/train.py --data data/training-data.csv
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Initialize a git repository for an ML project. Commit your training script. Run training and record the git commit hash in MLflow. Modify the script, commit, and train again. Use MLflow UI to compare both runs and verify lineage is clear.