MLOps

MLOps (Machine Learning Operations) is the practice of managing the lifecycle of machine learning models from development to production deployment and monitoring. For operators running local AI, MLOps involves tasks like versioning model weights, automating quantized model builds (e.g., converting a Hugging Face model to GGUF with llama.cpp), and tracking performance across hardware configurations. It ensures that models can be reliably reproduced, updated, and monitored for drift or degradation. Key tools include DVC for data/model versioning, MLflow for experiment tracking, and GitHub Actions for CI/CD pipelines that rebuild quantized models when source weights change.
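One small building block behind "reliably reproduced" is recording exactly which bytes were deployed. The sketch below is only an illustration: it hashes a weights file and builds a manifest entry that could be committed next to the pipeline code. The function names and manifest field names are hypothetical, not part of DVC, MLflow, or llama.cpp.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_entry(path: Path, source_repo: str, quant: str) -> dict:
    """Describe one artifact: where it came from and exactly which bytes it is.

    The field names are illustrative assumptions, not a standard schema.
    """
    return {
        "file": path.name,
        "source_repo": source_repo,
        "quant": quant,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_file(path),
    }
```

Comparing the stored hash with the hash of the file actually being served is a cheap way to detect that a model was silently replaced or corrupted.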

Deeper dive

MLOps extends DevOps principles to ML workflows. The core loop includes: data preparation (cleaning, labeling), model training (often on remote GPUs), evaluation, packaging (e.g., quantizing to Q4_K_M GGUF), deployment (e.g., serving via Ollama or vLLM), and monitoring (tracking inference latency, token throughput, output quality). For local AI operators, MLOps often means automating the pipeline that takes a new model release from Hugging Face, runs quantization scripts, benchmarks on target hardware, and pushes the artifact to a local registry. Without MLOps, operators manually track which quantization level works on which GPU, leading to wasted VRAM or broken configs. Common pitfalls include version skew between training code and inference runtime, and unmonitored model drift when fine-tuned models degrade on new data.
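The "which quantization works on which GPU" problem can be handled with recorded measurements rather than memory. The following is a minimal sketch, assuming the operator already stores one benchmark record per GPU and quant; the `BenchRecord` type, its fields, and `pick_quant` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class BenchRecord:
    gpu: str
    quant: str
    peak_vram_gib: float   # measured on the target machine
    tokens_per_sec: float  # measured generation throughput


def pick_quant(
    records: Iterable[BenchRecord],
    gpu: str,
    vram_budget_gib: float,
    min_tokens_per_sec: float,
    preference: Sequence[str],
) -> Optional[str]:
    """Return the first quant in `preference` that fits the VRAM budget
    and meets the throughput floor on `gpu`, or None if none qualifies."""
    by_quant = {r.quant: r for r in records if r.gpu == gpu}
    for quant in preference:
        rec = by_quant.get(quant)
        if rec is None:
            continue  # never benchmarked on this GPU, so do not guess
        if rec.peak_vram_gib <= vram_budget_gib and rec.tokens_per_sec >= min_tokens_per_sec:
            return quant
    return None
```

Listing `preference` from highest to lowest quality (for example Q8_0 before Q4_K_M) makes the function prefer the best quant that still fits, and returning None for unbenchmarked combinations keeps untested configurations out of deployment.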

Practical example

An operator maintains a set of GGUF models for an RTX 3090. They set up a GitHub Actions workflow that runs when the source model on Hugging Face changes, for example via a scheduled check or a webhook, and executes on a self-hosted runner attached to the GPU. The workflow runs llama.cpp's convert_hf_to_gguf.py to produce a full-precision GGUF, quantizes it to Q4_K_M and Q8_0 with llama-quantize, benchmarks both with llama-bench, and uploads the best-performing quant to a local S3-compatible bucket. A separate workflow tests that the new model loads in Ollama without error. This makes it much less likely that the operator deploys a broken quant or misses an upstream update.
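The build steps in this example can be sketched as a function that only assembles the command lines, so they can be reviewed or logged before anything runs. This assumes a recent llama.cpp checkout, where conversion is done by convert_hf_to_gguf.py and quantization is a separate llama-quantize step, and it assumes llama-quantize and llama-bench are on PATH. The directory layout, output file names, and `build_commands` itself are hypothetical.

```python
from pathlib import Path


def build_commands(
    llama_cpp_dir: Path,
    hf_model_dir: Path,
    out_dir: Path,
    quants: list[str],
) -> list[list[str]]:
    """Return argv lists for convert -> quantize -> benchmark, without running them."""
    f16 = out_dir / "model-f16.gguf"  # hypothetical naming scheme
    commands = [[
        "python", str(llama_cpp_dir / "convert_hf_to_gguf.py"), str(hf_model_dir),
        "--outfile", str(f16), "--outtype", "f16",
    ]]
    for quant in quants:
        quantized = out_dir / f"model-{quant}.gguf"
        # llama-quantize takes: input GGUF, output GGUF, quantization type
        commands.append(["llama-quantize", str(f16), str(quantized), quant])
        # Same benchmark shape for every quant so the results are comparable
        commands.append(["llama-bench", "-m", str(quantized), "-p", "512", "-n", "256"])
    return commands
```

In a CI job, each returned list could be passed to `subprocess.run(cmd, check=True)` so that a failed conversion or quantization stops the workflow before anything is uploaded.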

Workflow example

Suppose an operator started with ollama pull llama3.1:8b and later wants to move to a newer fine-tune. With MLOps, they would: (1) clone the model repo, (2) convert it to GGUF and quantize it with llama.cpp's llama-quantize, using a script that logs the parameters to MLflow, (3) benchmark with llama-bench -m model.gguf -p 512 -n 256 on their RTX 4060, and (4) if tokens/sec is acceptable, push the new GGUF to a model registry and update the Ollama Modelfile. Because each step is scripted and logged, the build is reproducible and the previous artifact remains available for rollback.
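The acceptance check in step (4) could look roughly like the sketch below. It assumes llama-bench was run with JSON output (`-o json`) and that each result row carries `n_prompt`, `n_gen`, and `avg_ts` (average tokens per second) fields; verify those field names against your llama.cpp build. The threshold, file path, and function names are hypothetical.

```python
import json
from typing import Optional


def generation_tps(bench_json: str) -> float:
    """Return tokens/sec for the text-generation row of llama-bench JSON output.

    Assumes rows are objects with n_prompt, n_gen and avg_ts keys.
    """
    rows = json.loads(bench_json)
    gen_rows = [r for r in rows if r.get("n_gen", 0) > 0 and r.get("n_prompt", 0) == 0]
    if not gen_rows:
        raise ValueError("no text-generation row found in benchmark output")
    return float(gen_rows[0]["avg_ts"])


def modelfile_for(gguf_path: str) -> str:
    """Minimal Ollama Modelfile that points at a local GGUF file."""
    return f"FROM {gguf_path}\n"


def accept_or_reject(bench_json: str, min_tps: float, gguf_path: str) -> Optional[str]:
    """Return Modelfile text if throughput meets the floor, otherwise None."""
    if generation_tps(bench_json) >= min_tps:
        return modelfile_for(gguf_path)
    return None
```

If the function returns text, the workflow can write it to a Modelfile and run `ollama create <name> -f Modelfile`; if it returns None, the previous model stays in place, which is the rollback path.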

Reviewed by Fredoline Eruo.
