02. Experiment Tracking

Chapter 2 of 24 · 15 min

KEY INSIGHT

Parameters and metrics are the visible layer. The invisible layer is the computational graph: your code, dependencies, and data. Capture enough context to reproduce a run without depending on institutional memory.

Parameters flow in both directions. Hyperparameters (learning rate, batch size) are inputs you control. Learned parameters (weights) are outputs you measure. But the real value emerges when you track everything that shapes a run: data paths, feature flags, random seeds, hardware configurations. Changing only the random seed can produce noticeably different results, so an unrecorded seed makes a run hard to reproduce.

Artifacts are the outputs worth keeping: model binaries, processed datasets, visualizations, serialized preprocessors. MLflow and similar tools store each run's artifacts in a designated location tied to that run, which makes retrieval predictable.

```python
# Minimal experiment tracking with MLflow
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Assumes X_train, X_test, y_train, y_test are already-prepared
# feature/label splits (for example from train_test_split).

mlflow.set_experiment("spam-classifier-v2")

with mlflow.start_run(run_name="baseline-rf"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("random_seed", 42)

    # Train
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate and log metrics
    preds = model.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    mlflow.log_metric("accuracy", accuracy)

    # Log model artifact
    mlflow.sklearn.log_model(model, "model")
```

This pattern (log params, train, evaluate, log metrics, save model) forms the foundation of every experiment tracking workflow.
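The "invisible layer" described above (data identity, seeds, feature flags, environment) can be captured with a few lines of standard-library Python. The following is a minimal sketch, not a prescribed method: the function names `file_sha256`, `capture_context`, and `save_context`, and the choice of fields, are assumptions made for illustration.

```python
import hashlib
import json
import platform
import random
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_context(data_path, seed, feature_flags=None):
    """Collect run context that is easy to forget (hypothetical field set)."""
    # Seeds only Python's built-in RNG; NumPy, scikit-learn, etc. need their own seeding.
    random.seed(seed)
    return {
        "data_path": str(data_path),
        "data_sha256": file_sha256(data_path),  # identifies the exact data used
        "random_seed": seed,
        "feature_flags": dict(feature_flags or {}),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }


def save_context(context, out_path):
    """Write the context as sorted, indented JSON so it diffs cleanly between runs."""
    Path(out_path).write_text(json.dumps(context, indent=2, sort_keys=True))
```

Hashing the data file, rather than recording only its path, means a silently overwritten dataset shows up as a changed digest. The resulting dictionary could then be stored with the run, for example through MLflow's `mlflow.log_dict`.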

Experiment tracking captures the context of machine learning development. Without it, you're flying blind—unable to compare runs, reproduce successes, or diagnose failures. Every training run is an experiment, and experiments need logs.

The fundamental unit is the run: a single execution of training code that produces metrics, artifacts, and metadata. A run captures what you trained (parameters), how well it trained (metrics), what it produced (model artifacts), and the context (data version, environment). Later, you can query runs to find the best-performing model for a given scenario.
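To make the shape of a run concrete, here is a small sketch of a run record and a query over a list of runs. It is a plain-Python illustration, not MLflow's data model: the `Run` class, the `best_run` helper, and all run IDs, paths, and metric values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One execution of training code and what it recorded."""
    run_id: str
    params: dict = field(default_factory=dict)     # what you trained
    metrics: dict = field(default_factory=dict)    # how well it trained
    artifacts: dict = field(default_factory=dict)  # what it produced (name -> path)
    context: dict = field(default_factory=dict)    # data version, environment


def best_run(runs, metric, higher_is_better=True, **context_filter):
    """Return the best run by `metric` among runs matching every context filter."""
    candidates = [
        run for run in runs
        if metric in run.metrics
        and all(run.context.get(key) == value for key, value in context_filter.items())
    ]
    if not candidates:
        return None
    pick = max if higher_is_better else min
    return pick(candidates, key=lambda run: run.metrics[metric])


# Hypothetical runs with made-up values, purely for illustration.
runs = [
    Run("run-a", {"max_depth": 10}, {"accuracy": 0.91}, {"model": "runs/a/model.pkl"}, {"data_version": "v1"}),
    Run("run-b", {"max_depth": 20}, {"accuracy": 0.94}, {"model": "runs/b/model.pkl"}, {"data_version": "v2"}),
    Run("run-c", {"max_depth": 5}, {"accuracy": 0.89}, {"model": "runs/c/model.pkl"}, {"data_version": "v1"}),
]

winner = best_run(runs, "accuracy", data_version="v1")
print(winner.run_id)  # run-a: the best accuracy among runs trained on data version v1
```

Filtering on context before ranking is what "best-performing model for a given scenario" amounts to here: `run-b` has the highest accuracy overall, but it is excluded because it was trained on a different data version.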

Metrics are the backbone of comparison. Track loss curves, accuracy curves, and custom business metrics. The trap is tracking too many metrics without understanding what matters. Define your primary metric before training—it's your optimization target. Secondary metrics are for context and debugging, not decision-making.
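The distinction between a primary metric and secondary metrics can be sketched as a small logger that stores each metric as a curve and makes decisions from the primary metric only. This is an illustrative sketch: the `MetricLog` class, the metric names, and the values are assumptions, not part of any library.

```python
from collections import defaultdict


class MetricLog:
    """Stores each metric as a list of (step, value) points, i.e. a curve."""

    def __init__(self, primary_metric, higher_is_better=True):
        # The primary metric is fixed up front, before any values are logged.
        self.primary_metric = primary_metric
        self.higher_is_better = higher_is_better
        self.history = defaultdict(list)

    def log(self, name, value, step):
        self.history[name].append((step, value))

    def best_step(self):
        """Return the (step, value) point where the primary metric was best."""
        points = self.history[self.primary_metric]
        if not points:
            raise ValueError(f"no values logged for {self.primary_metric!r}")
        pick = max if self.higher_is_better else min
        return pick(points, key=lambda point: point[1])

    def summary(self):
        """Choose the step by the primary metric; report secondary values for context."""
        step, value = self.best_step()
        secondary = {
            name: dict(points).get(step)
            for name, points in self.history.items()
            if name != self.primary_metric
        }
        return {"step": step, self.primary_metric: value, "secondary": secondary}


# Made-up values for illustration: (val_loss, accuracy) per step.
log = MetricLog(primary_metric="val_loss", higher_is_better=False)
for step, (val_loss, accuracy) in enumerate([(0.62, 0.71), (0.48, 0.80), (0.51, 0.82)]):
    log.log("val_loss", val_loss, step)
    log.log("accuracy", accuracy, step)

print(log.summary())
```

In this example accuracy peaks at the last step, but the summary selects step 1 because the primary metric (`val_loss`) is lowest there; accuracy is reported alongside it without influencing the choice. With MLflow, the same per-step curves could be recorded by passing `step=` to `mlflow.log_metric`.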

EXERCISE

Run the code above with MLflow installed. Start the MLflow UI (mlflow ui), open it in your browser, and locate your run. Note the parameters and metrics you logged, along with the run metadata MLflow records automatically, such as the source script. Then modify the hyperparameters, run again, and compare the two runs in the UI.