Weights and Biases — Custom Training Pipelines (Chapter 15)

Weights & Biases (wandb) is a hosted experiment-tracking service with a polished web UI, and adding it to an existing training loop takes only a few lines of code. It is widely used by teams that want fast iteration without running their own tracking infrastructure.

Basic Wandb Integration

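import torch
import wandb

# build_model, train_epoch, validate, train_loader, and val_loader are
# assumed to be defined elsewhere in your training code.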

def train_with_wandb(config):
    wandb.init(
        project="my-project",
        entity="my-team",
        name=config.run_name,
        config={
            "lr": config.lr,
            "batch_size": config.batch_size,
            "model": config.model_name
        }
    )
    
    model = build_model(config)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.lr)
    
    for epoch in range(config.epochs):
        train_loss = train_epoch(model, train_loader, optimizer)
        val_loss = validate(model, val_loader)
        
        wandb.log({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "epoch": epoch
        })
    
    wandb.finish()
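
The function above reads `config.run_name`, `config.lr`, and similar attributes but does not show where `config` comes from. As a minimal sketch, assuming a plain dataclass (the names `TrainConfig`, `to_wandb_config`, and `run_offline`, and all default values, are hypothetical), the configuration and the dictionary passed to `wandb.init` might look like this. The run uses `mode="offline"` so it can be tried without an account, and `wandb` is imported inside the function only so that the helper stays usable when the package is not installed.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Hypothetical defaults, chosen only for illustration.
    run_name: str = "baseline"
    lr: float = 3e-4
    batch_size: int = 32
    model_name: str = "resnet18"
    epochs: int = 3


def to_wandb_config(config: TrainConfig) -> dict:
    """Pick the hyperparameters that should appear in the W&B run config."""
    return {
        "lr": config.lr,
        "batch_size": config.batch_size,
        "model": config.model_name,
    }


def run_offline(config: TrainConfig) -> None:
    """Log placeholder metrics to a local, offline run."""
    import wandb  # imported lazily; requires `pip install wandb`

    # Using the run as a context manager finishes it even if training raises.
    with wandb.init(
        project="my-project",
        name=config.run_name,
        config=to_wandb_config(config),
        mode="offline",  # write locally instead of to the hosted service
    ) as run:
        for epoch in range(config.epochs):
            # Placeholder value standing in for a real training loss.
            train_loss = 1.0 / (epoch + 1)
            run.log({"train_loss": train_loss, "epoch": epoch})
```

Calling `run_offline(TrainConfig())` writes the run under a local `wandb/` directory, and `wandb sync` can upload it later.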

Logging Artifacts and Media

# Log model checkpoints
artifact = wandb.Artifact("model", type="model")
artifact.add_file("best_model.pt")
wandb.log_artifact(artifact)

# Log images (for vision models)
images = wandb.Image(
    sample_images,
    caption="Predictions vs Ground Truth"
)
wandb.log({"samples": images})

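# Log histograms (for gradient debugging); call after loss.backward()
grad_hists = {
    f"gradients/{name}": wandb.Histogram(param.grad.detach().cpu().numpy())
    for name, param in model.named_parameters()
    if param.grad is not None
}
wandb.log(grad_hists)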
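
A logged artifact is usually consumed by a later run, for example an evaluation job. The sketch below assumes the checkpoint was logged under the artifact name `model`, as in the snippet above; the helper names `artifact_ref` and `fetch_checkpoint` are hypothetical. W&B addresses artifact versions as `name:alias`, optionally prefixed with `entity/project/`, and `latest` is the alias it assigns to the newest version.

```python
from typing import Optional


def artifact_ref(name: str, alias: str = "latest",
                 entity: Optional[str] = None,
                 project: Optional[str] = None) -> str:
    """Build an artifact reference such as 'my-team/my-project/model:latest'."""
    prefix = f"{entity}/{project}/" if entity and project else ""
    return f"{prefix}{name}:{alias}"


def fetch_checkpoint(ref: str) -> str:
    """Download the artifact's files and return the local directory path."""
    import wandb  # imported lazily; requires `pip install wandb` and a login

    with wandb.init(project="my-project", job_type="evaluation") as run:
        # use_artifact records this run as a consumer of the artifact.
        artifact = run.use_artifact(ref, type="model")
        return artifact.download()
```

For example, `fetch_checkpoint(artifact_ref("model"))` would return a directory containing `best_model.pt`, assuming that file was added to the artifact when it was logged.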

Sweep for Hyperparameter Search

# sweep.yaml
program: train.py  # assumed entry-point script that `wandb agent` runs; replace with your own
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  lr:
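    # log_uniform_values reads min/max as actual values;
    # log_uniform would treat them as exponents.
    # Bounds are written in decimal form because some YAML parsers
    # read 1e-5 as a string.
    distribution: log_uniform_values
    min: 0.00001
    max: 0.01
  batch_size:
    values: [16, 32, 64]
  weight_decay:
    min: 0.000001
    max: 0.01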

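# Initialize the sweep. The CLI prints "wandb agent <entity>/<project>/<sweep_id>"
# on stderr, so redirect stderr and extract the path that follows "wandb agent".
SWEEP_ID=$(wandb sweep sweep.yaml --project my-project 2>&1 | grep -oP 'wandb agent \K\S+')

# Start an agent
wandb agent "$SWEEP_ID"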
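
The same sweep can also be defined and launched from Python instead of through a YAML file and the CLI. The following is a minimal sketch under stated assumptions: `fake_val_loss` is a hypothetical stand-in for real validation, and the function names `train` and `launch` are illustrative. The configuration uses `log_uniform_values`, the distribution whose `min` and `max` are the actual bounds rather than exponents.

```python
import math

# Python equivalent of the sweep configuration.
SWEEP_CONFIG = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [16, 32, 64]},
        "weight_decay": {"min": 1e-6, "max": 1e-2},
    },
}


def fake_val_loss(lr: float, weight_decay: float, epoch: int) -> float:
    """Hypothetical stand-in for real validation; lowest near lr = 1e-3."""
    return (math.log10(lr) + 3.0) ** 2 + weight_decay + 1.0 / (epoch + 1)


def train() -> None:
    """Run one sweep trial; the agent injects hyperparameters into run.config."""
    import wandb  # imported lazily; requires `pip install wandb`

    with wandb.init() as run:
        cfg = run.config
        for epoch in range(3):
            run.log({
                "val_loss": fake_val_loss(cfg.lr, cfg.weight_decay, epoch),
                "epoch": epoch,
            })


def launch(count: int = 5) -> None:
    """Create the sweep and run `count` trials in this process."""
    import wandb

    sweep_id = wandb.sweep(sweep=SWEEP_CONFIG, project="my-project")
    wandb.agent(sweep_id, function=train, project="my-project", count=count)
```

Calling `launch()` requires a logged-in W&B account. The metric name logged in `train` has to match `metric.name` in the configuration, otherwise the Bayesian search has nothing to optimize.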

Wandb vs MLflow Tradeoffs

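Feature	W&B	MLflow
Setup complexity	Minimal (hosted; needs an account and API key)	Self-hosted tracking server or Databricks
Cost	Free tier with usage limits; paid plans beyond it	Open-source; you pay for your own hosting
Hyperparameter search	Built-in sweeps	Requires integration with an external tuning tool
Enterprise features	Paid plans	Open-source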

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
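
One way to capture the package version and runtime for this checkpoint is a small script like the one below. It is a sketch that uses only the standard library; the function name `environment_record` and the default package list are assumptions, so adjust them to whatever the exercise actually imports.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages=("wandb", "torch")) -> dict:
    """Collect interpreter, OS, and package versions for a reproducibility note."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


if __name__ == "__main__":
    # Print the record so it can be pasted next to the exercise notes.
    print(json.dumps(environment_record(), indent=2))
```

The data path and the observed output still have to be recorded by hand, since they depend on the exercise being run.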