15. Weights and Biases
Weights & Biases (wandb) is a hosted experiment-tracking service with a polished web UI that takes only a few lines of code to integrate. It is a common choice for teams that want fast iteration without running their own tracking infrastructure.
Basic Wandb Integration
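import torch
import wandb

def train_with_wandb(config):
    # Start a run; the config dict is stored with the run and shown in the UI
    wandb.init(
        project="my-project",
        entity="my-team",
        name=config.run_name,
        config={
            "lr": config.lr,
            "batch_size": config.batch_size,
            "model": config.model_name,
        },
    )
    # build_model, train_epoch, validate, train_loader and val_loader
    # are assumed to be defined elsewhere in your training script
    model = build_model(config)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.lr)
    for epoch in range(config.epochs):
        train_loss = train_epoch(model, train_loader, optimizer)
        val_loss = validate(model, val_loader)
        # One log call per epoch keeps both losses on the same step
        wandb.log({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "epoch": epoch,
        })
    # Mark the run as finished and upload any pending data
    wandb.finish()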
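If training raises an exception, an explicit `wandb.finish()` at the end of the function is never reached. One option is to use the run returned by `wandb.init` as a context manager, which finishes the run when the block exits. The sketch below is a minimal illustration under stated assumptions: `fake_epoch` is a hypothetical stand-in for real training and validation, and `mode="offline"` is chosen so the example can run without an account.

```python
import math


def fake_epoch(epoch: int, lr: float) -> dict:
    """Hypothetical stand-in for train_epoch/validate: returns decaying losses."""
    train_loss = math.exp(-lr * 100 * (epoch + 1))
    return {"train_loss": train_loss, "val_loss": train_loss * 1.1, "epoch": epoch}


def train(lr: float = 0.01, epochs: int = 3, mode: str = "offline") -> None:
    import wandb  # imported lazily so fake_epoch can be used without wandb installed

    # The run object is a context manager: leaving the block finishes the run,
    # including when an exception is raised inside it.
    with wandb.init(
        project="my-project",
        config={"lr": lr, "epochs": epochs},
        mode=mode,
    ) as run:
        for epoch in range(epochs):
            run.log(fake_epoch(epoch, lr))
```

Calling `train()` writes the run to a local `wandb/` directory; offline runs can be uploaded later with `wandb sync`.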
Logging Artifacts and Media
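# These calls assume an active run started with wandb.init()
# Log model checkpoints
artifact = wandb.Artifact("model", type="model")
artifact.add_file("best_model.pt")
wandb.log_artifact(artifact)
# Log images (for vision models)
images = wandb.Image(
    sample_images,
    caption="Predictions vs Ground Truth",
)
wandb.log({"samples": images})
# Log histograms (for gradient debugging); call after loss.backward()
# so that .grad is populated, then flatten all gradients into one array
grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
wandb.log({"gradients": wandb.Histogram(torch.cat(grads).cpu().numpy())})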
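A logged artifact is mostly useful when a later run retrieves it. The sketch below shows one way to do that with `use_artifact` and `download`, assuming the artifact name `"model"` and the `my-team` / `my-project` names used earlier in this chapter. `artifact_ref` and `download_model` are hypothetical helpers introduced only for this illustration.

```python
def artifact_ref(entity: str, project: str, name: str, alias: str = "latest") -> str:
    """Build a fully qualified artifact reference: entity/project/name:alias."""
    return f"{entity}/{project}/{name}:{alias}"


def download_model(entity: str = "my-team", project: str = "my-project") -> str:
    import wandb  # lazy import: only needed when actually downloading

    with wandb.init(project=project, entity=entity, job_type="evaluation") as run:
        # use_artifact records this run as a consumer of the artifact,
        # so the lineage between training and evaluation is visible in the UI.
        artifact = run.use_artifact(artifact_ref(entity, project, "model"), type="model")
        # download() returns the local directory holding the artifact's files
        return artifact.download()
```

The `latest` alias follows the most recently logged version; pinning a specific version such as `v3` is usually the safer choice when reproducing a result.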
Sweep for Hyperparameter Search
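# sweep.yaml
program: train.py  # placeholder: the script each agent should run
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  lr:
    # log_uniform_values treats min/max as the bounds themselves;
    # plain log_uniform would interpret them as exponents
    distribution: log_uniform_values
    # written in decimal form because some YAML parsers read 1e-5 as a string
    min: 0.00001
    max: 0.01
  batch_size:
    values: [16, 32, 64]
  weight_decay:
    min: 0.000001
    max: 0.01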
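The `lr` range above spans three orders of magnitude, which is why it is searched on a log scale rather than uniformly. The sketch below illustrates what sampling uniformly in log space between 1e-5 and 1e-2 means, using only the standard library. It is an illustration of the idea, not the sampler wandb itself uses, and `sample_log_uniform` is a hypothetical helper name.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(10_000)]

# Under log-uniform sampling each decade receives roughly a third of the draws;
# uniform sampling would put about 1% of the draws below 1e-4.
below_1e4 = sum(s < 1e-4 for s in samples) / len(samples)
print(f"fraction of samples below 1e-4: {below_1e4:.2f}")
```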
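# Initialize sweep; wandb writes its status messages to stderr, hence 2>&1.
# The sweep ID is the last path segment of the printed sweep URL.
SWEEP_ID=$(wandb sweep sweep.yaml --project my-project 2>&1 | grep -oP 'sweeps/\K\w+')
# Start agent (run the same command on other machines to parallelize trials)
wandb agent --project my-project "$SWEEP_ID"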
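The same sweep can also be created and run from Python with `wandb.sweep` and `wandb.agent`, which avoids parsing CLI output for the sweep ID. The sketch below is a minimal version under stated assumptions: `train_one_trial` is a hypothetical trial function with a placeholder objective, and the config uses `log_uniform_values`, which treats `min` and `max` as the bounds themselves.

```python
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [16, 32, 64]},
        "weight_decay": {"min": 1e-6, "max": 1e-2},
    },
}


def train_one_trial() -> None:
    import wandb  # lazy import so the config above can be inspected without wandb

    # Inside an agent, wandb.init() receives the hyperparameters chosen for this trial.
    with wandb.init() as run:
        lr = run.config["lr"]
        batch_size = run.config["batch_size"]
        # Placeholder objective; a real trial would train and validate a model here.
        run.log({"val_loss": lr * batch_size})


def run_sweep(count: int = 5) -> None:
    import wandb

    sweep_id = wandb.sweep(sweep_config, project="my-project")
    # count caps how many trials this agent runs before it exits
    wandb.agent(sweep_id, function=train_one_trial, count=count)
```

The metric name logged by the trial must match `metric.name` in the config, otherwise the Bayesian search has nothing to optimize.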
Wandb vs MLflow Tradeoffs
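| Feature | Wandb | MLflow |
|---|---|---|
| Setup complexity | Minimal (hosted; needs an account and API key) | Self-hosted server or managed (e.g. Databricks) |
| Cost | Free tier with limits; paid plans beyond it | Open-source; you pay for your own hosting |
| Hyperparameter search | Built-in sweeps | Requires integrating an external tuner |
| Enterprise features | Paid | Open-source core; managed offerings are paid |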
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Integrate wandb into your training script. Run 3 experiments and compare results in the web UI.