12. Optimizers and Schedulers
The optimizer and learning rate schedule matter as much as the model architecture. A poor choice can cause divergence, slow convergence, or poor generalization.
Common Optimizers
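import torch
# AdamW - weight decay is decoupled from the gradient update.
# Note: passing model.parameters() applies decay to every parameter, including
# biases and normalization layers; use parameter groups to exempt those.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01,  # Decoupled from gradient update
    betas=(0.9, 0.999),
    eps=1e-8
)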
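Passing `model.parameters()` to AdamW applies weight decay to every tensor, biases and normalization parameters included. The following is a minimal sketch of exempting those through parameter groups. The helper name `build_param_groups`, the toy model, and the rule "tensors with fewer than two dimensions skip decay" are assumptions made for illustration; the rule is a common heuristic, not something PyTorch enforces.

```python
import torch
from torch import nn


def build_param_groups(model: nn.Module, weight_decay: float = 0.01):
    """Split trainable parameters into a decayed and a non-decayed group.

    Heuristic (an assumption): parameters with fewer than two dimensions
    are biases or normalization scales/shifts, so they skip weight decay.
    """
    decay, no_decay = [], []
    for param in model.parameters():
        if not param.requires_grad:
            continue
        if param.ndim >= 2:
            decay.append(param)      # weight matrices, conv kernels
        else:
            no_decay.append(param)   # biases, BatchNorm/LayerNorm parameters
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Toy model used only to exercise the helper.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Linear(16, 2),
)
optimizer = torch.optim.AdamW(
    build_param_groups(model, weight_decay=0.01),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
)
```

Each dictionary becomes one entry in `optimizer.param_groups`, so the two groups can also be given different learning rates later if needed. Models with unusual layers (for example, embeddings that should not be decayed) may need a name-based rule instead of the dimension check.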
# SGD with momentum - often better generalization
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True
)
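# Lion - sign-based update that keeps only a momentum buffer (less optimizer state than AdamW)
# Often competitive with AdamW; typically used with a smaller lr and a larger weight decay
# pip install lion-pytorch
from lion_pytorch import Lion
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)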
Learning Rate Schedulers
# Step decay - simple, works well
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=30, gamma=0.1
)
# Cosine annealing - smooth decay
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-6
)
# Cosine with warm restarts
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-6
)
# OneCycleLR - often converges quickly; call scheduler.step() after every batch, not every epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,
    epochs=50,
    steps_per_epoch=len(train_loader),
    pct_start=0.3  # First 30% of steps ramp up to max_lr
)
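Where `scheduler.step()` sits in the training loop depends on the scheduler: OneCycleLR is sized in batches (`epochs * steps_per_epoch`), while StepLR and CosineAnnealingLR as configured above count epochs. The sketch below shows the per-batch placement for OneCycleLR. The linear model, the synthetic `train_loader`, and the `lr_history` list are stand-ins assumed for illustration only.

```python
import torch
from torch import nn

# Synthetic stand-ins for the chapter's `model` and `train_loader` (assumptions).
torch.manual_seed(0)
model = nn.Linear(4, 1)
train_loader = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
epochs = 5

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,
    epochs=epochs,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,
)
loss_fn = nn.MSELoss()

lr_history = []  # learning rate in effect for each batch
for epoch in range(epochs):
    for inputs, targets in train_loader:
        lr_history.append(scheduler.get_last_lr()[0])
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR: once per batch, after optimizer.step()
    # An epoch-based scheduler such as StepLR would call step() here instead.
```

Stepping OneCycleLR once per epoch by mistake leaves the learning rate near its small starting value for the whole run, which is easy to miss because training still proceeds, only slowly.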
Warmup is Essential
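Transformers trained without warmup often diverge. A common explanation is that early gradients are large and noisy, and adaptive optimizers such as Adam have not yet accumulated reliable moment estimates, so full-size steps can destabilize the weights. Ramping the learning rate up linearly over the first steps avoids this: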
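class WarmupScheduler:
    """Linear warmup: ramps each param group's lr up to its configured value."""
    def __init__(self, optimizer, warmup_steps):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        # Remember every group's target lr, not just the first group's
        self.base_lrs = [group['lr'] for group in optimizer.param_groups]
        self.step_count = 0
    def step(self):
        """Call before each optimizer.step() so the first update uses the reduced lr."""
        self.step_count += 1
        if self.step_count <= self.warmup_steps:
            scale = self.step_count / self.warmup_steps
            for param_group, base_lr in zip(self.optimizer.param_groups, self.base_lrs):
                param_group['lr'] = base_lr * scale
        # After warmup the lr stays at its base value; hand off to the main scheduler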
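The hand-off from warmup to a main schedule can also be written as a single multiplier on the base learning rate, which avoids coordinating two scheduler objects. The sketch below pairs linear warmup with cosine decay. The function name `warmup_cosine_factor`, its parameters, and the choice of cosine decay after warmup are assumptions for illustration; the warmup part mirrors the `step_count / warmup_steps` ramp above, using a 0-indexed step.

```python
import math


def warmup_cosine_factor(
    step: int,
    warmup_steps: int,
    total_steps: int,
    min_factor: float = 0.0,
) -> float:
    """Multiplier for the base learning rate at a 0-indexed training step."""
    if step < warmup_steps:
        # Linear warmup: reaches 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    # Cosine decay from 1.0 down to min_factor over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / remaining, 1.0)
    return min_factor + 0.5 * (1.0 - min_factor) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 10 warmup steps out of 100 total.
factors = [warmup_cosine_factor(s, warmup_steps=10, total_steps=100) for s in range(101)]
```

A function like this can be passed to `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=...)`, which multiplies each parameter group's initial learning rate by the returned factor and expects `scheduler.step()` once per optimizer step. The step counts shown are placeholders, not recommendations.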
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Plot learning rate vs. training step for OneCycleLR and CosineAnnealingLR on the same axes. Observe how OneCycleLR's rise to max_lr and the anneal that follows differ from the monotonic cosine decay.
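One possible starting point for this exercise is to record the learning rate each scheduler produces without training a real model. In the sketch below, the helper `lr_trace`, the dummy parameter, and the value `total_steps = 500` are assumptions for illustration. CosineAnnealingLR is given `T_max` in steps rather than epochs so that both curves share the same x-axis.

```python
import torch


def lr_trace(make_scheduler, total_steps, base_lr=1e-3):
    """Return the learning rate in effect at each of `total_steps` steps."""
    param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter, never trained
    optimizer = torch.optim.SGD([param], lr=base_lr)
    scheduler = make_scheduler(optimizer)
    lrs = []
    for _ in range(total_steps):
        lrs.append(scheduler.get_last_lr()[0])
        optimizer.step()   # no gradients exist, so the parameter is unchanged
        scheduler.step()
    return lrs


total_steps = 500  # hypothetical, e.g. 50 epochs of 10 batches

one_cycle = lr_trace(
    lambda opt: torch.optim.lr_scheduler.OneCycleLR(
        opt,
        max_lr=1e-3,
        total_steps=total_steps,
        pct_start=0.3,
        cycle_momentum=False,  # only the learning rate is of interest here
    ),
    total_steps,
)
cosine = lr_trace(
    lambda opt: torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=total_steps, eta_min=1e-6
    ),
    total_steps,
)
```

The two lists can then be drawn with any plotting library, for example `plt.plot(one_cycle, label="OneCycleLR")` and `plt.plot(cosine, label="CosineAnnealingLR")` with matplotlib, if it is installed.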