11. Loss Functions
Chapter 11 of 18 · 20 min
A loss function defines what the model is actually optimized for, so it encodes your assumptions about which errors matter. A poorly chosen loss yields a model that optimizes the wrong objective, no matter how well training goes.
Classification Losses
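import torch
import torch.nn.functional as F

def classification_loss(outputs, targets, config):
    # outputs: raw logits of shape (N, C); targets: class indices of shape (N,)
    # Standard cross-entropy
    if config.loss == "ce":
        return F.cross_entropy(outputs, targets)
    # Label smoothing: softens one-hot targets, which often improves calibration
    if config.loss == "label_smoothing":
        return F.cross_entropy(outputs, targets, label_smoothing=0.1)
    # Focal loss for class imbalance: down-weights well-classified examples
    if config.loss == "focal":
        ce_loss = F.cross_entropy(outputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)  # probability assigned to the true class
        focal_loss = (1 - pt) ** 2 * ce_loss  # focusing parameter gamma = 2
        return focal_loss.mean()
    raise ValueError(f"Unknown classification loss: {config.loss}")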
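To see what the focal term does, the following minimal sketch compares it with plain cross-entropy on a toy batch. The standalone `focal_loss` helper, its `gamma` argument, and the example logits are assumptions made for illustration; they are not part of the chapter's `classification_loss` function.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Illustrative helper: scales each example's cross-entropy by (1 - p_t) ** gamma."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-example CE = -log(p_t)
    pt = torch.exp(-ce)                                      # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()

# Hypothetical toy batch: one confident correct prediction, one uncertain prediction.
logits = torch.tensor([[4.0, 0.0], [0.2, 0.0]])
targets = torch.tensor([0, 0])

ce_value = F.cross_entropy(logits, targets)
focal_gamma0 = focal_loss(logits, targets, gamma=0.0)  # gamma = 0 reduces to plain cross-entropy
focal_gamma2 = focal_loss(logits, targets, gamma=2.0)  # confident example contributes far less

print(ce_value.item(), focal_gamma0.item(), focal_gamma2.item())
```

With `gamma=0` the scaling factor is 1 for every example, so the result matches cross-entropy. Larger values of `gamma` shift the loss toward examples the model still gets wrong, which is why the technique is often tried on imbalanced data.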
Regression Losses
def regression_loss(outputs, targets, config):
    # L1 (MAE) - robust to outliers
    if config.loss == "l1":
        return F.l1_loss(outputs, targets)
    # L2 (MSE) - penalizes large errors more
    if config.loss == "mse":
        return F.mse_loss(outputs, targets)
    # Huber - L2 near zero, L1 for large errors
    if config.loss == "huber":
        return F.smooth_l1_loss(outputs, targets, beta=1.0)
    # Quantile (pinball) loss for uncertainty estimation;
    # outputs has shape (N, 3), one column per quantile
    if config.loss == "quantile":
        quantiles = [0.1, 0.5, 0.9]
        losses = []
        for i, q in enumerate(quantiles):
            errors = targets - outputs[:, i]
            # Average over the batch so the result is a scalar
            losses.append(torch.max((q - 1) * errors, q * errors).mean())
        return sum(losses) / len(quantiles)
    raise ValueError(f"Unknown regression loss: {config.loss}")
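The quantile loss is asymmetric, which is easiest to see on a single prediction. The sketch below uses a hypothetical `pinball_loss` helper and made-up values to show that, for the 0.9 quantile, predicting too low costs nine times as much as predicting too high by the same amount.

```python
import torch

def pinball_loss(preds, targets, q):
    """Illustrative helper: quantile (pinball) loss for one quantile level q in (0, 1)."""
    errors = targets - preds
    # Positive errors (under-prediction) are weighted by q,
    # negative errors (over-prediction) by (1 - q).
    return torch.max((q - 1) * errors, q * errors).mean()

target = torch.tensor([10.0])

under = pinball_loss(torch.tensor([9.0]), target, q=0.9)   # about 0.9
over = pinball_loss(torch.tensor([11.0]), target, q=0.9)   # about 0.1

print(under.item(), over.item())
```

This asymmetry is what pushes the prediction for `q=0.9` upward until roughly 90% of the targets fall below it.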
Multi-Task Losses
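Combining losses requires careful weighting, because a task with a larger loss magnitude otherwise dominates the gradient. One option is to learn a weight per task, parameterized here as a log-variance: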
import math
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    def __init__(self, tasks, init_weights=None):
        super().__init__()
        self.tasks = tasks
        if init_weights is None:
            init_weights = {t: 1.0 for t in tasks}
        # Store log-variances so that exp(-log_var) equals the initial weight
        self.log_vars = nn.Parameter(
            torch.tensor([-math.log(init_weights[t]) for t in tasks])
        )

    def forward(self, outputs, targets):
        total_loss = 0
        for i, task in enumerate(self.tasks):
            precision = torch.exp(-self.log_vars[i])
            # Reduce to a scalar per task so tasks with different shapes can be summed
            mse = ((outputs[task] - targets[task]) ** 2).mean()
            total_loss = total_loss + precision * mse + self.log_vars[i]
        return total_loss
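Learned task weights only change if the optimizer knows about them. The self-contained sketch below shows one training step in which the weighting parameters are optimized alongside the model. The `UncertaintyWeighting` class, the task names `"depth"` and `"speed"`, and the random data are all assumptions made for illustration, and the class takes precomputed per-task scalar losses instead of raw outputs.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Illustrative module: combines scalar task losses using learned log-variances."""
    def __init__(self, tasks):
        super().__init__()
        self.tasks = list(tasks)
        # log_var = 0 gives every task an initial weight of exp(0) = 1
        self.log_vars = nn.Parameter(torch.zeros(len(self.tasks)))

    def forward(self, task_losses):
        total = torch.zeros(())
        for i, task in enumerate(self.tasks):
            precision = torch.exp(-self.log_vars[i])
            # The + log_var term keeps the model from driving every weight to zero
            total = total + precision * task_losses[task] + self.log_vars[i]
        return total

torch.manual_seed(0)
model = nn.Linear(4, 2)  # hypothetical model: column 0 is "depth", column 1 is "speed"
weighting = UncertaintyWeighting(["depth", "speed"])

# Pass BOTH parameter sets to the optimizer, otherwise log_vars never update.
optimizer = torch.optim.SGD(
    list(model.parameters()) + list(weighting.parameters()), lr=0.1
)

x, y = torch.randn(8, 4), torch.randn(8, 2)
preds = model(x)
task_losses = {
    "depth": ((preds[:, 0] - y[:, 0]) ** 2).mean(),
    "speed": ((preds[:, 1] - y[:, 1]) ** 2).mean(),
}

loss = weighting(task_losses)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the log-variances start at zero, the first combined loss is simply the sum of the task losses. After that, a task whose loss stays high tends to receive a larger log-variance and therefore a smaller weight.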
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
EXERCISE
Run training with both L1 and L2 losses on your regression dataset. Compare the prediction distributions. Do they differ? Which is more appropriate for your downstream task?
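As a starting point before running the exercise on real data, the sketch below finds the best constant prediction under each loss for a made-up target vector with one outlier. The data and the grid search are assumptions for illustration only; a trained model will behave less cleanly, but the same pull toward the mean (L2) or the median (L1) applies.

```python
import torch

# Hypothetical targets with a single large outlier.
targets = torch.tensor([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant predictions: 0.0, 0.5, ..., 100.0
candidates = torch.arange(0.0, 100.5, 0.5)

# diffs[i, j] = candidates[i] - targets[j]
diffs = candidates[:, None] - targets[None, :]
l1_per_candidate = diffs.abs().mean(dim=1)   # mean absolute error for each candidate
l2_per_candidate = (diffs ** 2).mean(dim=1)  # mean squared error for each candidate

best_l1 = candidates[l1_per_candidate.argmin()]
best_l2 = candidates[l2_per_candidate.argmin()]

print(best_l1.item())  # 3.0, the median of the targets
print(best_l2.item())  # 22.0, the mean of the targets
```

The outlier drags the L2 optimum far from most of the data, while the L1 optimum stays at the median. Which behavior is preferable depends on whether large errors on rare points matter for the downstream task.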