Training & optimization

Regularization

Regularization is a set of techniques used during model training to prevent overfitting, where the model memorizes training data instead of learning general patterns. It works by adding a penalty to the loss function or by modifying the training process to constrain model complexity. Common forms include L1 (Lasso) and L2 (Ridge) penalties, which shrink large weights (the L2 form is commonly implemented as weight decay), and dropout, which randomly disables neurons during training. For operators, regularization matters because it directly affects how well a model generalizes to new data; a well-regularized model tends to produce more reliable outputs on diverse prompts.

Deeper dive

Regularization addresses the bias-variance tradeoff: too little regularization leads to high variance (overfitting), while too much leads to high bias (underfitting). L2 regularization adds a term proportional to the sum of squared weights to the loss, encouraging smaller, more evenly distributed weights. Under plain SGD this is equivalent to weight decay; AdamW instead applies the decay directly to the weights at each step, decoupled from the gradient of the loss. L1 regularization adds a term proportional to the sum of absolute weight values, promoting sparsity (many weights become exactly zero). Dropout, introduced by Srivastava et al. (2014), randomly drops a fraction of neurons at each training step, which keeps the network from relying on any single unit and pushes it toward redundant representations. In practice, training scripts often expose a weight_decay hyperparameter (e.g., in PyTorch's torch.optim.AdamW). For local AI operators, understanding regularization helps when fine-tuning models: too little weight decay can cause the fine-tuned model to overfit to a small dataset, while too much can wash out the pre-trained knowledge.
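The penalties and dropout described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not how any particular framework implements them: the function names, the weight values, the 0.30 data loss, and the penalty strength lam are all made up for the example, and the dropout shown is the "inverted" variant that rescales surviving activations during training.

```python
import random

def l1_penalty(weights, lam):
    """L1 (Lasso) term: lam * sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (Ridge) term: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p and
    rescale the survivors by 1 / (1 - p). No-op outside training."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [0.5, -2.0, 0.0, 1.5]   # hypothetical weights
data_loss = 0.30                  # hypothetical unregularized loss
lam = 0.01                        # hypothetical penalty strength

# The regularized loss is the data loss plus the penalty term.
loss_l1 = data_loss + l1_penalty(weights, lam)
loss_l2 = data_loss + l2_penalty(weights, lam)

rng = random.Random(0)
dropped = dropout([1.0] * 10, p=0.5, rng=rng)
```

Note that the L2 term is dominated by the largest weight (-2.0 contributes 4.0 of the 6.5 sum of squares), which is why it discourages a few large weights more strongly than the L1 term does.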

Practical example

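When fine-tuning Llama 3.1 8B on a custom dataset using LoRA, operators typically set a weight decay of 0.01 to 0.1 in the optimizer (e.g., AdamW). If weight decay is too low (e.g., 0.0), the model might memorize a small training set (say, 1,000 examples) and fail on new prompts. If it is too high (e.g., 1.0), the trained weights are pulled strongly toward zero. In a full fine-tune this can erode pre-trained capabilities and produce incoherent text; with LoRA, where the base weights are frozen and only the adapter weights are decayed, the adapter may instead learn too little to change the model's behavior. A common starting point is 0.01, which balances generalization and retention.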
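To get a rough feel for what these weight decay values mean, the sketch below computes how much decoupled (AdamW-style) weight decay alone would shrink a weight over a run, ignoring gradients entirely. The learning rate, the step count, and the function name are assumptions chosen for illustration, and a constant learning rate is assumed (no schedule).

```python
def decay_only_shrink(lr, weight_decay, steps):
    """Factor by which decoupled weight decay alone scales a weight.

    Each AdamW-style step multiplies the weight by (1 - lr * weight_decay)
    before the gradient update; gradient updates are deliberately ignored.
    """
    return (1.0 - lr * weight_decay) ** steps

lr = 2e-4      # assumed learning rate, not taken from the text
steps = 1000   # assumed number of optimizer steps

for wd in (0.0, 0.01, 0.1, 1.0):
    factor = decay_only_shrink(lr, wd, steps)
    print(f"weight_decay={wd:<4} -> weights scaled by {factor:.4f} from decay alone")
```

Under these assumptions a decay of 0.01 barely moves the weights on its own, while 1.0 removes a noticeable fraction of their magnitude over the same run, so the gradient signal has to work against a much stronger pull toward zero.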

Workflow example

In a Hugging Face Transformers training script, weight decay is configured via the TrainingArguments class: training_args = TrainingArguments(..., weight_decay=0.01). In plain PyTorch, it is set on the optimizer: optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01). For dropout, operators set the relevant field in the model configuration before training, e.g., config.dropout = 0.1 on architectures that expose that field; the name varies by architecture (attention_dropout on Llama configs, hidden_dropout_prob on BERT configs). When using LoRA with PEFT, the base model's dropout is often left as-is, but the LoRA adapters can have their own dropout (e.g., lora_dropout=0.05).
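Put together, the settings above might look like the following configuration sketch. It assumes the torch, transformers, and peft packages are installed. The output path, the stand-in Linear model, the LoRA rank and alpha, and the target module names are illustrative assumptions rather than recommendations.

```python
import torch
from peft import LoraConfig
from transformers import TrainingArguments

# Hugging Face Trainer: weight decay is passed to the optimizer the Trainer builds.
training_args = TrainingArguments(
    output_dir="out",      # hypothetical output path
    learning_rate=5e-5,
    weight_decay=0.01,
)

# Plain PyTorch: weight decay is set on the optimizer itself.
model = torch.nn.Linear(16, 4)  # stand-in model, for illustration only
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

# PEFT: dropout on the LoRA adapter, separate from the base model's own dropout.
lora_config = LoraConfig(
    r=8,                                  # assumed adapter rank
    lora_alpha=16,                        # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed Llama-style module names
    task_type="CAUSAL_LM",
)
```

The Trainer route and the manual optimizer route are alternatives: a script would normally use one or the other, not both.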

Reviewed by Fredoline Eruo. See our editorial policy.
