05. DPO Hyperparameters
DPO has fewer hyperparameters than PPO, but the ones that matter have non-obvious effects on training dynamics.
beta (KL coefficient): Controls the strength of the implicit KL penalty that keeps the policy close to the reference model. The default of 0.1 works well for most cases, but the optimal value depends on the quality of your preference data (DPO trains no separate reward model) and on how far you're willing to let the policy deviate from the reference.
- Too high (>0.3): Policy barely changes, alignment improvement is minimal
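- Too low (<0.05): Policy chases the implicit reward aggressively, risk of reward hacking and drift from the reference
- Asymmetric effect: Erring toward a higher beta is safer than erring toward a lower one, because a policy that has over-optimized the preference signal can be hard to recover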
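To make the role of beta concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. The function name `dpo_loss` and its argument names are illustrative assumptions, not part of any library; the inputs are summed log-probabilities of each response under the policy and the reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)). Illustrative helper."""
    # Log-ratio of policy to reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # How much more the policy favors the chosen response than the reference does
    margin = chosen_logratio - rejected_logratio
    x = beta * margin
    # Numerically stable -log(sigmoid(x))
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

Because beta multiplies the log-ratio margin, a higher beta lets the loss saturate at a smaller margin, so the policy has less incentive to move far from the reference. A lower beta keeps the gradient alive until the margin is much larger.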
learning_rate: DPO is more sensitive to learning rate than SFT. Because the loss depends on log-ratio differences, large gradient updates can destabilize the policy. Typical range: 5e-7 to 1e-6 for models larger than 7B parameters.
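# Beta sweep example
# Assumes TRL's DPOConfig, which extends TrainingArguments and accepts beta;
# plain transformers.TrainingArguments has no beta argument.
from trl import DPOConfig

for beta in [0.05, 0.1, 0.2, 0.5]:
    args = DPOConfig(
        output_dir=f"./dpo_beta_{beta}",
        learning_rate=5e-7,
        beta=beta,
        # ... other args
    )
    # Train with these args, then evaluate each run on a held-out preference test set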
num_train_epochs: Unlike SFT, where more epochs often help, DPO can overfit to preference noise. The optimal epoch count depends on dataset size and noise level. With noisy preference data (for example, 50-60% annotator agreement), 1-2 epochs is often sufficient. With cleaner data, you might use 3-5 epochs.
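One way to choose the epoch count empirically is to track held-out preference accuracy after each epoch and keep the epoch where it peaks. The sketch below assumes you already have the implicit rewards (beta times the policy/reference log-ratio) for each chosen and rejected response. Both helper names are hypothetical.

```python
def preference_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response gets the higher implicit reward."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    return sum(c > r for c, r in pairs) / len(pairs)

def best_epoch(eval_accuracy_by_epoch):
    """1-indexed epoch with the highest held-out accuracy (earliest on ties)."""
    best_index = max(range(len(eval_accuracy_by_epoch)),
                     key=lambda i: eval_accuracy_by_epoch[i])
    return best_index + 1
```

If training accuracy keeps rising while held-out accuracy falls after the epoch returned by `best_epoch`, that is a sign the extra epochs are fitting label noise.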
gradient_accumulation_steps: Higher values let you simulate larger effective batch sizes, which stabilizes updates. The downside: each epoch contains fewer optimizer steps, so convergence can take longer in wall-clock time even if each update is less noisy.
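The trade-off is easy to see with a little arithmetic. The helpers below are an illustrative sketch (the names are not from any library), and the step count is an approximation that ignores how a given trainer handles a partial final batch.

```python
import math

def effective_batch_size(per_device_batch_size, gradient_accumulation_steps,
                         num_devices=1):
    """Number of preference pairs contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

def optimizer_steps_per_epoch(num_examples, per_device_batch_size,
                              gradient_accumulation_steps, num_devices=1):
    """Approximate number of weight updates in one pass over the data."""
    batch = effective_batch_size(per_device_batch_size,
                                 gradient_accumulation_steps, num_devices)
    return math.ceil(num_examples / batch)
```

Doubling `gradient_accumulation_steps` doubles the effective batch size and roughly halves the number of updates per epoch, so the learning rate and epoch count may need retuning alongside it.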
label_smoothing_factor: This is an underutilized parameter (TRL's DPOConfig exposes it as label_smoothing). It assumes each preference label has some probability of being flipped, so "chosen" responses are not treated as fully preferred, which helps with noisy labels. Values of 0.1-0.2 can improve resistance to annotation noise.
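A minimal sketch of how label smoothing changes the loss, following the conservative-DPO formulation in which a label is assumed wrong with probability `label_smoothing`. The function names are illustrative, and `margin` is the chosen-minus-rejected log-ratio difference between the policy and the reference.

```python
import math

def _log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def smoothed_dpo_loss(margin, beta=0.1, label_smoothing=0.0):
    """DPO loss where each label is assumed flipped with probability label_smoothing."""
    x = beta * margin
    return (-(1.0 - label_smoothing) * _log_sigmoid(x)
            - label_smoothing * _log_sigmoid(-x))
```

With `label_smoothing=0.0` this reduces to the standard DPO loss. With a positive value, the loss is minimized at a finite margin instead of rewarding an ever-larger one, which limits how confidently the policy can fit any single, possibly mislabeled, pair.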
Run a hyperparameter sweep over beta and learning_rate using Weights & Biases or a similar tool. Use at least 3 values for each parameter (at least 9 runs in total). Track both the training loss and a held-out evaluation metric. Plot the interaction effect and identify the sweet spot for your specific dataset.
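As a starting point, the grid itself can be built in a few lines. This sketch only enumerates the configurations. The specific values, the `build_sweep` helper, and the output directory naming are assumptions for illustration, and training and experiment logging are left to whichever tools you use.

```python
from itertools import product

# Illustrative values; adjust to your model and dataset
betas = [0.05, 0.1, 0.2]
learning_rates = [5e-7, 7e-7, 1e-6]

def build_sweep(betas, learning_rates):
    """Return one config dict per (beta, learning_rate) combination."""
    return [
        {
            "beta": beta,
            "learning_rate": lr,
            "output_dir": f"./dpo_beta_{beta}_lr_{lr}",
        }
        for beta, lr in product(betas, learning_rates)
    ]

runs = build_sweep(betas, learning_rates)
```

Each dict in `runs` can be passed to your training entry point, with the held-out metric recorded per run so the beta and learning_rate interaction can be plotted as a 3x3 grid.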