DPO Hyperparameters — RLHF, DPO, and PPO (Chapter 5)

DPO has fewer hyperparameters than PPO, but the ones that matter have non-obvious effects on training dynamics.

beta (KL coefficient): This controls the strength of the implicit KL penalty that keeps the policy close to the reference model. The default of 0.1 works well for most cases, but the optimal value depends on the quality of your preference data (DPO has no separate reward model) and on how far you're willing to let the policy deviate from the reference.

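Too high (>0.3): The policy barely moves away from the reference, so alignment improvement is minimal.
Too low (<0.05): The policy fits the preference pairs aggressively and drifts far from the reference, risking overfitting and degraded generations (the DPO analogue of reward hacking).
Asymmetric effect: Erring toward a higher beta is safer than erring toward a lower one. An under-optimized model can be trained further, while an over-optimized model can be hard to recover.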
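
To make the role of beta concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. The function name and the scalar log-probability arguments are illustrative assumptions, not a library API; real trainers compute the same quantity on batched tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Same shift away from the reference (4 nats of margin), different beta values
for beta in [0.05, 0.1, 0.2, 0.5]:
    loss = dpo_loss(-8.0, -12.0, -10.0, -10.0, beta=beta)
    print(f"beta={beta}: loss={loss:.3f}")
```

With a high beta, a small move away from the reference already drives the loss toward zero, so the policy has little incentive to move further. With a low beta, the same move leaves the loss close to log 2, so the optimizer keeps pushing the policy away from the reference.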

learning_rate: DPO is more sensitive to learning rate than SFT. Because the loss depends on log-ratio differences, large gradient updates can destabilize the policy. Typical range: 5e-7 to 1e-6 for models larger than 7B parameters.

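# Beta sweep example (assumes TRL's DPOConfig; plain TrainingArguments has no beta)
from trl import DPOConfig

for beta in [0.05, 0.1, 0.2, 0.5]:
    args = DPOConfig(
        output_dir=f"./dpo_beta_{beta}",
        learning_rate=5e-7,
        beta=beta,  # strength of the implicit KL penalty
        # ... other args
    )
    # Pass args to a DPOTrainer, train, then evaluate each run
    # on a held-out preference test set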

num_train_epochs: Unlike SFT, where more epochs often help, DPO can overfit to preference noise. The optimal epoch count depends on dataset size and noise level. With noisy preference data (inter-annotator agreement of only 50-60%, barely above chance), 1-2 epochs is often sufficient. With cleaner data, 3-5 epochs may be appropriate.
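
One way to choose the epoch count empirically is to track preference accuracy on a held-out set after each epoch and keep the best checkpoint. The sketch below assumes you already have implicit rewards (beta times the policy-to-reference log-ratio) for each held-out pair; both helper names are hypothetical, and the accuracy values in the usage lines are illustrative placeholders, not measurements.

```python
def preference_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of held-out pairs where the chosen response gets the higher implicit reward."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    if not pairs:
        raise ValueError("need at least one preference pair")
    return sum(c > r for c, r in pairs) / len(pairs)

def best_epoch(accuracy_by_epoch):
    """Return the 1-indexed epoch with the highest held-out accuracy (earliest on ties)."""
    best_index = max(range(len(accuracy_by_epoch)), key=lambda i: accuracy_by_epoch[i])
    return best_index + 1

# Illustrative placeholder values: accuracy peaks early, then declines as the
# model starts fitting label noise.
heldout_accuracy = [0.61, 0.66, 0.64, 0.62]
print(best_epoch(heldout_accuracy))
```

A held-out accuracy that peaks and then falls while training accuracy keeps rising is the usual sign that additional epochs are fitting noise rather than preferences.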

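gradient_accumulation_steps: Higher values let you simulate larger effective batch sizes without extra memory, which reduces gradient noise and stabilizes updates. The downside: each optimizer step now consumes several forward and backward passes, so the policy is updated less frequently per epoch, which can make learning slower in wall-clock time even when each individual update is more reliable.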
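
The trade-off is simple arithmetic, sketched below. The helper names and the example sizes are assumptions for illustration; actual trainers may treat the final partial batch differently.

```python
import math

def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of preference pairs contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

def optimizer_steps_per_epoch(num_pairs, per_device_batch_size,
                              gradient_accumulation_steps, num_devices=1):
    """Approximate optimizer updates per epoch (a partial final batch counts as one step)."""
    batch = effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices)
    return math.ceil(num_pairs / batch)

# Hypothetical setup: 10,000 preference pairs, batch size 4 per device, 2 devices
for accum in [1, 8]:
    print(accum,
          effective_batch_size(4, accum, 2),
          optimizer_steps_per_epoch(10_000, 4, accum, 2))
```

In this hypothetical setup, going from 1 to 8 accumulation steps raises the effective batch from 8 to 64 pairs and cuts the updates per epoch from 1250 to 157.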

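label_smoothing_factor: This is an underutilized parameter. (In TRL's DPOConfig the DPO-specific setting is named label_smoothing; the label_smoothing_factor inherited from TrainingArguments is a separate cross-entropy setting.) It assumes each preference label has a small probability of being flipped, so the loss stops pushing the chosen-versus-rejected margin ever higher on pairs that may be mislabeled. Values of 0.1-0.2 can noticeably improve robustness to annotation noise.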
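
The effect of label smoothing on the DPO objective can be sketched as follows. This is a minimal illustration of the conservative DPO variant, in which a fraction of the loss is assigned to the flipped label; the function names are hypothetical and the inputs are scalars rather than tensors.

```python
import math

def _neg_log_sigmoid(x):
    # -log(sigmoid(x)) in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

def smoothed_dpo_loss(logratio_margin, beta=0.1, label_smoothing=0.0):
    """DPO loss where each label is assumed flipped with probability label_smoothing.

    logratio_margin is (chosen log-ratio) - (rejected log-ratio) between
    the policy and the reference model.
    """
    logits = beta * logratio_margin
    return ((1.0 - label_smoothing) * _neg_log_sigmoid(logits)
            + label_smoothing * _neg_log_sigmoid(-logits))

# With smoothing, very large margins are penalized instead of rewarded
for margin in [0.0, 20.0, 100.0]:
    plain = smoothed_dpo_loss(margin, label_smoothing=0.0)
    smooth = smoothed_dpo_loss(margin, label_smoothing=0.1)
    print(f"margin={margin}: plain={plain:.3f} smoothed={smooth:.3f}")
```

Without smoothing, the loss keeps falling as the margin grows, so a mislabeled pair can pull the policy arbitrarily far. With smoothing value eps, the loss is minimized at a finite point, beta times the margin equal to log((1 - eps) / eps), which caps how hard any single pair can push.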