RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RLHF, DPO, and PPO

05. DPO Hyperparameters

Chapter 5 of 24 · 15 min
KEY INSIGHT

The interaction between beta and learning_rate is critical. A high learning rate combined with a low beta is a recipe for reward hacking, so watch for sudden jumps in the implicit reward metrics during training. If a metric improves too quickly (more than 10% in a single epoch), your learning rate is probably too high.

DPO has fewer hyperparameters than PPO, but the ones that matter have non-obvious effects on training dynamics.

beta (KL coefficient): This controls the strength of the implicit KL penalty that keeps the policy close to the reference model. The default of 0.1 works well for most cases, but the optimal value depends on the quality of your preference data (DPO has no separate reward model) and on how far you are willing to let the policy deviate from the reference.

  • Too high (>0.3): Policy barely changes, alignment improvement is minimal
  • Too low (<0.05): Policy chases reward aggressively, risk of reward hacking
  • Asymmetric effect: Increasing beta is safer than decreasing it, because a policy that has over-optimized the preference signal can be hard to recover
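
To make the role of beta concrete, here is a minimal sketch of the standard per-pair DPO loss in plain Python. The function name and the log-probability values are illustrative assumptions, not taken from any training run; a real trainer computes the same quantity on batched tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) written in a form that does not overflow for large |x|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up sequence log-probabilities, for illustration only
for beta in (0.05, 0.1, 0.3):
    loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=beta)
    print(f"beta={beta}: loss={loss:.4f}")
```

When the policy equals the reference, the loss is log 2 regardless of beta. Beta only changes how strongly a given gap between the two log-ratios moves the loss.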

learning_rate: DPO is more sensitive to learning rate than SFT. Because the loss depends on log-ratio differences, large gradient updates can destabilize the policy. Typical range: 5e-7 to 1e-6 for models larger than 7B parameters.
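
The 10%-per-epoch rule of thumb from the key insight can be checked mechanically. The sketch below is one possible reading: it assumes "10%" means a relative gain over the previous epoch and that you have already logged one metric value per epoch. The function name and the sample values are hypothetical.

```python
def flag_fast_improvements(metric_by_epoch, max_relative_gain=0.10):
    """Return 1-based epoch numbers where the metric rose by more than
    max_relative_gain relative to the previous epoch."""
    flagged = []
    for i in range(1, len(metric_by_epoch)):
        prev, cur = metric_by_epoch[i - 1], metric_by_epoch[i]
        # Skip non-positive baselines, where a relative gain is not meaningful
        if prev > 0 and (cur - prev) / prev > max_relative_gain:
            flagged.append(i + 1)
    return flagged

# Made-up per-epoch reward accuracies
history = [0.50, 0.52, 0.60, 0.61]
print(flag_fast_improvements(history))
```

A flagged epoch is a prompt to inspect samples and consider a lower learning rate, not proof of reward hacking.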

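# Beta sweep example
# beta is a DPOConfig argument in TRL; plain TrainingArguments does not accept it
from trl import DPOConfig

for beta in [0.05, 0.1, 0.2, 0.5]:
    args = DPOConfig(
        output_dir=f"./dpo_beta_{beta}",
        learning_rate=5e-7,
        beta=beta,
        # ... other args
    )
    # Train with DPOTrainer(args=args, ...), then evaluate each run
    # on a held-out preference test set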

num_train_epochs: Unlike SFT, where more epochs often help, DPO can overfit to preference noise. The optimal epoch count depends on dataset size and noise level. With noisy preference data (50-60% annotator agreement), 1-2 epochs are often sufficient. With cleaner data, you might use 3-5 epochs.
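
One way to choose the epoch count empirically is to evaluate on held-out preference pairs after every epoch and stop once accuracy stops improving. This is a minimal sketch of that idea; the function name, the patience rule, and the accuracy values are assumptions for illustration.

```python
def early_stop_epoch(heldout_acc_by_epoch, patience=1):
    """Return the 1-based epoch with the best held-out accuracy, stopping
    after `patience` consecutive epochs without improvement."""
    best_epoch, best_acc, bad_epochs = 0, float("-inf"), 0
    for epoch, acc in enumerate(heldout_acc_by_epoch, start=1):
        if acc > best_acc:
            best_epoch, best_acc, bad_epochs = epoch, acc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch

# Made-up held-out preference accuracies, one per epoch
print(early_stop_epoch([0.61, 0.66, 0.65, 0.70], patience=1))
```

A larger patience tolerates noisy evaluation curves at the cost of extra training time.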

gradient_accumulation_steps: Higher values let you simulate larger effective batch sizes, which stabilizes updates. The downside: you update less frequently, which can make learning slower in wall-clock time even if sample efficiency is higher.
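
The trade-off is easy to see with a little arithmetic. The sketch below assumes the usual definition of effective batch size (per-device batch × accumulation steps × devices) and approximates optimizer steps per epoch with a ceiling; a specific trainer may handle the final partial batch differently. All names and numbers are illustrative.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of preference pairs contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

def optimizer_steps_per_epoch(num_pairs, per_device_batch, grad_accum_steps, num_devices=1):
    """Approximate number of weight updates in one pass over the data."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_pairs / batch)

# Hypothetical setup: 10,000 preference pairs, batch of 2 per device, 4 devices
for accum in (1, 8, 32):
    print(accum,
          effective_batch_size(2, accum, 4),
          optimizer_steps_per_epoch(10_000, 2, accum, 4))
```

Raising the accumulation steps increases the effective batch size and reduces the number of updates per epoch by the same factor.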

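label_smoothing_factor: This is an underutilized parameter. It assumes that a small fraction of preference labels may be wrong, so the loss no longer treats every chosen response as fully preferred, which helps with noisy labels. Values of 0.1-0.2 can significantly improve resistance to annotation noise. In TRL, the DPO-specific setting is exposed on DPOConfig as label_smoothing; label_smoothing_factor is the generic TrainingArguments option, so check which one your trainer version actually applies to the DPO loss.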
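
A common formulation of label smoothing for DPO (often called conservative DPO) mixes the loss for the labeled preference with the loss for the flipped preference. The sketch below assumes that formulation; the function names are hypothetical, and `margin` stands for beta times the difference between the chosen and rejected log-ratios.

```python
import math

def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def smoothed_dpo_loss(margin, label_smoothing=0.0):
    """Weight the labeled preference by (1 - eps) and the flipped one by eps."""
    eps = label_smoothing
    return (1 - eps) * neg_log_sigmoid(margin) + eps * neg_log_sigmoid(-margin)

# Illustrative margins: without smoothing, larger margins always lower the loss;
# with smoothing, very large margins are penalized again.
for margin in (0.5, 2.0, 5.0):
    print(margin,
          round(smoothed_dpo_loss(margin, 0.0), 4),
          round(smoothed_dpo_loss(margin, 0.1), 4))
```

With smoothing eps, the loss is minimized at a finite margin of log((1 - eps) / eps), so the policy is not pushed to separate every pair without bound. That is the mechanism behind the noise resistance.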

EXERCISE

Run a hyperparameter sweep over beta and learning_rate using Weights & Biases or similar. Use at least 3 values for each parameter (at least 9 runs in total). Track both the training loss and a held-out evaluation metric. Plot the interaction effect and identify the sweet spot for your specific dataset.
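
As a starting point for the exercise, the grid itself can be generated with the standard library before wiring it into whichever tracking tool you use. This is only a scaffold: the value lists, the output_dir naming, and the helper names are assumptions, and the training and evaluation steps are left to you.

```python
from itertools import product

betas = [0.05, 0.1, 0.2]
learning_rates = [2.5e-7, 5e-7, 1e-6]

def build_sweep(betas, learning_rates):
    """One config dict per (beta, learning_rate) combination."""
    return [
        {
            "beta": beta,
            "learning_rate": lr,
            "output_dir": f"./dpo_beta_{beta}_lr_{lr:g}",
        }
        for beta, lr in product(betas, learning_rates)
    ]

def best_run(results):
    """results: list of (config, heldout_metric) pairs; higher metric is better."""
    return max(results, key=lambda pair: pair[1])[0]

configs = build_sweep(betas, learning_rates)
print(len(configs), "runs")
for cfg in configs:
    print(cfg)
```

After each run, append `(cfg, heldout_metric)` to a list and pass it to `best_run`. Plotting the metric as a beta-by-learning_rate heatmap shows the interaction directly.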
