DPO Theory — RLHF, DPO, and PPO (Chapter 3)

Direct Preference Optimization (DPO) reframes preference-based alignment so that no separate reward model has to be trained. The key insight: for any reward function, the optimal policy under the KL-regularized RLHF objective can be written analytically, and inverting that relationship yields a training objective defined directly on the policy.

The standard RLHF objective maximizes expected reward while penalizing deviation from a reference policy:

max_π E_{x~D, y~π(·|x)} [r(x,y)] - β * KL[π(y|x) || π_ref(y|x)]

where r(x,y) is the reward, D is the distribution of prompts, π_ref is the reference policy (typically the SFT model), and β > 0 controls the strength of the KL penalty.
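To make the trade-off concrete, here is a minimal sketch that evaluates this objective for a single prompt with three candidate responses. The probabilities, rewards, and function names are hypothetical values chosen for illustration; a real policy is a distribution over token sequences, not a three-element list.

```python
import math

def kl_divergence(pi, pi_ref):
    """KL[pi || pi_ref] for two discrete distributions over the same responses."""
    return sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)

def kl_regularized_objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r(y)] - beta * KL[pi || pi_ref] for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl_divergence(pi, pi_ref)

# Hypothetical toy setup: three candidate responses to one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]

# A policy that puts almost all mass on the highest-reward response.
greedy = [0.001, 0.998, 0.001]

for beta in (0.5, 2.0):
    obj_greedy = kl_regularized_objective(greedy, pi_ref, rewards, beta)
    obj_ref = kl_regularized_objective(pi_ref, pi_ref, rewards, beta)
    print(f"beta={beta}: greedy={obj_greedy:.3f}, reference={obj_ref:.3f}")
```

With the smaller β the near-greedy policy scores higher than staying at the reference; with the larger β the KL penalty outweighs the extra reward and the reference policy scores higher.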

DPO builds on a known result, that the optimal policy under this objective has a closed form:

π*(y|x) ∝ π_ref(y|x) * exp(r(x,y)/β)
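The closed form can be checked numerically on a toy problem. The sketch below normalizes π_ref(y|x) * exp(r(x,y)/β) over a hypothetical three-response set and compares the resulting objective value against randomly drawn candidate policies. All numbers and function names are illustrative assumptions.

```python
import math
import random

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z for a discrete response set."""
    unnormalized = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    z = sum(unnormalized)  # normalizing constant for this prompt
    return [u / z for u in unnormalized]

def objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r(y)] - beta * KL[pi || pi_ref]."""
    reward_term = sum(p * r for p, r in zip(pi, rewards))
    kl_term = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return reward_term - beta * kl_term

# Hypothetical toy setup: three candidate responses to one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

# Draw random policies on the same simplex and record the best value found.
rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    weights = [rng.random() for _ in pi_ref]
    total = sum(weights)
    candidate = [w / total for w in weights]
    best_random = max(best_random, objective(candidate, pi_ref, rewards, beta))

print(pi_star, best, best_random)
```

None of the random candidates exceeds the value attained by the closed-form policy, which equals β times the log of the normalizing constant.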

Taking logarithms and rearranging expresses the reward in terms of the optimal policy:

r(x,y) = β * log(π*(y|x)/π_ref(y|x)) + β * log Z(x)

where Z(x) = Σ_y π_ref(y|x) * exp(r(x,y)/β) is the partition function, which depends on x but not on y. Up to this prompt-dependent offset, the reward is β times the log-ratio of the optimal policy to the reference policy.
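A short numerical check of this rearrangement, again on a hypothetical three-response example: the β-scaled log-ratio differs from the true reward by the same amount for every response, so reward differences between responses are recovered exactly. The values below are illustrative assumptions.

```python
import math

# Hypothetical toy setup: three candidate responses to one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

# Closed-form optimal policy for these rewards.
unnormalized = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
z = sum(unnormalized)
pi_star = [u / z for u in unnormalized]

# Implicit reward: beta * log(pi*(y) / pi_ref(y)).
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]

# The gap between true and implicit reward is identical for every response.
offsets = [r - h for r, h in zip(rewards, implicit)]

# Differences between responses therefore match the true reward differences.
true_diff = rewards[1] - rewards[0]
implicit_diff = implicit[1] - implicit[0]

print(offsets, true_diff, implicit_diff)
```

Because only reward differences enter a pairwise preference model, the y-independent offset never has to be computed.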

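Preference data consists of triples (x, y_w, y_l): a prompt and two responses, where y_w was preferred over y_l. Intuitively, the preferred response should receive the higher reward:

r(x, y_w) > r(x, y_l)

The Bradley-Terry model makes this probabilistic: the probability that y_w is preferred over y_l is a logistic function of their reward difference. Substituting the reward formula into that difference cancels the y-independent term, so the preference probability depends only on policy and reference log-probabilities. Replacing π* with the trainable policy and maximizing the likelihood of the observed preferences gives the DPO objective, which widens the margin between the chosen and rejected log-ratios rather than simply raising the probability of the preferred response: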
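Written out, the substitution looks as follows. This is a sketch of the standard derivation, where σ is the logistic function, π_θ is the trainable policy, and D_pref is a name introduced here for the dataset of preference triples (x, y_w, y_l).

```latex
\begin{align*}
p(y_w \succ y_l \mid x)
  &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
  &= \sigma\!\left(
       \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
       - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \right) \\[6pt]
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta)
  &= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\mathrm{pref}}}
     \left[ \log \sigma\!\left(
       \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
       - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \right) \right]
\end{align*}
```

The y-independent term appears in both rewards and cancels in the difference. The loss is the negative log-likelihood of the observed preferences, with π_θ standing in for π*. The code that follows is a simplified implementation of this loss.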

# DPO loss (simplified)
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """
    Each argument is assumed to be a tensor of shape (batch,) holding the
    log-probability of a full response, summed over its tokens.

    policy_chosen: log probs from policy for chosen response
    policy_rejected: log probs from policy for rejected response
    ref_chosen: log probs from reference policy for chosen response
    ref_rejected: log probs from reference policy for rejected response
    """
    # Implicit rewards: beta-scaled log-ratios of policy to reference
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)

    # Negative log-sigmoid of the reward margin, minimized as the chosen reward
    # exceeds the rejected one; logsigmoid is numerically stable, so no epsilon
    # is needed inside the log
    loss = -F.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()
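To get a feel for the numbers, the following sketch evaluates the loss for a single preference pair using plain floats instead of tensors. It assumes the β-scaled form of the loss from the DPO derivation. The function name `dpo_loss_single` and the log-probability values are hypothetical.

```python
import math

def dpo_loss_single(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)); fine for the
    # moderate margins used in this example
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is zero.
at_init = dpo_loss_single(-12.0, -15.0, -12.0, -15.0)

# The policy raised the chosen log-prob by 2 nats and lowered the rejected one by 2.
improved = dpo_loss_single(-10.0, -17.0, -12.0, -15.0)

# The policy moved in the wrong direction by the same amounts.
worse = dpo_loss_single(-14.0, -13.0, -12.0, -15.0)

print(f"{at_init:.3f} {improved:.3f} {worse:.3f}")  # 0.693 0.513 0.913
```

The loss starts at log 2 ≈ 0.693 when the policy is initialized from the reference, which is a useful sanity check for a training run. Only the change relative to the reference matters: the absolute log-probabilities of the two responses do not affect the loss.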

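The appeal of DPO: policy optimization needs neither an explicitly trained reward model nor sampling from the policy during training. The implicit reward is the β-scaled log-ratio between the policy and the reference policy, so the reference model is still needed to score each response. This removes the reward-modeling stage and the RL sampling loop from the pipeline. It does not remove overfitting altogether, since the policy can still overfit the preference dataset.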