03. DPO Theory

Chapter 3 of 24 · 20 min

Direct Preference Optimization (DPO) reframes the alignment problem to avoid training a separate reward model. The key insight: for any given reward function, the optimal policy of the KL-regularized RLHF objective can be written down analytically, and inverting that relationship yields a training objective defined directly on the policy.

The standard RLHF objective maximizes expected reward while penalizing deviation from a reference policy:

max_π E_{x~D, y~π(·|x)} [r(x,y)] - β * KL[π(y|x) || π_ref(y|x)]

where r(x,y) is the reward, π_ref is the reference policy (typically the SFT model), and β controls the KL penalty strength.

DPO shows that the optimal policy under this objective has a closed form:

π*(y|x) ∝ π_ref(y|x) * exp(r(x,y)/β)
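
The closed form can be checked numerically on a toy problem. The sketch below assumes a single prompt with only three possible responses, so expectations and the KL term become finite sums; the numbers in `pi_ref` and `rewards` and the helper names are made up for illustration. It builds π* from the formula above and confirms that no randomly drawn policy scores higher on the KL-regularized objective.

```python
import math
import random

def kl_regularized_objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for one prompt, finite response set."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

def closed_form_policy(pi_ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta); returns (pi*, Z)."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

# Hypothetical toy values: three candidate responses for one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = closed_form_policy(pi_ref, rewards, beta)
best = kl_regularized_objective(pi_star, pi_ref, rewards, beta)

# Compare against 1000 random policies on the same response set.
rng = random.Random(0)
num_better = 0
for _ in range(1000):
    raw = [rng.random() for _ in pi_ref]
    candidate = [v / sum(raw) for v in raw]
    if kl_regularized_objective(candidate, pi_ref, rewards, beta) > best + 1e-12:
        num_better += 1

print("pi* =", pi_star)
print("random policies that beat pi*:", num_better)
```

A smaller β lets π* move further from π_ref toward the highest-reward response, while a larger β keeps it close to the reference.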

This means we can rearrange to express the reward in terms of the optimal policy:

r(x,y) = β * log(π*(y|x)/π_ref(y|x)) + β * log Z(x)

where Z(x) = Σ_y π_ref(y|x) * exp(r(x,y)/β) is the partition function, which depends on x but not on y. Up to this prompt-dependent offset, the reward is β times the log-ratio of the optimal policy to the reference policy.
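
As a quick check of this inversion, the sketch below reuses a hypothetical three-response toy problem (all values are illustrative). It computes the β-scaled log-ratio of π* to π_ref and shows that it differs from the true reward only by an offset shared across responses, which in this setup equals β times the log of the normalizer.

```python
import math

# Hypothetical toy values: three candidate responses for one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

# Optimal policy from the closed form, with its normalizer z.
weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Beta-scaled log-ratio: the reward with the prompt-only offset missing.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]

# Adding the offset back recovers the original rewards.
offset = beta * math.log(z)
recovered = [v + offset for v in implicit]

print("implicit :", implicit)
print("recovered:", recovered)
```

Because the offset is the same for every response to a given prompt, differences between implicit rewards already match differences between true rewards, and pairwise preference data only ever depends on those differences.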

Preference data consists of a prompt x and two responses, together with a label saying which response was preferred. If response y_w is preferred over y_l, the reward should rank them accordingly:

r(x, y_w) > r(x, y_l)

DPO then substitutes the reward formula into the Bradley-Terry model, which assumes the probability of preferring y_w over y_l is a logistic function of their reward difference. Because only the difference matters, the Z(x) term cancels, leaving an expression that involves only the policy and the reference. The resulting training objective directly maximizes the likelihood of the observed preferences:

# DPO loss (simplified)
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """
    policy_chosen: log probs from policy for chosen response
    policy_rejected: log probs from policy for rejected response
    ref_chosen: log probs from reference policy for chosen response
    ref_rejected: log probs from reference policy for rejected response
    beta: KL penalty strength; scales the log ratios into implicit rewards
    """
    # Beta-scaled log ratios represent the learned (implicit) reward
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)

    # Negative log-sigmoid of the reward difference: shrinks as chosen outranks rejected.
    # F.logsigmoid is numerically stable, so no epsilon is needed inside the log.
    loss = -F.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()
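
The log probs passed to the loss are sequence-level quantities: for an autoregressive model, log π(y|x) is the sum of the per-token log-probabilities of the response tokens. The sketch below illustrates that bookkeeping with the standard library only. The helper names, the tiny three-token vocabulary, and the `response_mask` convention (0 for prompt positions, 1 for response positions) are assumptions made for this example; in practice the same computation would be done on batched tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_log_prob(token_logits, token_ids, response_mask):
    """Sum log p(token) over positions where response_mask is 1.

    token_logits[t] holds the logits used to predict token_ids[t];
    prompt positions (mask 0) are skipped so only the response is scored.
    """
    total = 0.0
    for logits, tok, keep in zip(token_logits, token_ids, response_mask):
        if keep:
            total += log_softmax(logits)[tok]
    return total

# Hypothetical toy input: vocabulary of 3 tokens, sequence of 3 positions,
# where the first position belongs to the prompt.
token_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token_ids = [1, 0, 1]
response_mask = [0, 1, 1]

logp = sequence_log_prob(token_logits, token_ids, response_mask)
print("sequence log-prob:", logp)
```

Running this once with the policy's logits and once with the reference model's logits, for both the chosen and the rejected response, would give the four inputs the loss expects.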

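The beauty of DPO: we never need to train a separate reward model, and we never need to sample from the policy during optimization. The policy's log-ratio against the reference acts as an implicit reward, so the preference data trains the policy directly. This removes a whole training stage, and with it the problems that come from fitting and then optimizing against a separately learned reward model.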
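
To make the implicit-reward view concrete, the following sketch evaluates the Bradley-Terry preference probability on scalar values using only the standard library. The function names and the specific log-prob values are illustrative assumptions. It shows two things: shifting both rewards by the same prompt-dependent constant leaves the preference probability unchanged, and a policy identical to the reference starts at probability 0.5, which corresponds to a per-pair loss of log 2.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def bradley_terry(reward_w, reward_l):
    """P(y_w preferred over y_l): logistic function of the reward gap."""
    return sigmoid(reward_w - reward_l)

def implicit_reward(policy_logp, ref_logp, beta):
    """beta * log(pi / pi_ref), with the prompt-only offset dropped."""
    return beta * (policy_logp - ref_logp)

# A shared offset on both rewards cancels in the difference.
p_plain = bradley_terry(2.0, 1.0)
p_shifted = bradley_terry(2.0 + 5.0, 1.0 + 5.0)

beta = 0.1

# Policy equal to the reference: both implicit rewards are 0.
p_init = bradley_terry(implicit_reward(-12.0, -12.0, beta),
                       implicit_reward(-15.0, -15.0, beta))
loss_init = -math.log(p_init)

# Policy has raised the chosen log-prob and lowered the rejected one.
p_trained = bradley_terry(implicit_reward(-10.0, -12.0, beta),
                          implicit_reward(-17.0, -15.0, beta))
loss_trained = -math.log(p_trained)

print("offset-invariant:", p_plain, p_shifted)
print("at initialization:", p_init, loss_init)
print("after improvement:", p_trained, loss_trained)
```

A starting loss close to log 2 ≈ 0.693 is therefore a useful sanity check when the policy is initialized from the reference model.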

EXERCISE

Implement the DPO loss from scratch using PyTorch. Verify that the loss gradient points in the right direction: compute gradients for a simple case and check that a gradient descent step increases the log-probability of the chosen response relative to the rejected response. Test with random logits and confirm that the loss decreases as the policy's log-ratio margin over the reference increasingly favors the chosen response.