Preference Optimization Overview — RLHF, DPO, and PPO (Chapter 2)

Preference optimization refers to a family of techniques that train language models to produce outputs aligned with human preferences. The key insight is that humans can compare two outputs and say which is better, even if they cannot write a perfect output themselves. This comparison signal is easier to obtain and more scalable than demonstration data.

The standard preference optimization pipeline has three stages:

Stage 1: Supervised Fine-Tuning (SFT) adapts a pretrained model so that it can follow instructions. Without this stage, the model may not generate coherent responses at all, which makes preference learning inefficient. SFT uses human-written demonstrations or high-quality curated data.

Stage 2: Reward Model Training creates a neural network that takes a (prompt, response) pair and outputs a scalar score representing human preference. Training data consists of preference pairs: the same prompt, two different responses, and a human label indicating which is preferred. The reward model learns to score the preferred response higher than the rejected one.

Stage 3: Policy Optimization updates the language model to maximize reward. In PPO-based RLHF, the reward model scores responses sampled from the policy, and a KL-divergence penalty keeps the policy from drifting too far from the SFT model. DPO-style methods instead recast the objective as a classification or regression problem defined directly on the policy and the preference pairs, so no separate reward model is trained.
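The objectives behind Stages 2 and 3 can be sketched on plain floats. This is a minimal illustration, not a training implementation: it assumes the commonly used Bradley-Terry pairwise loss for the reward model, a per-sample log-probability ratio as the KL term, and the usual DPO formulation. The function names and the default `beta=0.1` are hypothetical choices made for the example.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the chosen response is scored well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(reward: float, logp_policy: float, logp_ref: float,
                        beta: float = 0.1) -> float:
    """Reward-model score minus beta times the policy/reference log-ratio.

    The log-ratio is a single-sample estimate of the KL term, so the
    penalty grows as the policy moves away from the SFT (reference) model.
    """
    return reward - beta * (logp_policy - logp_ref)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities.

    beta * log-ratio plays the role of an implicit reward, so the same
    pairwise loss is reused without a separate reward model.
    """
    implicit_chosen = beta * (logp_chosen - ref_logp_chosen)
    implicit_rejected = beta * (logp_rejected - ref_logp_rejected)
    return pairwise_preference_loss(implicit_chosen, implicit_rejected)
```

When the policy assigns the same log-probabilities as the reference model, both implicit rewards are zero and `dpo_loss` equals log 2, the same value `pairwise_preference_loss` gives for two equal scores. Raising the policy's log-probability of the chosen response relative to the reference lowers the loss.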

# Preference data structure
preference_example = {
    "prompt": "What is Python used for?",
    "chosen": "Python is a versatile programming language commonly used for web development, data analysis, automation, and machine learning...",
    "rejected": "Python. Yeah. It's a thing. Used for stuff. Look it up."
}
# The rejected response is on topic but terse and uninformative
# The quality gap between the two responses is the training signal
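A record in this shape could be checked and then unpacked into the two (prompt, response) inputs that a reward model scores. The sketch below assumes only the three keys shown above; `validate_preference_example` and `to_scoring_pairs` are hypothetical helper names, not part of any library.

```python
from typing import Dict, Tuple

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_example(example: Dict[str, str]) -> None:
    """Raise ValueError if the record cannot be used as a preference pair."""
    missing = [key for key in REQUIRED_KEYS if key not in example]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if example["chosen"] == example["rejected"]:
        # Identical responses carry no preference signal.
        raise ValueError("chosen and rejected responses are identical")


def to_scoring_pairs(
    example: Dict[str, str],
) -> Tuple[Tuple[str, str], Tuple[str, str]]:
    """Return ((prompt, chosen), (prompt, rejected)) for reward-model scoring."""
    validate_preference_example(example)
    prompt = example["prompt"]
    return (prompt, example["chosen"]), (prompt, example["rejected"])
```

Both pairs share the same prompt, so any difference in the reward model's scores reflects the responses alone.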

Each stage has distinct failure modes. SFT failures produce incoherent or off-topic responses. Reward model failures manifest as reward hacking: the policy finds ways to score well on the learned reward signal without actually improving output quality. Policy optimization failures include mode collapse (all outputs become identical), reward collapse (all outputs get maximum reward regardless of quality), and catastrophic forgetting of capabilities.
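Two of the policy optimization failures above can be watched for with simple statistics over a batch of sampled responses and their reward scores. The helpers below are a rough sketch under that assumption. The names are hypothetical, exact string matching is a crude stand-in for output similarity, and any alerting threshold would have to be chosen per task.

```python
import statistics
from typing import Sequence


def distinct_ratio(samples: Sequence[str]) -> float:
    """Fraction of sampled responses that are unique.

    Values near 1/len(samples) suggest mode collapse: the policy is
    returning the same text regardless of sampling randomness.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return len(set(samples)) / len(samples)


def reward_spread(rewards: Sequence[float]) -> float:
    """Population standard deviation of reward scores in a batch.

    A spread near zero combined with a very high mean is one possible
    sign that rewards no longer distinguish good outputs from bad ones.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    return statistics.pstdev(rewards)
```

Tracking these values over the course of training, alongside the KL divergence from the SFT model, gives an early signal before degraded outputs show up in manual review.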