RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RLHF, DPO, and PPO

02. Preference Optimization Overview

Chapter 2 of 24 · 15 min
KEY INSIGHT

Preference optimization, in its standard pipeline form, separates the "what is good" question (reward modeling) from the "how to be good" question (policy optimization). Each component can then be trained, evaluated, and debugged independently. The cost is that errors in the reward model propagate to policy optimization with no intermediate correction.

Preference optimization refers to a family of techniques that train language models to produce outputs aligned with human preferences. It rests on the observation that humans can compare two outputs and say which is better, even when they could not write a perfect output themselves. Comparisons are typically easier to collect at scale than written demonstrations.

The standard preference optimization pipeline has three stages:

Stage 1: Supervised Fine-Tuning (SFT) starts from a pretrained base model and produces a model that can follow instructions. Without this stage, the model may not generate coherent responses at all, which makes preference learning inefficient. SFT uses human-written demonstrations or high-quality curated data.

Stage 2: Reward Model Training creates a neural network that takes a (prompt, response) pair and outputs a scalar score representing human preference. Training data consists of preference pairs: the same prompt, two different responses, and a human label indicating which is preferred. The reward model learns to score the preferred response higher than the rejected one.

Stage 3: Policy Optimization updates the language model to maximize reward. In PPO-based RLHF, the reward model scores sampled responses, and a KL-divergence penalty keeps the policy from drifting too far from the SFT model. DPO-style methods recast this as a classification or regression problem applied directly to the policy on preference pairs, so no explicit reward model is trained.
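
The two Stage 3 styles can be sketched numerically. The block below is a minimal illustration, not a training loop. It assumes per-response log-probabilities are already available, uses a single-sample estimate of the KL term for the PPO-style shaped reward, and uses the commonly published form of the DPO loss. The function names and the `beta` value of 0.1 are illustrative assumptions, not taken from this chapter.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_ref: float, beta: float = 0.1) -> float:
    """PPO-style shaped reward for one sampled response.

    (logp_policy - logp_ref) is a single-sample estimate of the KL term
    between the current policy and the SFT reference (an assumption of
    this sketch, not the only possible estimator).
    """
    return reward - beta * (logp_policy - logp_ref)


def dpo_loss(logp_policy_chosen: float, logp_policy_rejected: float,
             logp_ref_chosen: float, logp_ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO-style loss for one preference pair, with no reward model.

    The margin compares how much the policy has raised the chosen
    response relative to the reference versus the rejected response.
    """
    margin = beta * ((logp_policy_chosen - logp_ref_chosen)
                     - (logp_policy_rejected - logp_ref_rejected))
    return -log_sigmoid(margin)


# Hypothetical log-probabilities, for illustration only.
shaped = kl_penalized_reward(reward=1.0, logp_policy=-10.0, logp_ref=-12.0)
unchanged = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy equals reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)    # policy favors chosen more
print(shaped, unchanged, improved)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the DPO loss equals ln 2. The loss falls below that value once the policy raises the chosen response relative to the rejected one.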

# Preference data structure
preference_example = {
    "prompt": "What is Python used for?",
    "chosen": "Python is a versatile programming language commonly used for web development, data analysis, automation, and machine learning...",
    "rejected": "Python. Yeah. It's a thing. Used for stuff. Look it up."
}
# The rejected response is grammatically acceptable but low-quality
# This contrast is what drives learning
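
A record of this shape feeds the Stage 2 objective. The sketch below shows one common choice, assumed here rather than stated in this chapter: a Bradley-Terry style pairwise loss, which shrinks as the chosen response outscores the rejected one. `toy_reward` is a hypothetical stand-in that scores by word count only; a real reward model is a trained neural network.

```python
import math


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a learned reward model (word count only)."""
    return 0.1 * len(response.split())


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = r_chosen - r_rejected
    # Stable evaluation of log(1 + exp(-diff)) for either sign of diff.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Shortened, illustrative record in the same format as above.
example = {
    "prompt": "What is Python used for?",
    "chosen": "Python is a general-purpose language used for web "
              "development, data analysis, and automation.",
    "rejected": "Python. It's a thing. Used for stuff.",
}

r_chosen = toy_reward(example["prompt"], example["chosen"])
r_rejected = toy_reward(example["prompt"], example["rejected"])
loss = pairwise_loss(r_chosen, r_rejected)
print(f"chosen={r_chosen:.2f} rejected={r_rejected:.2f} loss={loss:.4f}")
```

Equal scores give a loss of ln 2, and the loss grows if the rejected response scores higher. A length-only scorer like this one is exactly the kind of shortcut a policy could later exploit, so it is shown only to make the arithmetic concrete.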

Each stage has distinct failure modes. SFT failures produce incoherent or off-topic responses. Reward model failures surface as reward hacking: the policy finds ways to game the reward signal without actually improving output quality. Policy optimization failures include mode collapse (all outputs become identical), reward collapse (all outputs get maximum reward regardless of quality), and catastrophic forgetting of capabilities.
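
Two of these policy-stage failures can be watched for with crude statistics over a batch of sampled outputs. The helpers below are a rough sketch under stated assumptions: the function names are hypothetical, the sample data is invented, and neither number is a standard metric or comes with a recommended threshold.

```python
from statistics import pstdev


def distinct_fraction(outputs: list[str]) -> float:
    """Share of unique strings in a batch; a low value hints at mode collapse."""
    return len(set(outputs)) / len(outputs)


def reward_spread(rewards: list[float]) -> float:
    """Population standard deviation of rewards in a batch.

    A spread near zero at a high mean suggests the reward no longer
    separates good outputs from bad ones.
    """
    return pstdev(rewards)


# Invented batches, for illustration only.
collapsed_outputs = ["Sure! Here is the answer."] * 4
varied_outputs = ["a", "b", "c", "a"]
saturated_rewards = [1.0, 1.0, 1.0]

print(distinct_fraction(collapsed_outputs))  # 0.25
print(distinct_fraction(varied_outputs))     # 0.75
print(reward_spread(saturated_rewards))      # 0.0
```

Tracking both values across training steps is more informative than any single reading, since what counts as low depends on the prompts and the sampling temperature.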

EXERCISE

Find a dataset of preference pairs, such as Anthropic's HH-RLHF or OpenAI's summarize-from-feedback dataset. Write code to compute the agreement rate between different annotators on the same pairs, then calculate what percentage of pairs have unanimous agreement versus mixed preferences. This requires more than one label per pair, so check that the release you pick includes per-annotator labels. The result tells you how noisy your training signal will be.
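
As a starting point, the two statistics can be computed on toy data before touching a real dataset. The labels below are invented, and the dictionary layout (pair id mapped to a list of votes) is an assumption of this sketch; real datasets will need their own loading and reshaping code.

```python
from collections import Counter

# Invented votes: pair id -> each annotator's preferred response ("A" or "B").
labels = {
    "pair-1": ["A", "A", "A"],
    "pair-2": ["A", "B", "A"],
    "pair-3": ["B", "B", "B"],
    "pair-4": ["A", "B", "B"],
}


def pairwise_agreement(votes: list[str]) -> float:
    """Fraction of annotator pairs that gave the same label for one item."""
    n = len(votes)
    agreeing = sum(c * (c - 1) // 2 for c in Counter(votes).values())
    return agreeing / (n * (n - 1) // 2)


def summarize(all_votes: dict[str, list[str]]) -> dict[str, float]:
    """Unanimous share and mean pairwise agreement over multiply-labeled items."""
    usable = [v for v in all_votes.values() if len(v) >= 2]
    unanimous = sum(1 for v in usable if len(set(v)) == 1)
    mean_agreement = sum(pairwise_agreement(v) for v in usable) / len(usable)
    return {
        "unanimous_fraction": unanimous / len(usable),
        "mean_pairwise_agreement": mean_agreement,
    }


stats = summarize(labels)
print(stats)
```

On this toy data, half the pairs are unanimous and the mean pairwise agreement is 2/3. Raw agreement does not correct for chance, so a chance-corrected statistic may be worth adding once the basic numbers are in hand.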
