RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
RLHF, DPO, and PPO

09. PPO Theory

Chapter 9 of 24 · 20 min
KEY INSIGHT

PPO's clipping is a conservative mechanism that removes the incentive for the policy to "jump" to a new distribution in a single step. This is crucial when the reward landscape is noisy, as it typically is with learned reward models. The KL penalty serves a similar purpose but operates as a soft constraint on divergence from a reference policy, whereas clipping caps how much a single update can gain from moving the probability ratio.

Proximal Policy Optimization (PPO) is the workhorse algorithm behind most production RLHF systems. It directly optimizes a policy to maximize expected reward while constraining updates to avoid catastrophic policy degradation.

The PPO objective extends the standard policy gradient with a clipped surrogate objective:

import torch

def ppo_objective(ratio, advantages, epsilon=0.2):
    """
    ratio: pi_theta(a|s) / pi_theta_old(a|s) - probability ratio
    advantages: estimated advantage of taking action a in state s
    epsilon: clipping parameter (typically 0.1-0.2)
    """
    # Unclipped objective
    unclipped = ratio * advantages

    # Clipped objective - prevents large updates
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    clipped = clipped_ratio * advantages

    # Take the elementwise minimum of clipped and unclipped.
    # This gives a pessimistic lower bound on the unclipped surrogate.
    # The sign is flipped because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()

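The clipping prevents the policy from changing too much in a single update: once the ratio leaves [1 - epsilon, 1 + epsilon] in the direction the advantage favors, the clipped term is selected and that sample stops contributing gradient. Without clipping, large policy updates can collapse the policy to a degenerate distribution.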
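
To make the clipping behavior concrete, here is a small sketch in plain Python that evaluates the per-sample surrogate on scalars. The helper names `clipped_surrogate` and `clip_is_active` are illustrative and not part of any library; the four test points are arbitrary.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO surrogate (to be maximized) for scalar inputs."""
    clipped_ratio = max(1 - epsilon, min(ratio, 1 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def clip_is_active(ratio, advantage, epsilon=0.2):
    """True when the clipped branch is selected, so the value no longer depends on ratio."""
    return (advantage > 0 and ratio > 1 + epsilon) or (advantage < 0 and ratio < 1 - epsilon)


# Arbitrary test points: ratio above/below the clip range, advantage positive/negative
for ratio, advantage in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    value = clipped_surrogate(ratio, advantage)
    print(f"ratio={ratio}, advantage={advantage:+}: surrogate={value:+.2f}, "
          f"clipped={clip_is_active(ratio, advantage)}")
```

Note the asymmetry: clipping only applies when the ratio has already moved in the direction the advantage rewards. A ratio that moved the wrong way (for example 0.5 with a positive advantage) stays unclipped, so the policy can still be corrected.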

KL penalty approach: An alternative to clipping is adding a KL penalty to the objective:

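import torch

def ppo_objective_with_kl(log_probs, ref_log_probs, rewards, beta=0.1):
    """
    log_probs: current policy log probs of the sampled responses (requires grad)
    ref_log_probs: reference (SFT) policy log probs of the same responses
    rewards: reward model scores (constants, carry no gradient)
    beta: KL coefficient
    """
    # Per-sample log ratio: a single-sample estimate of KL(policy || reference)
    kl = log_probs - ref_log_probs

    # KL-shaped reward, treated as a fixed signal for this update
    shaped_rewards = (rewards - beta * kl).detach()

    # Reward scores have no gradient of their own, so weight the log probs
    # by the shaped reward (score-function / REINFORCE surrogate)
    loss = -(shaped_rewards * log_probs).mean()
    return loss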

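TRL's PPOTrainer applies a KL penalty of this kind against the reference model, folding it into the reward while still using the clipped surrogate for the policy update. DPO optimizes the same KL-regularized reward objective: its implicit reward, beta * log(pi_theta / pi_ref), is the reward under which the current policy would be the exact optimum of that objective, so the two approaches are closely related rather than interchangeable.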
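
The KL term in the code above is a per-sample log ratio rather than a full KL divergence. Averaged over actions sampled from the current policy, it estimates KL(policy || reference). The sketch below checks this on a toy categorical distribution in plain Python; the two distributions and the helper names are made up for illustration.

```python
import math
import random


def exact_kl(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sampled_kl(p, q, num_samples=20000, seed=0):
    """Monte Carlo estimate of KL(p || q): mean of log p(a) - log q(a) over a ~ p."""
    rng = random.Random(seed)
    actions = rng.choices(range(len(p)), weights=p, k=num_samples)
    return sum(math.log(p[a]) - math.log(q[a]) for a in actions) / num_samples


policy = [0.7, 0.2, 0.1]      # hypothetical current policy over 3 actions
reference = [0.5, 0.3, 0.2]   # hypothetical reference (SFT) policy

print(f"exact KL:   {exact_kl(policy, reference):.4f}")
print(f"sampled KL: {sampled_kl(policy, reference):.4f}")
```

Individual log ratios can be negative even though the true KL never is, which is why per-token KL values in training logs sometimes dip below zero.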

The PPO algorithm flow:

# Simplified PPO training step (sequence-level rewards, no critic)
import torch

def ppo_step(response_log_probs, old_log_probs, response_rewards,
             ref_log_probs, epsilon=0.2, beta=0.1):
    # 1. Compute probability ratio against the policy that sampled the responses
    ratio = torch.exp(response_log_probs - old_log_probs.detach())

    # 2. Compute advantages (reward model scores, normalized across the batch)
    advantages = (response_rewards - response_rewards.mean()) / (response_rewards.std() + 1e-8)

    # 3. Compute clipped surrogate terms
    unclipped = ratio * advantages
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    clipped = clipped_ratio * advantages

    # 4. Add KL penalty for reference model (sample estimate from the log ratio).
    #    The sign is positive: minimizing the loss must shrink the KL, not grow it.
    kl = response_log_probs - ref_log_probs.detach()
    kl_loss = beta * kl.mean()

    # 5. Total loss
    total_loss = -torch.min(unclipped, clipped).mean() + kl_loss

    return total_loss
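
Step 2 of the flow normalizes the reward scores before using them as advantages. A minimal sketch of that whitening in plain Python follows, assuming zero-mean, unit-variance scaling across the batch; the chapter does not spell out the exact normalization, and `whiten` is an illustrative name.

```python
import statistics


def whiten(scores, eps=1e-8):
    """Shift scores to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]


reward_scores = [2.0, 0.5, -1.0, 0.5]   # hypothetical reward model outputs
advantages = whiten(reward_scores)
print([round(a, 3) for a in advantages])
```

Whitening makes the update scale independent of the reward model's raw output range: responses scoring above the batch mean get positive advantages and the rest get negative ones. This sketch uses the population standard deviation, while `torch.Tensor.std()` defaults to the sample standard deviation, so the numbers differ slightly on small batches.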

Value function: PPO typically uses a value function (critic) to estimate expected returns and reduce variance. In TRL's implementation, this is handled internally, but understanding it helps with debugging.
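
One common way to turn critic estimates into advantages is generalized advantage estimation (GAE). The chapter does not say which estimator it has in mind, so treat the following as an illustrative sketch in plain Python, with a hypothetical three-step trajectory, rather than a description of any particular trainer.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """
    rewards: per-step rewards for one trajectory
    values:  critic estimates V(s_t) for the same steps
    Returns one advantage per step; the value after the last step is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD error at step t
        running = delta + gamma * lam * running               # discounted sum of TD errors
        advantages[t] = running
    return advantages


# Hypothetical 3-step response: the reward arrives only at the final step
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))
```

With `lam=1.0` the advantage reduces to return minus value (low bias, high variance). With `lam=0.0` it is the one-step TD error (low variance, more reliant on the critic being accurate). When debugging, a critic that is far off shows up as advantages with a large, persistent offset.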

EXERCISE

Implement a minimal PPO training loop in PyTorch without using TRL. Use a simple environment (like a bandit with known reward distribution) to verify your implementation. Check that the policy improves over time and that the KL divergence from the initial policy stays bounded. Plot the KL divergence over training steps to see how clipping affects the update magnitude.
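
As a possible starting point for the exercise, the scaffold below provides a bandit environment and a KL helper in plain Python, leaving the PPO loop itself to you. The class and function names, the arm means, and the noise level are all illustrative assumptions.

```python
import math
import random


class GaussianBandit:
    """K-armed bandit; arm i pays a reward drawn from Normal(means[i], noise_std)."""

    def __init__(self, means, noise_std=1.0, seed=0):
        self.means = list(means)
        self.noise_std = noise_std
        self.rng = random.Random(seed)

    def pull(self, arm):
        return self.rng.gauss(self.means[arm], self.noise_std)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_kl(p, q):
    """KL(p || q) between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


bandit = GaussianBandit(means=[0.1, 0.5, 0.9])   # hypothetical arm means
initial_policy = softmax([0.0, 0.0, 0.0])        # uniform starting policy
kl_history = []   # append categorical_kl(current_policy, initial_policy) after each update
```

Because the arm means are known, you can check that the policy's probability mass moves toward the best arm, and `kl_history` gives you the series to plot.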
