RUNLOCALAI v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
RLHF, DPO, and PPO

18. RRHF and IPO

Chapter 18 of 24 · 20 min
KEY INSIGHT

RRHF and IPO avoid PPO's complexity by turning alignment into a supervised problem: RRHF applies a ranking loss over sampled responses, and IPO applies a regression loss over preference pairs. IPO's explicit regularization keeps the policy close to the reference model, while RRHF's score-based ranking handles any number of responses per prompt.

RRHF (Rank Responses to Align Language Models with Human Feedback) and IPO (Identity Preference Optimization) are alternatives to PPO-based RLHF that learn from preferences without a reinforcement-learning loop.

RRHF: Score-Based Ranking

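RRHF uses the policy's own length-normalized log-probability as the score for each response, then trains those scores to follow the reward model's ranking while fine-tuning on the best response:

import torch

def rrhf_loss(model, prompt, responses, reward_model):
    """
    RRHF loss: make the policy's scores rank responses the way the
    reward model ranks them, plus a fine-tuning term on the best response.

    Assumes model.log_prob(prompt, resp) returns a scalar tensor holding the
    length-normalized log-probability of resp. If it returns a sum instead,
    divide by the token count first.
    """
    # Reward scores only define the ranking; no gradient flows through them
    rewards = torch.tensor([float(reward_model(prompt, r)) for r in responses])
    # Policy scores: one differentiable scalar per response
    scores = torch.stack([model.log_prob(prompt, r) for r in responses])

    # Pairwise hinge loss: penalize every pair where the lower-reward
    # response gets a higher policy score than the higher-reward one
    score_diff = scores.unsqueeze(0) - scores.unsqueeze(1)     # [i, j] = s_j - s_i
    reward_diff = rewards.unsqueeze(0) - rewards.unsqueeze(1)  # [i, j] = r_j - r_i
    rank_loss = torch.relu(score_diff[reward_diff < 0]).sum()

    # Fine-tuning term: raise the likelihood of the best-rewarded response
    sft_loss = -scores[rewards.argmax()]

    return rank_loss + sft_loss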
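
To make the ranking term concrete, here is a minimal pure-Python sketch on hand-picked numbers. It follows the pairwise hinge form used in the RRHF paper, max(0, s_j - s_i) for every pair where response j has the lower reward. The per-token log-probabilities, the rewards, and both helper names are hypothetical and chosen only for illustration.

```python
def length_normalized_score(token_log_probs):
    """Policy score for one response: mean per-token log-probability."""
    return sum(token_log_probs) / len(token_log_probs)

def rrhf_rank_loss(scores, rewards):
    """Sum max(0, s_j - s_i) over every pair where reward_j < reward_i."""
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[j] < rewards[i]:
                # Response j is worse, so its score should not exceed s_i
                loss += max(0.0, scores[j] - scores[i])
    return loss

# Hypothetical per-token log-probabilities for three sampled responses
token_log_probs = [
    [-1.0, -1.4],        # response 0, mean -1.2
    [-0.6, -0.8, -1.0],  # response 1, mean -0.8
    [-2.5, -1.5],        # response 2, mean -2.0
]
# Hypothetical reward-model scores for the same responses
rewards = [0.9, 0.1, 0.5]

scores = [length_normalized_score(lp) for lp in token_log_probs]
rank_loss = rrhf_rank_loss(scores, rewards)

# A policy whose scores already agree with the reward ranking pays nothing
agreeing_loss = rrhf_rank_loss([-0.5, -2.0, -1.0], rewards)
```

Here response 1 has the lowest reward but the highest policy score, so it is penalized against both response 0 (by 0.4) and response 2 (by 1.2). The pair (0, 2) is already ordered correctly and contributes nothing.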

IPO: Regularized Preference Optimization

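IPO replaces DPO's log-sigmoid loss with a squared loss that pulls the policy-vs-reference log-ratio gap toward a fixed target of 1 / (2 * beta). Because overshooting the target is penalized, the policy cannot drift arbitrarily far from the reference on pairs it already gets right:

def ipo_loss(model, prompt, chosen, rejected, beta=0.1):
    """
    Identity Preference Optimization.
    Regresses the log-ratio gap between chosen and rejected toward 1 / (2 * beta).
    """
    policy_logps = model.log_prob(prompt, chosen) - model.log_prob(prompt, rejected)
    # The reference model is frozen, so its log-probs carry no gradient
    reference_logps = (model.reference_log_prob(prompt, chosen)
                       - model.reference_log_prob(prompt, rejected)).detach()

    # Squared loss with an explicit target gap
    # Smaller beta means a larger target, i.e. weaker regularization
    loss = (policy_logps - reference_logps - 1.0 / (2.0 * beta)) ** 2

    return loss.mean()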
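
As a quick numeric check of the IPO objective as formulated in the IPO paper (a squared error between the policy-vs-reference log-ratio gap and the target 1 / (2 * beta)), the sketch below evaluates it on three hand-picked pairs. The function name `ipo_pair_loss` and all log-probability values are hypothetical; with `beta=0.1` the target gap is 5.

```python
def ipo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Squared distance between the log-ratio gap and the target 1 / (2 * beta)."""
    gap = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    target = 1.0 / (2.0 * beta)
    return (gap - target) ** 2

# Hypothetical sequence log-probabilities; the reference gap is 0.5 in every case
at_reference = ipo_pair_loss(-12.0, -12.5, -12.0, -12.5)  # policy == reference, gap 0
on_target = ipo_pair_loss(-10.0, -15.5, -12.0, -12.5)     # gap 5, exactly the target
overshoot = ipo_pair_loss(-8.0, -18.5, -12.0, -12.5)      # gap 10, past the target
```

The untrained policy and the overshooting policy pay the same loss, since both are 5 away from the target. That symmetry is what stops the gap from growing without bound.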

import torch.nn.functional as F

def dpo_loss(model, prompt, chosen, rejected, beta=0.1):
    """
    Direct Preference Optimization (DPO) for comparison.
    """
    policy_logps = model.log_prob(prompt, chosen) - model.log_prob(prompt, rejected)
    # The reference model is frozen, so its log-probs carry no gradient
    reference_logps = (model.reference_log_prob(prompt, chosen)
                       - model.reference_log_prob(prompt, rejected)).detach()

    # DPO is regularized only implicitly, through the reference log-ratio
    # logsigmoid avoids the overflow that log(sigmoid(x)) hits for very negative x
    loss = -F.logsigmoid(beta * (policy_logps - reference_logps))

    return loss.mean()
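
One way to see why DPO's regularization is only implicit is to look at the loss as a function of the log-ratio gap alone. The sketch below, in plain Python, computes -log(sigmoid(beta * gap)) and its derivative with respect to the gap. The helper names are hypothetical. The derivative is negative for every gap, so gradient descent always pushes the gap higher, just more slowly as it grows.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dpo_pair_loss(gap, beta=0.1):
    """-log(sigmoid(beta * gap)), written as a stable softplus."""
    x = -beta * gap
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_pair_grad(gap, beta=0.1):
    """Derivative of dpo_pair_loss with respect to gap."""
    return -beta * sigmoid(-beta * gap)

gaps = [-5.0, 0.0, 5.0, 50.0]
losses = [dpo_pair_loss(g) for g in gaps]
grads = [dpo_pair_grad(g) for g in gaps]
```

The loss keeps decreasing as the gap grows and never reaches a minimum at a finite gap, which is the behavior IPO's fixed target is designed to avoid.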

Comparing DPO and IPO Convergence

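import torch

def compare_convergence(steps=1000, lr=1e-6):
    """Compare DPO vs IPO on a simple task."""
    # Separate copies so each loss trains its own policy from the same start
    policies = {"dpo": (load_base_model(), dpo_loss),
                "ipo": (load_base_model(), ipo_loss)}
    optimizers = {name: torch.optim.AdamW(model.parameters(), lr=lr)
                  for name, (model, _) in policies.items()}

    # Track divergence from reference
    divergences = {"dpo": [], "ipo": []}

    for step in range(steps):
        # Same batch for both methods so the curves are comparable
        batch = sample_preference_batch()

        for name, (model, loss_fn) in policies.items():
            # Clear old gradients, backpropagate, then apply the update
            optimizers[name].zero_grad()
            loss_value = loss_fn(model, **batch)
            loss_value.backward()
            optimizers[name].step()

            # Track divergence after the update
            with torch.no_grad():
                kl = compute_kl_divergence(model, model.reference)
            divergences[name].append(float(kl))

    plot_convergence(divergences["dpo"], divergences["ipo"])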
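
The comparison above depends on helpers that are not defined in this chapter. As a self-contained stand-in, here is a toy version under strong assumptions: the policy chooses between exactly two responses, the reference is uniform, there is a single preference pair, and the only trainable quantity is the log-ratio gap. All names, the learning rate, and the step count are hypothetical. It is a sketch of the qualitative behavior, not a benchmark.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def kl_from_uniform(gap):
    """KL(policy || reference) when P(chosen) = sigmoid(gap) and the reference is uniform."""
    p_chosen, p_rejected = sigmoid(gap), sigmoid(-gap)
    return p_chosen * math.log(2 * p_chosen) + p_rejected * math.log(2 * p_rejected)

def train(grad_fn, steps=5000, lr=0.25):
    """Plain gradient descent on the gap, recording KL after every step."""
    gap, history = 0.0, []
    for _ in range(steps):
        gap -= lr * grad_fn(gap)
        history.append(kl_from_uniform(gap))
    return gap, history

beta = 0.1

def dpo_grad(gap):
    # Derivative of -log(sigmoid(beta * gap))
    return -beta * sigmoid(-beta * gap)

def ipo_grad(gap):
    # Derivative of (gap - 1 / (2 * beta)) ** 2
    return 2.0 * (gap - 1.0 / (2.0 * beta))

dpo_gap, dpo_kl = train(dpo_grad)
ipo_gap, ipo_kl = train(ipo_grad)
```

In this toy setting the IPO gap settles at its target of 1 / (2 * beta), while the DPO gap keeps climbing for as long as training runs, so the DPO policy ends further from the reference. Real models add sampling noise, many pairs, and shared parameters, so treat this only as intuition for what the KL curves are meant to reveal.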
EXERCISE

Implement both DPO and IPO loss functions and compare their behavior on a small dataset. Measure both preference accuracy and KL divergence from the reference model.
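
As a starting point for the two measurements, the sketch below computes preference accuracy and a Monte Carlo KL estimate from precomputed sequence log-probabilities. It assumes you have already scored each response under both the policy and the reference model. The function names and the sample numbers are hypothetical.

```python
def preference_accuracy(pairs):
    """
    Fraction of pairs where the implicit reward log(pi / pi_ref) of the
    chosen response exceeds that of the rejected response.
    Each pair is (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    """
    correct = sum(1 for pc, pr, rc, rr in pairs if (pc - rc) > (pr - rr))
    return correct / len(pairs)

def kl_estimate(policy_logps, ref_logps):
    """
    Monte Carlo estimate of KL(policy || reference).
    Both lists must score the same responses, sampled from the policy.
    """
    diffs = [p - r for p, r in zip(policy_logps, ref_logps)]
    return sum(diffs) / len(diffs)

# Hypothetical log-probabilities for four preference pairs
pairs = [
    (-10.0, -14.0, -11.0, -12.0),
    (-9.0, -8.0, -9.5, -9.0),
    (-20.0, -25.0, -21.0, -24.0),
    (-7.0, -7.5, -6.0, -8.0),
]
accuracy = preference_accuracy(pairs)

# Hypothetical log-probabilities of three responses sampled from the policy
kl = kl_estimate([-10.0, -9.0, -20.0], [-11.0, -9.5, -21.5])
```

The KL estimate is only meaningful when the responses come from the current policy. Scoring the dataset's chosen and rejected responses instead measures something different.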
