RUNLOCALAIv38
OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
RLHF, DPO, and PPO

14. Iterated Training

Chapter 14 of 24 · 20 min
KEY INSIGHT

Iterated training creates feedback loops between model behavior and data collection. The model shapes what data gets collected, which shapes the next model. Managing this cycle—preventing the model from collapsing into narrow patterns—requires explicit monitoring and intervention at each iteration.

Alignment training rarely succeeds in a single pass. Iterated training—cycling between training and evaluation—allows progressive refinement of model behavior.
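One way to watch for the narrowing into repetitive patterns is to track output diversity from one iteration to the next. The following is a minimal sketch, not a prescribed method: it uses the distinct-n ratio (unique n-grams divided by total n-grams) over sampled outputs. The function names, the whitespace tokenization, and the `max_drop` threshold are all assumptions made for illustration.

```python
from collections import Counter


def distinct_n(samples, n=2):
    """Fraction of n-grams across all samples that are unique (0.0 to 1.0)."""
    ngrams = Counter()
    for text in samples:
        tokens = text.split()  # whitespace tokens: a simplifying assumption
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


def collapse_warning(prev_samples, curr_samples, n=2, max_drop=0.2):
    """True if diversity fell by more than max_drop (absolute) between iterations."""
    return distinct_n(prev_samples, n) - distinct_n(curr_samples, n) > max_drop
```

Sampling a fixed prompt set after every iteration and comparing the scores gives a cheap early signal; a sharp drop suggests the model is converging on a few stock responses.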

The Training Loop Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   SFT on    │    │   Reward    │    │             │
│ high-quality│───▶│    model    │───▶│   PPO/DPO   │
│    data     │    │  training   │    │  training   │
└─────────────┘    └─────────────┘    └─────────────┘
       ▲                                     │
       │         ┌─────────────┐             │
       └─────────│  Evaluation │◀────────────┘
                 │   & Filter  │
                 └─────────────┘
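The cycle in the diagram can be sketched as a plain loop. This is an illustrative skeleton under stated assumptions, not a definitive implementation: `run_iterated_training`, the `stages` dictionary, `evaluate`, and `keep` are hypothetical names, and the toy stand-ins treat a "model" as a single number so the example runs end to end.

```python
def run_iterated_training(model, data, stages, evaluate, keep, num_iterations=3):
    """Cycle SFT -> reward model -> policy optimization -> evaluate and filter.

    `stages` holds three hypothetical callables: "sft", "reward", "policy".
    `evaluate` returns a metrics dict; `keep` decides which examples survive.
    """
    history = []
    for it in range(num_iterations):
        model = stages["sft"](model, data)
        reward_model = stages["reward"](model, data)
        model = stages["policy"](model, reward_model, data)
        metrics = evaluate(model)
        history.append({"iteration": it, **metrics})
        # Feedback edge of the diagram: filtered data feeds the next SFT pass
        data = [ex for ex in data if keep(model, ex)]
    return model, data, history


# Toy stand-ins so the sketch runs; a "model" is just a number here.
toy_stages = {
    "sft": lambda m, d: m + 0.1 * len(d),
    "reward": lambda m, d: (lambda ex: ex["quality"]),
    "policy": lambda m, rm, d: m + sum(rm(ex) for ex in d),
}

model, data, history = run_iterated_training(
    model=0.0,
    data=[{"quality": 0.9}, {"quality": 0.2}, {"quality": 0.7}],
    stages=toy_stages,
    evaluate=lambda m: {"score": round(m, 2)},
    keep=lambda m, ex: ex["quality"] >= 0.5,
    num_iterations=3,
)
```

Passing the stages in as callables keeps the loop itself independent of any particular training library, which makes the feedback edge (evaluation output becoming the next iteration's input) easy to see and to test.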

Iteration Monitoring

Track metrics across iterations to detect degradation:

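class TrainingIterator:
    """Runs one SFT -> reward model -> PPO cycle per step and tracks metrics.

    sft_train, reward_train, ppo_train, and evaluate_alignment are placeholders
    for project-specific functions; they are not defined here.
    """

    def __init__(self, base_model, config):
        self.model = base_model
        self.config = config
        self.iteration = 0
        self.metrics_history = []

    def step(self, training_data):
        # Train for one iteration. The reward model is kept separate from the
        # policy so that reward training does not overwrite self.model.
        # ppo_train is assumed here to take (policy, reward_model, data).
        policy = sft_train(self.model, training_data)
        reward_model = reward_train(policy, training_data)
        self.model = ppo_train(policy, reward_model, training_data)

        # Evaluate
        metrics = evaluate_alignment(self.model)
        self.metrics_history.append(metrics)

        # Check for divergence
        if self.detect_degradation():
            print(f"WARNING: Degradation detected at iteration {self.iteration}")
            # Trigger rollback or intervention

        self.iteration += 1
        return self.model

    def detect_degradation(self):
        # At least three iterations are needed to judge a trend
        if len(self.metrics_history) < 3:
            return False

        # Reward model accuracy has stayed below 0.6 for three iterations
        recent = self.metrics_history[-3:]
        if all(m["reward_accuracy"] < 0.6 for m in recent):
            return True

        # Capability regression: task accuracy fell by more than 0.05
        # compared with two iterations ago
        latest = self.metrics_history[-1]["task_accuracy"]
        baseline = self.metrics_history[-3]["task_accuracy"]
        if latest < baseline - 0.05:
            return True

        return False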
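The listing leaves "rollback or intervention" open. One possible shape for the rollback half is sketched below, assuming models can be copied in memory with `copy.deepcopy`. `CheckpointKeeper` and its parameters are hypothetical names; the default tolerance of 0.05 simply mirrors the task-accuracy margin used above.

```python
import copy


class CheckpointKeeper:
    """Keeps a copy of the best model seen so far, ranked by one metric."""

    def __init__(self, metric="task_accuracy", tolerance=0.05):
        self.metric = metric
        self.tolerance = tolerance
        self.best_model = None
        self.best_value = float("-inf")

    def update(self, model, metrics):
        """Return the model to carry forward: the new one or the best checkpoint."""
        value = metrics[self.metric]
        if value >= self.best_value:
            # New best: remember it and keep going
            self.best_model = copy.deepcopy(model)
            self.best_value = value
            return model
        if self.best_value - value > self.tolerance:
            # Regression beyond tolerance: roll back to the best checkpoint
            return copy.deepcopy(self.best_model)
        return model


keeper = CheckpointKeeper()
m1 = keeper.update({"w": 1}, {"task_accuracy": 0.80})  # new best, kept
m2 = keeper.update({"w": 2}, {"task_accuracy": 0.78})  # small dip, kept
m3 = keeper.update({"w": 3}, {"task_accuracy": 0.70})  # large dip, rolled back
```

For real model weights, an in-memory deep copy is likely too expensive; saving checkpoints to disk and reloading the best one would follow the same logic.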

Early Stopping Criteria

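Not every change in the metrics indicates failure; some iterations improve alignment without improving capabilities. Stop only when the alignment target is reached, capabilities regress, training becomes unstable, or gains flatten out: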

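def should_continue_training(iteration, metrics, metrics_history):
    """Return (should_continue, reason).

    metrics_history lists per-iteration metrics dicts, oldest first.
    iteration is not used by these checks; it is kept for the caller's logging.
    """
    # Stop once the alignment target is reached
    if metrics["alignment_score"] > 0.95:
        return False, "Alignment target reached"

    # Stop if capabilities degrade significantly
    if metrics["task_accuracy"] < 0.70:
        return False, "Capability regression"

    # Stop if training becomes unstable
    if metrics["reward_var"] > 2.0:
        return False, "Training instability"

    # Stop on diminishing returns: less than 0.01 alignment gain
    # between the latest entry and the fifth-most-recent one
    if len(metrics_history) > 5:
        recent_improvement = (
            metrics_history[-1]["alignment_score"]
            - metrics_history[-5]["alignment_score"]
        )
        if recent_improvement < 0.01:
            return False, "Diminishing returns"

    return True, "Continue training"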
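To make the diminishing-returns rule concrete, here it is in isolation on a made-up score sequence. This is a sketch only: `has_plateaued`, `window`, and `min_gain` are assumed names, and the scores are invented for illustration rather than measured results.

```python
def has_plateaued(scores, window=5, min_gain=0.01):
    """True if the score gained less than min_gain over the last `window` entries."""
    if len(scores) < window:
        return False
    return scores[-1] - scores[-window] < min_gain


# Invented alignment scores: fast early gains, then a flat tail
simulated = [0.60, 0.70, 0.78, 0.80, 0.803, 0.804, 0.805, 0.805]

scores = []
stopped_at = None
for i, score in enumerate(simulated):
    scores.append(score)
    if has_plateaued(scores):
        stopped_at = i
        break
```

The check compares the newest score with the one `window - 1` steps earlier, so a single noisy iteration does not end training; only a sustained flat stretch does.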

Data Reweighting Across Iterations

Later iterations should weight data differently as the model matures:

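def compute_iteration_weights(examples, iteration):
    # compute_quality_weights is a placeholder helper; it is assumed to return
    # one base weight per example, in the same order as `examples`
    base_weights = compute_quality_weights(examples)

    # Both multipliers are set in every branch so neither is left undefined
    if iteration < 3:
        # Early iterations: focus on basic safety
        safety_multiplier, helpfulness_multiplier = 2.0, 1.0
    elif iteration < 6:
        # Middle iterations: balance safety and helpfulness
        safety_multiplier, helpfulness_multiplier = 1.0, 1.0
    else:
        # Late iterations: emphasize nuanced responses
        # (a neutral safety multiplier of 1.0 is assumed here)
        safety_multiplier, helpfulness_multiplier = 1.0, 1.5

    # Start from the quality weight, then apply iteration-specific adjustments
    for ex, base in zip(examples, base_weights):
        ex["weight"] = base
        if ex["safety_critical"]:
            ex["weight"] *= safety_multiplier
        if ex["nuanced"]:
            ex["weight"] *= helpfulness_multiplier

    return examples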
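Once each example carries a weight, one common way to use it is as a sampling probability when building training batches. The sketch below assumes that usage; `normalize_weights` and `sample_batch` are hypothetical helper names, and only the standard library's `random.Random.choices` is relied on.

```python
import random


def normalize_weights(examples):
    """Return sampling probabilities proportional to each example's weight."""
    total = sum(ex["weight"] for ex in examples)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [ex["weight"] / total for ex in examples]


def sample_batch(examples, batch_size, seed=None):
    """Draw a batch with replacement, favoring higher-weight examples."""
    rng = random.Random(seed)
    probs = normalize_weights(examples)
    return rng.choices(examples, weights=probs, k=batch_size)


weighted = [
    {"id": "a", "weight": 2.0},  # e.g. a safety-critical example in an early iteration
    {"id": "b", "weight": 1.0},
    {"id": "c", "weight": 1.0},
]
probs = normalize_weights(weighted)
batch = sample_batch(weighted, batch_size=4, seed=0)
```

An alternative is to keep uniform sampling and multiply each example's loss by its weight; which option works better depends on the training setup.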
EXERCISE

Implement a simple iterated training loop that trains for 3 iterations, monitoring reward model accuracy and task performance. Visualize how metrics change across iterations.
