Alignment Evaluation — RLHF, DPO, and PPO (Chapter 15)

Evaluating alignment is fundamentally harder than evaluating capabilities because alignment involves human values, which are contested and context-dependent.

Evaluation Metrics Taxonomy

Automated Metrics:

Reward model scores (proxy for human preference)
Classifier-based safety scores
Response length and format consistency

Human Metrics:

Preference rankings (direct comparison)
Likert scales (quality ratings)
Adversarial probing (safety testing)

Behavioral Metrics:

Refusal rates on benign requests
Response quality on edge cases
Consistency under reframing attacks
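
To make the last behavioral metric concrete, the following is a minimal sketch of one way to quantify consistency under reframing. It assumes each base prompt has been rewritten into several reframings and that each response has already been reduced to a decision label such as "refuse" or "comply". The function name `reframing_consistency` and the input layout are illustrative assumptions, not part of the chapter's code.

```python
from collections import Counter


def reframing_consistency(decisions_by_prompt):
    """Measure how stable the model's decision is across reframings.

    decisions_by_prompt maps a base-prompt id to the list of decision
    labels observed for its reframings (hypothetical layout).
    Returns (fraction of groups with identical decisions,
             per-group share of the majority decision).
    """
    if not decisions_by_prompt:
        raise ValueError("need at least one prompt group")

    fully_consistent = 0
    agreement = {}
    for prompt_id, decisions in decisions_by_prompt.items():
        if not decisions:
            raise ValueError(f"no decisions recorded for {prompt_id!r}")
        # Size of the largest group of identical decisions.
        majority_count = Counter(decisions).most_common(1)[0][1]
        agreement[prompt_id] = majority_count / len(decisions)
        if majority_count == len(decisions):
            fully_consistent += 1

    return fully_consistent / len(decisions_by_prompt), agreement
```

A group with agreement below 1.0 indicates that at least one reframing flipped the model's decision, which is usually worth inspecting by hand.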

Implementing Preference Evaluation

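def evaluate_preference_alignment(model, eval_pairs):
    """Return the fraction of pairs where the human-chosen response scores higher."""
    correct = 0
    total = 0

    for pair in eval_pairs:
        prompt = pair["prompt"]
        human_preferred = pair["chosen"]
        human_rejected = pair["rejected"]

        # Score both responses. reward_model is assumed to be defined elsewhere
        # and to return a scalar score (higher means more preferred).
        score_chosen = reward_model(model, prompt, human_preferred)
        score_rejected = reward_model(model, prompt, human_rejected)

        # Count agreement with the human label; ties count as disagreement.
        if score_chosen > score_rejected:
            correct += 1
        total += 1

    # Guard against an empty evaluation set instead of dividing by zero.
    if total == 0:
        raise ValueError("eval_pairs must contain at least one pair")

    return correct / total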

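def evaluate_safety_behavior(model, test_prompts):
    """Count safety-related outcomes, with per-category totals for computing rates.

    is_benign, is_jailbreak, is_harmful, model_refused, and
    handled_appropriately are assumed to be defined elsewhere.
    """
    results = {
        "benign_refusals": 0,           # benign prompt was refused (over-refusal)
        "benign_total": 0,
        "jailbreak_successes": 0,       # jailbreak prompt was not refused
        "jailbreak_total": 0,
        "harmful_request_handling": 0,  # harmful prompt was handled appropriately
        "harmful_total": 0,
        "total": len(test_prompts),
    }

    for prompt in test_prompts:
        response = model.generate(prompt)
        refused = model_refused(response)

        if is_benign(prompt):
            results["benign_total"] += 1
            if refused:
                results["benign_refusals"] += 1

        if is_jailbreak(prompt):
            results["jailbreak_total"] += 1
            if not refused:
                results["jailbreak_successes"] += 1

        if is_harmful(prompt):
            results["harmful_total"] += 1
            if handled_appropriately(response):
                results["harmful_request_handling"] += 1

    return results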
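
Both functions above produce point estimates (an agreement fraction and raw counts), which can be misleading on small evaluation sets. One common way to express the uncertainty is a Wilson score interval around each proportion. The sketch below is an illustration under that assumption; the name `wilson_interval` and the default `z = 1.96` (roughly a 95% interval) are choices made here, not part of the chapter's code.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes / total is the observed rate, for example the number of
    pairs where the model agreed with the human label over all pairs.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")

    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Example: 80 agreements out of 100 evaluation pairs.
low, high = wilson_interval(80, 100)
print(f"agreement rate 0.80, interval [{low:.3f}, {high:.3f}]")
```

The same helper applies to any of the safety counts as long as the denominator is the number of prompts in that category rather than the size of the whole test set.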

A/B Testing for Alignment Changes

Compare two model versions on human preference:

# Collect paired comparisons
python collect_preference_data.py \
    --model-a base_model \
    --model-b aligned_model \
    --prompts eval_prompts.json \
    --output comparison_results.json

# Analyze statistical significance
python analyze_preference.py \
    --data comparison_results.json \
    --min-samples 200
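
The internals of `analyze_preference.py` are not shown here, so the following is only a sketch of one standard way to test such paired comparisons: an exact two-sided sign test on the win counts, with ties discarded. The function name `sign_test_p_value` and the choice of test are assumptions made for illustration.

```python
import math


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test for paired preference comparisons.

    Null hypothesis: raters are equally likely to prefer either model.
    Ties should be dropped before counting wins.
    """
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("need at least one non-tied comparison")

    k = max(wins_a, wins_b)
    # Upper-tail probability P(X >= k) for X ~ Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Example: model B preferred in 120 of 200 non-tied comparisons.
p_value = sign_test_p_value(wins_a=80, wins_b=120)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the preference for one model is unlikely to be due to chance alone, but it says nothing about how large or how meaningful the difference is, so the win rate should be reported alongside it.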

Red-Teaming Evaluations

Adversarial evaluation deliberately searches for prompts that expose alignment failures:

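def red_team_evaluation(model, attack_budget=1000):
    """Run automated red-teaming, splitting the attack budget evenly across categories.

    ATTACK_CATEGORIES, generate_attack, classify_severity, and
    aggregate_results are assumed to be defined elsewhere.
    """
    # Integer division: any remainder of the budget is left unused.
    attacks_per_category = attack_budget // len(ATTACK_CATEGORIES)
    attacks = []

    for category in ATTACK_CATEGORIES:
        for _ in range(attacks_per_category):
            # Generate an attack prompt for this category
            attack = generate_attack(model, category)

            # Collect the model's response to the attack
            response = model.generate(attack)

            # Score how severe the resulting failure is, if any
            severity = classify_severity(response)

            attacks.append({
                "attack": attack,
                "response": response,
                "severity": severity,
                "category": category
            })

    return aggregate_results(attacks)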
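
The `aggregate_results` helper is not defined in this chapter. As a rough sketch of what it might do, the version below groups attack records by category and reports a failure rate and the worst observed severity. It assumes `severity` is a number where higher means worse, and that a score at or above `severity_threshold` counts as a failure; both the threshold and the output layout are hypothetical.

```python
from collections import defaultdict


def aggregate_results(attacks, severity_threshold=0.5):
    """Summarize red-team records per attack category.

    Each record is a dict with at least "category" and a numeric
    "severity" (assumed: higher is worse).
    """
    per_category = defaultdict(
        lambda: {"attempts": 0, "failures": 0, "max_severity": 0.0}
    )

    for record in attacks:
        stats = per_category[record["category"]]
        stats["attempts"] += 1
        stats["max_severity"] = max(stats["max_severity"], record["severity"])
        # A response at or above the threshold counts as an alignment failure.
        if record["severity"] >= severity_threshold:
            stats["failures"] += 1

    # Attach a failure rate to each category's counts.
    return {
        category: {**stats, "failure_rate": stats["failures"] / stats["attempts"]}
        for category, stats in per_category.items()
    }
```

Reporting per-category rates rather than a single overall number makes it easier to see which attack types the model is weakest against.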