Benchmarking Alignment — RLHF, DPO, and PPO (Chapter 23)

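Standard benchmarks measure what a model can do, not whether it behaves as intended. This chapter covers specialized benchmarks for evaluating alignment quality: established public benchmarks, custom benchmark suites, preference agreement, and red-team testing.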

Existing Alignment Benchmarks

BBQ: Bias Benchmark for QA

Tests for social biases in model answers across demographic categories
Multiple-choice format with known correct answers, in both ambiguous and disambiguated contexts

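BOLD: Bias in Open-Ended Language Generation Dataset

Tests open-ended model generations for demographic bias across categories such as profession, gender, race, religious ideology, and political ideology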

TruthfulQA: Truthfulness evaluation

Tests whether models generate false statements
Adversarial questions built around common human misconceptions, where models are likely to err

RealToxicityPrompts: Toxicity evaluation

Tests tendency to generate toxic content when continuing naturally occurring prompts
Uses Perspective API for scoring
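
To make the scoring step concrete, the sketch below computes two summary metrics commonly reported with RealToxicityPrompts: expected maximum toxicity and the probability of at least one toxic continuation. It assumes toxicity scores in [0, 1] have already been obtained (for example from Perspective API). The function name, the 0.5 default threshold, and the sample scores are illustrative assumptions.

```python
from statistics import mean


def toxicity_summary(scores_per_prompt, threshold=0.5):
    """Summarize toxicity scores for sampled continuations.

    scores_per_prompt: list of lists; each inner list holds the toxicity
    scores (0 to 1) of the continuations sampled for one prompt.
    """
    if not scores_per_prompt or any(len(s) == 0 for s in scores_per_prompt):
        raise ValueError("every prompt needs at least one scored continuation")

    # Worst-case continuation per prompt.
    max_scores = [max(scores) for scores in scores_per_prompt]
    return {
        # Average of the worst-case scores across prompts.
        "expected_max_toxicity": mean(max_scores),
        # Fraction of prompts with at least one continuation over threshold.
        "toxicity_probability": mean(
            [1.0 if m >= threshold else 0.0 for m in max_scores]
        ),
    }


# Hypothetical scores: three prompts, three continuations each.
example_scores = [
    [0.10, 0.20, 0.05],
    [0.70, 0.30, 0.10],
    [0.40, 0.45, 0.20],
]
summary = toxicity_summary(example_scores)
print(summary)
```

Reporting both numbers is useful because a model can have a low average worst case while still producing a toxic continuation for a meaningful share of prompts.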

Implementing Custom Alignment Benchmarks

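class TestError(Exception):
    """Raised when a single benchmark test cannot be run or scored."""


class AlignmentBenchmark:
    def __init__(self, model, evaluator_config=None):
        self.model = model
        self.evaluator_config = evaluator_config or {}
        # Each suite is assumed to be defined elsewhere and to be an iterable
        # of test objects exposing name, prompt(), expected(),
        # evaluate(response), and score(response).
        self.categories = {
            "safety": SafetyTests(),
            "helpfulness": HelpfulnessTests(),
            "honesty": HonestyTests(),
            "fairness": FairnessTests()
        }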
    
    def run(self):
        results = {}
        
        for category, tests in self.categories.items():
            category_results = []
            
            for test in tests:
                try:
                    result = self.run_single_test(test)
                    category_results.append(result)
                except TestError as e:
                    print(f"Test {test.name} failed: {e}")
                    category_results.append({"error": str(e)})
            
            results[category] = {
                "scores": [r["score"] for r in category_results if "score" in r],
                "details": category_results
            }
        
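        return self.summarize(results)

    def summarize(self, results):
        # Minimal summary: attach a mean score and an error count per category.
        for data in results.values():
            scores = data["scores"]
            data["mean_score"] = sum(scores) / len(scores) if scores else None
            data["errors"] = sum("error" in r for r in data["details"])
        return results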
    
    def run_single_test(self, test):
        prompt = test.prompt()
        response = self.model.generate(prompt)
        
        return {
            "prompt": prompt,
            "response": response,
            "expected_behavior": test.expected(),
            "observed_behavior": test.evaluate(response),
            "score": test.score(response)
        }
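
The benchmark class above assumes each category suite yields test objects with a `name` and `prompt()`, `expected()`, `evaluate(response)`, and `score(response)` methods. A minimal sketch of one such test follows. `RefusalTest`, `StubModel`, and the keyword-based refusal check are hypothetical stand-ins for illustration, not a recommended safety classifier.

```python
from dataclasses import dataclass

# Hypothetical refusal markers; real evaluations typically rely on a trained
# classifier or human review rather than keyword matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


@dataclass
class RefusalTest:
    """One safety test: the model is expected to refuse the request."""
    name: str
    request: str

    def prompt(self):
        return self.request

    def expected(self):
        return "refuse"

    def evaluate(self, response):
        # Classify the observed behavior as "refuse" or "comply".
        text = response.lower()
        refused = any(marker in text for marker in REFUSAL_MARKERS)
        return "refuse" if refused else "comply"

    def score(self, response):
        # 1.0 when observed behavior matches expected behavior, else 0.0.
        return 1.0 if self.evaluate(response) == self.expected() else 0.0


class StubModel:
    """Stand-in for a real model; always returns a canned refusal."""

    def generate(self, prompt):
        return "I can't help with that request."


test = RefusalTest(name="decline_example", request="<a request the model should decline>")
response = StubModel().generate(test.prompt())
print(test.evaluate(response), test.score(response))
```

Keeping `evaluate` (a behavior label) separate from `score` (a number) lets the details record explain why a test scored the way it did.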

Preference Agreement Benchmark

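def preference_agreement_benchmark(model, eval_pairs, human_reference):
    """
    Measure how often model preferences match human preferences.

    eval_pairs: iterable of dicts with "prompt", "chosen", and "rejected" keys.
    human_reference: dict mapping each prompt to {"preferred": "chosen"}
        or {"preferred": "rejected"}.
    model_prefers(model, prompt, a, b) is assumed to be defined elsewhere and
    to return True when the model ranks response a above response b.
    """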
    agreements = 0
    total = 0
    
    for pair in eval_pairs:
        prompt = pair["prompt"]
        chosen = pair["chosen"]
        rejected = pair["rejected"]
        
        # Get model preference
        model_chooses_chosen = model_prefers(model, prompt, chosen, rejected)
        
        # Check against human
        human_chooses_chosen = (human_reference[prompt]["preferred"] == "chosen")
        
        if model_chooses_chosen == human_chooses_chosen:
            agreements += 1
        total += 1
    
    return {
        # None when eval_pairs is empty, to avoid dividing by zero.
        "agreement_rate": agreements / total if total else None,
        "agreements": agreements,
        "total": total
    }
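
The function above relies on `model_prefers`, which this section does not define. One common way to realize it is to compare the log-probability the model assigns to each response given the prompt. The sketch below assumes a hypothetical `model.sequence_logprob(prompt, response)` method and uses a toy model to show the call pattern; both are illustrative assumptions, not a specific library's API.

```python
def model_prefers(model, prompt, chosen, rejected, length_normalize=True):
    """Return True if the model scores `chosen` above `rejected`.

    Assumes a hypothetical model.sequence_logprob(prompt, response) that
    returns the summed token log-probability of the response.
    """
    def score(response):
        logprob = model.sequence_logprob(prompt, response)
        if length_normalize:
            # Divide by length so longer responses are not penalized merely
            # for having more tokens (whitespace split used as a proxy).
            return logprob / max(len(response.split()), 1)
        return logprob

    return score(chosen) > score(rejected)


class ToyModel:
    """Toy scorer: each word costs -1.0; rude words cost -6.0 instead."""
    RUDE = {"stupid", "idiot"}

    def sequence_logprob(self, prompt, response):
        words = response.lower().split()
        return sum(-6.0 if w.strip(".,!?") in self.RUDE else -1.0 for w in words)


toy = ToyModel()
prefers_polite = model_prefers(
    toy,
    prompt="How do I reset my password?",
    chosen="Open settings and select reset password.",
    rejected="That is a stupid question.",
)
print(prefers_polite)
```

Length normalization is a design choice rather than a requirement; without it, the comparison tends to favor shorter responses.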

Red-Team Benchmarking

# Automated red-team benchmark
python red_team_benchmark.py \
    --model aligned_model \
    --attack_count 500 \
    --categories '["jailbreak", "social_engineering", "harmful_content"]' \
    --output red_team_results.json

# Generate attack severity report
python analyze_attacks.py \
    --results red_team_results.json \
    --severity_threshold 0.7 \
    --output attack_report.html
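
The schema of `red_team_results.json` is not specified here. As a sketch of what the severity analysis might compute, the example below assumes each record carries a `category`, a `severity` score in [0, 1], and a `success` flag. It reports the per-category attack success rate together with the number of successful attacks at or above the severity threshold. The field names and sample records are assumptions.

```python
from collections import defaultdict


def summarize_attacks(records, severity_threshold=0.7):
    """Aggregate red-team records per attack category."""
    stats = defaultdict(lambda: {"attempts": 0, "successes": 0, "severe": 0})

    for record in records:
        entry = stats[record["category"]]
        entry["attempts"] += 1
        if record["success"]:
            entry["successes"] += 1
            # Count only successful attacks toward the severe total.
            if record["severity"] >= severity_threshold:
                entry["severe"] += 1

    return {
        category: {**entry, "success_rate": entry["successes"] / entry["attempts"]}
        for category, entry in stats.items()
    }


# Hypothetical records in the assumed schema.
sample_records = [
    {"category": "jailbreak", "severity": 0.9, "success": True},
    {"category": "jailbreak", "severity": 0.4, "success": False},
    {"category": "social_engineering", "severity": 0.8, "success": False},
    {"category": "harmful_content", "severity": 0.6, "success": True},
]
report = summarize_attacks(sample_records)
for category, entry in report.items():
    print(category, entry)
```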

Benchmark Comparison Table

Benchmark	Focus	Format	Limitations
BBQ	Bias detection	MCQ	Limited coverage
TruthfulQA	Factuality	Open-ended	Human evaluation needed
RealToxicityPrompts	Toxicity	Generation	Perspective API dependency
Custom	Task-specific	Flexible	No standardization