RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /RLHF, DPO, and PPO
  6. /Ch. 23
RLHF, DPO, and PPO

23. Benchmarking Alignment

Chapter 23 of 24 · 20 min
KEY INSIGHT

No single benchmark captures alignment fully. Effective evaluation requires combining multiple benchmarks covering different aspects—safety, helpfulness, honesty, and fairness—and designing custom tests for domain-specific concerns.

Standard benchmarks measure capabilities, not alignment. This chapter covers specialized benchmarks for evaluating alignment quality.

Existing Alignment Benchmarks

BBQ: Biases in Question Answering

  • Tests for demographic biases in model responses
  • Multiple-choice format with known correct answers

BOLD: Bias in Large Language Models

  • Tests model outputs for demographic bias across categories

TruthfulQA: Truthfulness evaluation

  • Tests whether models generate false statements
  • Adversarial questions where models are likely to err

RealToxicityPrompts: Toxicity detection

  • Tests tendency to generate toxic content
  • Uses Perspective API for scoring

Implementing Custom Alignment Benchmarks

class AlignmentBenchmark:
    def __init__(self, model, evaluator_config):
        self.model = model
        self.categories = {
            "safety": SafetyTests(),
            "helpfulness": HelpfulnessTests(),
            "honesty": HonestyTests(),
            "fairness": FairnessTests()
        }
    
    def run(self):
        results = {}
        
        for category, tests in self.categories.items():
            category_results = []
            
            for test in tests:
                try:
                    result = self.run_single_test(test)
                    category_results.append(result)
                except TestError as e:
                    print(f"Test {test.name} failed: {e}")
                    category_results.append({"error": str(e)})
            
            results[category] = {
                "scores": [r["score"] for r in category_results if "score" in r],
                "details": category_results
            }
        
        return self.summarize(results)
    
    def run_single_test(self, test):
        prompt = test.prompt()
        response = self.model.generate(prompt)
        
        return {
            "prompt": prompt,
            "response": response,
            "expected_behavior": test.expected(),
            "observed_behavior": test.evaluate(response),
            "score": test.score(response)
        }

Preference Agreement Benchmark

def preference_agreement_benchmark(model, eval_pairs, human_reference):
    """
    Measure how often model preferences match human preferences.
    """
    agreements = 0
    total = 0
    
    for pair in eval_pairs:
        prompt = pair["prompt"]
        chosen = pair["chosen"]
        rejected = pair["rejected"]
        
        # Get model preference
        model_chooses_chosen = model_prefers(model, prompt, chosen, rejected)
        
        # Check against human
        human_chooses_chosen = (human_reference[prompt]["preferred"] == "chosen")
        
        if model_chooses_chosen == human_chooses_chosen:
            agreements += 1
        total += 1
    
    return {
        "agreement_rate": agreements / total,
        "agreements": agreements,
        "total": total
    }

Red-Team Benchmarking

# Automated red-team benchmark
python red_team_benchmark.py \
    --model aligned_model \
    --attack_count 500 \
    --categories ["jailbreak", "social_engineering", "harmful_content"] \
    --output red_team_results.json

# Generate attack severity report
python analyze_attacks.py \
    --results red_team_results.json \
    --severity_threshold 0.7 \
    --output attack_report.html

Benchmark Comparison Table

Benchmark Focus Format Limitations
BBQ Bias detection MCQ Limited coverage
TruthfulQA Factuality Open-ended Human evaluation needed
RealToxicity Toxicity Generation Perspective API dependency
Custom Task-specific Flexible No standardization
EXERCISE

Run your aligned model through TruthfulQA and RealToxicityPrompts benchmarks. Compare results against the base model and document areas of improvement and regression.

← Chapter 22
Alignment on Consumer GPU
Chapter 24 →
Model Alignment Pipeline Project