RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RLHF, DPO, and PPO

17. Constitutional AI

Chapter 17 of 24 · 20 min
KEY INSIGHT

Constitutional AI reduces the human labeling burden by using the model itself to generate feedback. This creates a self-supervised loop where the model learns to critique and improve its own outputs according to human-specified principles.

Constitutional AI (CAI) is Anthropic's approach to alignment. It uses a written set of principles (a "constitution") to guide model behavior, so that human feedback is not required for every training example.

The Constitutional Framework

A constitution is a set of principles the model should follow:

CONSTITUTION = [
    "Choose the response that is least likely to contain harmful or unethical content.",
    "Choose the response that would be most helpful and informative.",
    "Choose the response that is most likely to be factual and accurate.",
    "Avoid responses that are deceptive, manipulative, or evasive.",
    "Prefer responses that acknowledge uncertainty when appropriate.",
    # ... 16 principles total in Anthropic's constitution
]

def constitutional_principle(index):
    return CONSTITUTION[index % len(CONSTITUTION)]
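The helper above walks through the principles in a fixed order. An alternative is to draw a principle at random for each critique round so that no single principle dominates. The following is a minimal sketch under that assumption: `EXAMPLE_CONSTITUTION` is a shortened stand-in list and `sample_principles` is a hypothetical helper, neither taken from a published implementation.

```python
import random

# Shortened stand-in list for illustration only.
EXAMPLE_CONSTITUTION = [
    "Choose the response that is least likely to contain harmful or unethical content.",
    "Choose the response that would be most helpful and informative.",
    "Choose the response that is most likely to be factual and accurate.",
]


def sample_principles(constitution, num_rounds, seed=None):
    """Draw one principle per critique round, with replacement (hypothetical helper)."""
    # A private Random instance keeps the draw reproducible without
    # touching the global random state.
    rng = random.Random(seed)
    return [rng.choice(constitution) for _ in range(num_rounds)]


# Four critique rounds, reproducible because the seed is fixed.
picked = sample_principles(EXAMPLE_CONSTITUTION, num_rounds=4, seed=0)
```

Fixing the seed makes a data-generation run repeatable, which helps when comparing two training runs that should see the same principles.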

The CAI Training Process

Stage 1: Supervised Learning on Helpful Responses

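def cai_stage1_training(model, optimizer, prompts, helpful_responses):
    """Standard SFT on helpful demonstrations."""
    for prompt, response in zip(prompts, helpful_responses):
        # compute_sft_loss is assumed to return a differentiable scalar
        # (e.g. cross-entropy over the response tokens).
        loss = compute_sft_loss(model, prompt, response)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()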
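The listing leaves `compute_sft_loss` undefined. One plausible reading is a mean negative log-likelihood over the response tokens, with the prompt tokens masked out. The sketch below shows only that arithmetic in plain Python. The function name `sft_loss`, its inputs, and the prompt-masking convention are assumptions for illustration, and a real training loop would compute the same quantity on tensors.

```python
import math


def sft_loss(token_logprobs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    response_mask:  1 for response tokens, 0 for prompt tokens (not trained on).
    """
    if len(token_logprobs) != len(response_mask):
        raise ValueError("token_logprobs and response_mask must have equal length")
    num_response_tokens = sum(response_mask)
    if num_response_tokens == 0:
        raise ValueError("response_mask selects no tokens")
    # Masked-out positions contribute zero to the sum.
    total = -sum(lp * m for lp, m in zip(token_logprobs, response_mask))
    return total / num_response_tokens


# Two prompt tokens (masked out) followed by two response tokens.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [0, 0, 1, 1]
loss = sft_loss(logprobs, mask)  # equals ln(8) / 2
```

Masking the prompt means the model is only penalized for how well it predicts the demonstration, not for how well it predicts the user's question.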

Stage 2: Constitutional Critique and Revision

def constitutional_critique(model, response, principle):
    """Generate critique of response based on constitutional principle."""
    critique_prompt = f"""Review the following response according to this principle:
    Principle: {principle}
    
    Response: {response}
    
    Identify specific ways the response violates or fails to meet the principle.
    Be specific and constructive."""
    
    critique = model.generate(critique_prompt)
    return critique

def constitutional_revision(model, response, critique):
    """Revise response based on critique."""
    revision_prompt = f"""Original response: {response}
    
    Critique: {critique}
    
    Rewrite the response to address the critique while maintaining its helpfulness.
    Focus on improving the specific issues identified."""
    
    revised = model.generate(revision_prompt)
    return revised

def cai_stage2_training(model, optimizer, prompts, initial_responses):
    """Train on critiqued and revised responses."""
    for prompt, initial in zip(prompts, initial_responses):
        current = initial
        for principle in CONSTITUTION:
            # Critique the latest draft against this principle
            critique = constitutional_critique(model, current, principle)

            # Revise the latest draft so improvements accumulate across principles
            current = constitutional_revision(model, current, critique)

            # Train on the revision
            loss = compute_sft_loss(model, prompt, current)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
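The loop above generates revisions and updates the model in the same pass, so the model producing critiques changes while data is being collected. A common way to avoid that is to build the whole revision dataset first and fine-tune afterwards. The sketch below illustrates the data-collection half only. `StubModel` is a hypothetical stand-in that returns canned text, and `build_revision_dataset` and its prompt wording are assumptions rather than a reference implementation.

```python
class StubModel:
    """Hypothetical stand-in for a language model; returns canned text."""

    def __init__(self):
        self.calls = 0

    def generate(self, prompt):
        self.calls += 1
        if prompt.startswith("Critique"):
            return "The response could state its uncertainty."
        return f"revision-{self.calls}"


def build_revision_dataset(model, prompts, initial_responses, principles, rounds=2):
    """Collect (prompt, final revision) pairs before any gradient step."""
    dataset = []
    for prompt, response in zip(prompts, initial_responses):
        current = response
        for round_index in range(rounds):
            principle = principles[round_index % len(principles)]
            critique = model.generate(
                f"Critique this response.\nPrinciple: {principle}\n"
                f"Prompt: {prompt}\nResponse: {current}"
            )
            # Each round revises the output of the previous round.
            current = model.generate(
                f"Revise this response.\nResponse: {current}\nCritique: {critique}"
            )
        dataset.append({"prompt": prompt, "response": current})
    return dataset


model = StubModel()
data = build_revision_dataset(
    model,
    prompts=["How do vaccines work?"],
    initial_responses=["They just work."],
    principles=["Be accurate.", "Acknowledge uncertainty."],
)
```

With the stub, two rounds make four `generate` calls and the stored response is the output of the last revision call. The resulting list can then be passed to an ordinary SFT loop.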

RLAIF: AI-Assisted Preference Generation

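Constitutional AI also enables scalable preference annotation through reinforcement learning from AI feedback (RLAIF). A model, rather than a human annotator, judges which of two responses better satisfies the principles, and the resulting pairs serve as preference data: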

from itertools import combinations

def rlaif_preference_generation(prompt, responses, clm_model):
    """Use a model to generate preferences between responses."""
    for response_a, response_b in combinations(responses, 2):
        # Ask model which is better according to principles
        judgment_prompt = f"""Prompt: {prompt}
        
        Response A: {response_a}
        Response B: {response_b}
        
        Which response is better according to the principles of helpfulness and harmlessness?
        Consider accuracy, safety, and usefulness.
        Finish with a line of the form 'Answer: A' or 'Answer: B'."""
        
        judgment = clm_model.generate(judgment_prompt)
        # parse_judgment is assumed to return "A" or "B" (anything else is skipped)
        winner = parse_judgment(judgment)
        if winner not in ("A", "B"):
            continue
        if winner == "A":
            chosen, rejected = response_a, response_b
        else:
            chosen, rejected = response_b, response_a
        
        yield {"prompt": prompt, "chosen": chosen, "rejected": rejected}
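The `parse_judgment` helper is left undefined in the listing above. The sketch below gives one minimal version that returns only the winning letter, assuming the judge has been instructed to finish with a line of the form `Answer: A` or `Answer: B`. It also shows a hypothetical `soft_preference` helper that turns the judge's log-probabilities for the two letters into a probability, which can serve as a soft label instead of a hard choice. Both function bodies and the answer format are assumptions for illustration.

```python
import math
import re


def parse_judgment(judgment):
    """Return "A" or "B" from a judgment string, or None if no verdict is found."""
    # Accepts "Answer: A" as well as "Answer: (A)".
    match = re.search(r"Answer:\s*\(?([AB])\)?", judgment)
    return match.group(1) if match else None


def soft_preference(logprob_a, logprob_b):
    """Probability that A is preferred, normalized over the two options."""
    # Subtracting the larger log-prob keeps exp() numerically stable.
    shift = max(logprob_a, logprob_b)
    weight_a = math.exp(logprob_a - shift)
    weight_b = math.exp(logprob_b - shift)
    return weight_a / (weight_a + weight_b)


verdict = parse_judgment("Response B avoids the unsafe advice.\nAnswer: B")
p_a = soft_preference(math.log(0.6), math.log(0.2))  # 0.6 / (0.6 + 0.2)
```

Returning `None` for an unparseable judgment lets the caller drop that pair rather than record an arbitrary preference.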
EXERCISE

Implement a simplified constitutional AI training loop with 4 principles (helpful, harmless, honest, clear). Train for one epoch and compare against standard SFT.
