Constitutional AI — RLHF, DPO, and PPO (Chapter 17)

Constitutional AI (CAI) is Anthropic's approach to alignment. It uses a written set of principles (a "constitution") to guide model behavior: the model critiques, revises, and ranks outputs against those principles, so human feedback is not required for every training example.

The Constitutional Framework

A constitution is a set of principles the model should follow:

CONSTITUTION = [
    "Choose the response that is least likely to contain harmful or unethical content.",
    "Choose the response that would be most helpful and informative.",
    "Choose the response that is most likely to be factual and accurate.",
    "Avoid responses that are deceptive, manipulative, or evasive.",
    "Prefer responses that acknowledge uncertainty when appropriate.",
    # ... 16 principles total in Anthropic's constitution
]

def constitutional_principle(index):
    return CONSTITUTION[index % len(CONSTITUTION)]
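The helper above selects a principle deterministically by index. Another option is to draw a principle at random for each critique. The sketch below assumes that variant; the shortened constitution and the `sample_principle` helper are illustrative names, and the generator is seeded only so that runs are repeatable.

```python
import random

# Shortened stand-in for the full constitution (illustrative only).
CONSTITUTION = [
    "Choose the response that is least likely to contain harmful or unethical content.",
    "Choose the response that would be most helpful and informative.",
    "Choose the response that is most likely to be factual and accurate.",
]

def sample_principle(rng, constitution=CONSTITUTION):
    """Hypothetical helper: draw one principle uniformly at random."""
    return rng.choice(constitution)

rng = random.Random(0)  # fixed seed so the sequence of draws is repeatable
sampled = [sample_principle(rng) for _ in range(5)]
```

Passing the generator explicitly keeps the sampling independent of global random state, which makes data-generation runs easier to reproduce.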

The CAI Training Process

Stage 1: Supervised Learning on Helpful Responses

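def cai_stage1_training(model, optimizer, prompts, helpful_responses):
    """Standard SFT on helpful demonstrations (PyTorch-style sketch)."""
    for prompt, response in zip(prompts, helpful_responses):
        # compute_sft_loss is assumed to return a scalar loss tensor.
        loss = compute_sft_loss(model, prompt, response)
        optimizer.zero_grad()
        loss.backward()   # compute gradients
        optimizer.step()  # update parameters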
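`compute_sft_loss` is used above without a definition. As a rough sketch of what such a helper typically computes, the example below assumes the loss is the mean negative log-likelihood over response tokens only, with prompt tokens masked out. The name `sft_loss_from_logprobs` and its inputs are hypothetical; a real implementation would take the per-token log-probabilities from the model's forward pass.

```python
import math

def sft_loss_from_logprobs(token_logprobs, response_mask):
    """Mean negative log-likelihood over response tokens.

    token_logprobs: log p(token_t | tokens before t) for each position.
    response_mask:  1 for response tokens, 0 for prompt tokens.
    """
    if len(token_logprobs) != len(response_mask):
        raise ValueError("token_logprobs and response_mask must have equal length")
    n_response = sum(response_mask)
    if n_response == 0:
        raise ValueError("response_mask selects no tokens")
    # Only positions where the mask is 1 contribute to the loss.
    total = -sum(lp for lp, m in zip(token_logprobs, response_mask) if m)
    return total / n_response

# Two prompt tokens (masked out) followed by two response tokens.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [0, 0, 1, 1]
loss = sft_loss_from_logprobs(logprobs, mask)
```

Masking the prompt means the model is trained to produce the response given the prompt, not to reproduce the prompt itself.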

Stage 2: Constitutional Critique and Revision

def constitutional_critique(model, response, principle):
    """Generate critique of response based on constitutional principle."""
    critique_prompt = f"""Review the following response according to this principle:
    Principle: {principle}
    
    Response: {response}
    
    Identify specific ways the response violates or fails to meet the principle.
    Be specific and constructive."""
    
    critique = model.generate(critique_prompt)
    return critique

def constitutional_revision(model, response, critique):
    """Revise response based on critique."""
    revision_prompt = f"""Original response: {response}
    
    Critique: {critique}
    
    Rewrite the response to address the critique while maintaining its helpfulness.
    Focus on improving the specific issues identified."""
    
    revised = model.generate(revision_prompt)
    return revised

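def cai_stage2_training(model, optimizer, prompts, initial_responses):
    """Train on critiqued and revised responses (PyTorch-style sketch)."""
    for prompt, initial in zip(prompts, initial_responses):
        for principle in CONSTITUTION:
            # Critique the initial response against this principle
            critique = constitutional_critique(model, initial, principle)

            # Revise the initial response in light of the critique
            revised = constitutional_revision(model, initial, critique)

            # Train on the revision; compute_sft_loss is assumed to
            # return a scalar loss tensor.
            loss = compute_sft_loss(model, prompt, revised)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()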
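The loop above critiques the initial response once per principle. One variant chains the steps instead, so that each revision is critiqued under the next principle and only the final revision is kept as a training target. The sketch below assumes that variant; `StubModel` and `revise_sequentially` are hypothetical names, and the stub only tags its outputs so the data flow is visible without a real language model.

```python
class StubModel:
    """Stand-in for a language model; records prompts and returns tagged text."""

    def __init__(self):
        self.calls = []

    def generate(self, prompt):
        self.calls.append(prompt)
        return f"<output {len(self.calls)}>"

def revise_sequentially(model, response, principles):
    """Chain critique and revision: each revision feeds the next critique."""
    current = response
    for principle in principles:
        critique = model.generate(
            f"Principle: {principle}\nResponse: {current}\nCritique the response."
        )
        current = model.generate(
            f"Original response: {current}\nCritique: {critique}\nRewrite the response."
        )
    return current

model = StubModel()
final = revise_sequentially(model, "draft answer", ["be harmless", "be accurate"])
```

With two principles the stub is called four times, and the second critique prompt contains the first revision rather than the original draft.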

RLAIF: AI-Assisted Preference Generation

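Constitutional AI also enables scalable preference annotation. In RLAIF (reinforcement learning from AI feedback), a model rather than a human annotator judges which of two responses better follows the principles, yielding chosen/rejected pairs for preference training: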

def rlaif_preference_generation(prompt, responses, clm_model):
    """Use a model to generate preferences between responses."""
    for response_a, response_b in generate_pairs(responses):
        # Ask model which is better according to principles
        judgment_prompt = f"""Prompt: {prompt}
        
        Response A: {response_a}
        Response B: {response_b}
        
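        Which response is better according to the principles of helpfulness and harmlessness?
        Consider accuracy, safety, and usefulness. Answer with 'A' or 'B'."""
        
        judgment = clm_model.generate(judgment_prompt)
        # Pass both responses so the verdict can be mapped back to their text
        preference = parse_judgment(judgment, response_a, response_b)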
        
        yield {"prompt": prompt, "chosen": preference.chosen, "rejected": preference.rejected}
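`generate_pairs` and `parse_judgment` are used above without definitions. The following is a minimal sketch of both, assuming the judge answers with a single uppercase letter. The `Preference` type and the answer format are assumptions, and here `parse_judgment` takes the two responses as arguments so that it can return their text.

```python
import itertools
import re
from typing import NamedTuple, Optional

class Preference(NamedTuple):
    chosen: str
    rejected: str

def generate_pairs(responses):
    """Yield every unordered pair of responses."""
    return itertools.combinations(responses, 2)

def parse_judgment(judgment, response_a, response_b) -> Optional[Preference]:
    """Map a verdict such as 'A' or 'Response B is better' onto the pair.

    Uses the first standalone uppercase A or B in the text, so a verdict
    that begins with the article "A" would be misread. Returns None when
    no such letter is found.
    """
    match = re.search(r"\b([AB])\b", judgment)
    if match is None:
        return None
    if match.group(1) == "A":
        return Preference(chosen=response_a, rejected=response_b)
    return Preference(chosen=response_b, rejected=response_a)

pairs = list(generate_pairs(["r1", "r2", "r3"]))
pref = parse_judgment("Response B is better.", "r1", "r2")
```

Returning `None` for an unparseable verdict lets the caller skip that pair instead of recording an arbitrary preference. Judges may also favor a response because of its position in the prompt, so it can be worth querying both orderings and keeping only consistent verdicts.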