Constitutional AI — AI Safety and Alignment (Chapter 15)

Constitutional AI (CAI) aligns models by training them to critique and revise their own outputs according to a written set of principles (the "constitution"). Because the model supplies the feedback, human annotators do not have to label every training example.
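
As a concrete illustration, a set of principles can be as simple as a list of short natural-language statements. The sketch below uses two hypothetical principles (they are not taken from any published constitution) and renders them as a numbered list of the kind a critique prompt can embed. The helper name `format_principles` is an assumption made for this example.

```python
# Hypothetical principles, for illustration only.
principles = [
    "Do not provide instructions that facilitate serious harm.",
    "Be honest about uncertainty instead of guessing.",
]


def format_principles(principles: list[str]) -> str:
    """Render principles as a numbered list, one per line."""
    return "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))


numbered = format_principles(principles)
print(numbered)
```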

The CAI Training Process

class ConstitutionalAI:
    """Implement Constitutional AI training pipeline."""
    
    def __init__(self, model, principles: list[str]):
        self.model = model
        self.principles = principles
        
    def generate_critique(self, response: str, prompt: str) -> str:
        """Generate critique based on constitutional principles."""
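        # The instruction stays on one line: wrapping it inside the
        # triple-quoted string would embed the indentation in the prompt.
        critique_prompt = f"""Review the following response and identify any ways it violates the principles below.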

Principles:
{chr(10).join(f'{i+1}. {p}' for i, p in enumerate(self.principles))}

Prompt: {prompt}
Response: {response}

Identify specific violations:"""
        
        return self.model.generate(critique_prompt)
    
    def generate_revision(self, response: str, critique: str) -> str:
        """Revise response to address critique."""
        # One line again, so no stray indentation ends up inside the prompt.
        revision_prompt = f"""Please revise the response below to address the identified critique.

Original response: {response}

Critique: {critique}

Revised response:"""
        
        return self.model.generate(revision_prompt)
    
    def preference_dataset_from_constitutional(
        self, prompt: str, initial_response: str
    ) -> tuple[str, str, str]:
        """Run one critique-and-revise cycle and return a preference triple.

        Returns (prompt, rejected, chosen): the initial response is treated
        as rejected and the revision as chosen.
        """
        critique = self.generate_critique(initial_response, prompt)
        revision = self.generate_revision(initial_response, critique)

        # The revision is assumed, not verified, to be better than the initial response.
        return prompt, initial_response, revision
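
The triple returned above is ordered (prompt, initial response, revision), with the revision treated as preferred. A minimal sketch of turning such triples into labelled records follows. The `chosen` and `rejected` field names, the helper name `build_preference_records`, and the rule of dropping pairs whose revision equals the initial response are assumptions for illustration, not part of the chapter's pipeline.

```python
def build_preference_records(
    triples: list[tuple[str, str, str]],
) -> list[dict[str, str]]:
    """Convert (prompt, initial, revision) triples into chosen/rejected records."""
    records = []
    for prompt, initial, revision in triples:
        # A revision identical to the initial response carries no preference signal.
        if revision.strip() == initial.strip():
            continue
        records.append({"prompt": prompt, "chosen": revision, "rejected": initial})
    return records


# Hand-written placeholder triples (hypothetical data, not model output).
triples = [
    ("prompt A", "draft A", "revised A"),
    ("prompt B", "draft B", "draft B"),  # unchanged by revision, so it is skipped
]
records = build_preference_records(triples)
print(records)
```

Filtering unchanged pairs is a design choice: a pair with identical sides contributes no information about which response is better, so it would only add noise to training.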

Training with Preference Pairs

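import torch
import torch.nn.functional as F


def preference_loss(reward_a, reward_b, preference_target):
    """Cross-entropy over the two rewards (a Bradley-Terry style pairwise loss).

    This definition is an assumption; the chapter does not specify the loss.
    """
    logits = torch.stack([reward_a, reward_b])
    return -(preference_target * F.log_softmax(logits, dim=0)).sum()


def train_constitutional_preference_model(
    base_model, constitutional_ai, training_prompts, learning_rate=1e-5
):
    """Fine-tune model using constitutional preference pairs.

    Assumes compute_safety_reward (not defined in this chapter) returns a
    scalar tensor that is differentiable with respect to base_model's
    parameters; otherwise loss.backward() has nothing to update.
    """
    optimizer = torch.optim.AdamW(base_model.parameters(), lr=learning_rate)
    # The target encodes the preference: B (the revision) over A (the initial response).
    preference_target = torch.tensor([0.0, 1.0])

    for prompt in training_prompts:
        initial = base_model.generate(prompt)
        # The returned prompt is unchanged, so it is discarded instead of
        # rebinding the loop variable.
        _, response_a, response_b = constitutional_ai.preference_dataset_from_constitutional(
            prompt, initial
        )

        reward_a = compute_safety_reward(prompt, response_a)
        reward_b = compute_safety_reward(prompt, response_b)

        # Loss is small when reward_b exceeds reward_a, large otherwise.
        loss = preference_loss(reward_a, reward_b, preference_target)

        # Clear stale gradients before computing new ones.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()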
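
A pairwise preference loss of the kind `preference_loss` stands for is often written as the Bradley-Terry negative log-likelihood, -log(sigmoid(r_chosen - r_rejected)). The sketch below assumes that form and plain float rewards, and uses only the standard library so the arithmetic can be checked without a GPU or a model. The function name `pairwise_preference_loss` is hypothetical.

```python
import math


def pairwise_preference_loss(reward_rejected: float, reward_chosen: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(chosen - rejected))."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


tie = pairwise_preference_loss(0.0, 0.0)        # equal rewards give log(2)
agrees = pairwise_preference_loss(0.0, 2.0)     # chosen response scored higher
disagrees = pairwise_preference_loss(2.0, 0.0)  # chosen response scored lower
print(tie, agrees, disagrees)
```

The loss shrinks toward zero as the chosen response's reward pulls ahead and grows roughly linearly when the ordering is wrong, which is what pushes the rewards apart in the preferred direction.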

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
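
One way to capture part of that record is a short script run alongside the exercise. The sketch below collects the interpreter version, platform, and installed package versions using only the standard library. The helper name `describe_environment` and the choice of `torch` as the package to report are assumptions; data path and observed output still need to be noted by hand.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages: list[str]) -> dict:
    """Collect interpreter, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


env = describe_environment(["torch"])
print(json.dumps(env, indent=2))
```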