RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /AI Safety and Alignment
  6. /Ch. 15
AI Safety and Alignment

15. Constitutional AI

Chapter 15 of 18 · 15 min
KEY INSIGHT

Constitutional AI makes the alignment specification explicit and auditable. By writing principles in plain language, non-technical stakeholders can review and contest the model's values directly.

Constitutional AI (CAI) aligns models by training them to critique and revise their own outputs according to a written set of principles. This approach reduces reliance on human labels for every training example.

The CAI Training Process

class ConstitutionalAI:
    """Implement Constitutional AI training pipeline."""
    
    def __init__(self, model, principles: list[str]):
        self.model = model
        self.principles = principles
        
    def generate_critique(self, response: str, prompt: str) -> str:
        """Generate critique based on constitutional principles."""
        critique_prompt = f"""Review the following response and identify 
        any ways it violates the principles below.

Principles:
{chr(10).join(f'{i+1}. {p}' for i, p in enumerate(self.principles))}

Prompt: {prompt}
Response: {response}

Identify specific violations:"""
        
        return self.model.generate(critique_prompt)
    
    def generate_revision(self, response: str, critique: str) -> str:
        """Revise response to address critique."""
        revision_prompt = f"""Please revise the response below to address 
        the identified critique.

Original response: {response}

Critique: {critique}

Revised response:"""
        
        return self.model.generate(revision_prompt)
    
    def preference_dataset_from_constitutional(
        self, prompt: str, initial_response: str
    ) -> tuple[str, str, str]:
        """Create preference pair from constitutional critique cycle."""
        critique = self.generate_critique(initial_response, prompt)
        revision = self.generate_revision(initial_response, critique)
        
        # Preference: revised response is better than initial
        return prompt, initial_response, revision

Training with Preference Pairs

def train_constitutional_preference_model(
    base_model, constitutional_ai, training_prompts, learning_rate=1e-5
):
    """Fine-tune model using constitutional preference pairs."""
    optimizer = torch.optim.AdamW(base_model.parameters(), lr=learning_rate)
    
    for prompt in training_prompts:
        initial = base_model.generate(prompt)
        prompt, response_a, response_b = constitutional_ai.preference_dataset_from_constitutional(
            prompt, initial
        )
        
        # Reward model prefers revision (response_b)
        reward_a = compute_safety_reward(prompt, response_a)
        reward_b = compute_safety_reward(prompt, response_b)
        
        preference_target = torch.tensor([0.0, 1.0])  # B preferred over A
        
        # Compute preference loss
        loss = preference_loss(
            reward_a, reward_b, preference_target
        )
        
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Define a custom constitution of 5 principles for a domain-specific application. Run the critique-revision cycle on 20 challenging prompts and measure how often revisions actually improve safety.

← Chapter 14
Safety Guardrails
Chapter 16 →
Output Filtering