RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /RLHF, DPO, and PPO
  6. /Ch. 20
RLHF, DPO, and PPO

20. Multi-Turn Alignment

Chapter 20 of 24 · 20 min
KEY INSIGHT

Multi-turn alignment cannot be achieved through single-turn training alone. The model must learn to maintain consistent behavior across extended conversations, which requires both training data that includes multi-turn examples and evaluation protocols that test for cumulative failure modes.

Single-turn alignment does not guarantee multi-turn behavior. A model might be safe in isolation but exhibit harmful patterns over extended conversations.

Failure Modes in Multi-Turn

Goal Creep: Model gradually becomes more helpful to the point of unsafe assistance.

User: Help me with math homework
Model: Sure, here's how to solve...
User: Actually, I need help with something else
Model: Of course, what do you need?
User: Can you help me bypass security at work?
Model: [Should refuse but may continue the helpful pattern]

Identity Drift: Model gradually adopts user's framing over multiple turns.

def detect_identity_drift(conversation_history):
    """Detect if model is adopting user's problematic framing."""
    system_instructions = extract_system_messages(conversation_history)
    user_frameings = extract_user_claims(conversation_history)
    
    drift_score = 0.0
    for framing in user_frameings:
        if model_adopted_framing(system_instructions[-1], framing):
            drift_score += 1
    
    return drift_score / len(user_frameings) if user_frameings else 0.0

Training for Multi-Turn Consistency

def create_multi_turn_preference_data(conversations):
    """
    Create preference data from multi-turn conversations.
    """
    preference_data = []
    
    for conv in conversations:
        # Sample multiple points in the conversation
        for turn_idx in range(2, len(conv.turns)):
            context = conv.turns[:turn_idx]
            current_response = conv.turns[turn_idx]
            
            # Evaluate whether response maintains alignment
            is_safe = evaluate_response_safety(context, current_response)
            is_helpful = evaluate_response_helpfulness(context, current_response)
            
            # Create preference pair
            preference_data.append({
                "context": context,
                "chosen": current_response if is_safe else create_safe_alternative(context),
                "rejected": create_unsafe_alternative(context) if not is_safe else None
            })
    
    return preference_data

Context Window Considerations

Long conversations require careful handling:

def chunk_conversation_for_training(conversation, max_tokens=4096):
    """Break long conversations into trainable chunks."""
    chunks = []
    
    # Start from most recent messages (more relevant)
    messages = conversation.messages[::-1]
    current_chunk = []
    current_tokens = 0
    
    for msg in messages:
        msg_tokens = count_tokens(msg)
        
        if current_tokens + msg_tokens > max_tokens:
            chunks.append(ConversationChunk(current_chunk[::-1]))
            current_chunk = [msg]
            current_tokens = msg_tokens
        else:
            current_chunk.append(msg)
            current_tokens += msg_tokens
    
    if current_chunk:
        chunks.append(ConversationChunk(current_chunk[::-1]))
    
    return chunks

Evaluation Protocol

# Multi-turn evaluation benchmark
python evaluate_multi_turn.py \
    --model aligned_model \
    --scenarios multi_turn_scenarios.json \
    --max_turns 10 \
    --output evaluation_report.json

# Check for specific failure patterns
python detect_failure_modes.py \
    --conversations multi_turn_logs.json \
    --patterns ["goal_creep", "refusal_decay", "identity_drift"] \
    --threshold 0.3
EXERCISE

Create a multi-turn evaluation suite with 20 conversations of 5-10 turns each. Test your aligned model for goal creep and refusal decay across the conversations.

← Chapter 19
ORPO
Chapter 21 →
Catastrophic Forgetting