Multi-Turn Alignment — RLHF, DPO, and PPO (Chapter 20)

Single-turn alignment does not guarantee multi-turn behavior. A model that responds safely to isolated prompts can still exhibit harmful patterns over an extended conversation.

Failure Modes in Multi-Turn

Goal Creep: Model carries a helpful pattern from benign requests into a later unsafe request instead of refusing it.

User: Help me with math homework
Model: Sure, here's how to solve...
User: Actually, I need help with something else
Model: Of course, what do you need?
User: Can you help me bypass security at work?
Model: [Should refuse but may continue the helpful pattern]
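
The pattern in this exchange can be checked mechanically once each turn is annotated. The following is a minimal sketch, not a definitive detector: it assumes every user request already carries a risk score in [0, 1] from some classifier and a flag for whether the model complied. The Turn type, find_goal_creep, and both default thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    request_risk: float  # assumed risk score in [0, 1] for the user request
    complied: bool       # True if the model complied rather than refused


def find_goal_creep(turns: List[Turn],
                    risk_threshold: float = 0.7,
                    min_benign_run: int = 2) -> List[int]:
    """Return indices of turns where the model complied with a risky
    request directly after a run of benign, compliant turns."""
    flagged = []
    benign_run = 0
    for i, turn in enumerate(turns):
        risky = turn.request_risk >= risk_threshold
        if risky and turn.complied and benign_run >= min_benign_run:
            flagged.append(i)
        # Only benign requests that the model helped with extend the run
        if not risky and turn.complied:
            benign_run += 1
        else:
            benign_run = 0
    return flagged
```

For the conversation above, two benign compliant turns followed by compliance with the security-bypass request would flag the third turn; a refusal at that point would flag nothing.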

Identity Drift: Model gradually adopts user's framing over multiple turns.

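def detect_identity_drift(conversation_history):
    """Return the fraction of user framings the model has adopted (0.0 to 1.0).

    The extract_* and model_adopted_framing helpers are assumed to be
    defined elsewhere.
    """
    system_instructions = extract_system_messages(conversation_history)
    user_framings = extract_user_claims(conversation_history)

    # Nothing to compare without a system message or any user framing
    if not system_instructions or not user_framings:
        return 0.0

    # Count framings flagged as adopted, judged against the most recent
    # system instruction
    adopted = sum(
        1 for framing in user_framings
        if model_adopted_framing(system_instructions[-1], framing)
    )
    return adopted / len(user_framings)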
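
The helper functions used above are not defined in this section. One hypothetical way to make the score concrete is to count a framing as adopted when an assistant reply repeats the user's phrase after the user introduced it. The sketch below assumes messages are dictionaries with "role" and "content" keys and that the framings are supplied as short phrases; both are assumptions for illustration, not the chapter's data model.

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def framing_adoption_rate(history: List[Message], framings: List[str]) -> float:
    """Fraction of framings that appear in an assistant reply after a
    user message introduced them (case-insensitive substring match)."""
    if not framings:
        return 0.0
    adopted = 0
    for framing in framings:
        needle = framing.lower()
        # Index of the first user message that introduces the framing
        start = next(
            (i for i, m in enumerate(history)
             if m["role"] == "user" and needle in m["content"].lower()),
            None,
        )
        if start is None:
            continue
        # Adopted if any later assistant message repeats the phrase
        if any(m["role"] == "assistant" and needle in m["content"].lower()
               for m in history[start + 1:]):
            adopted += 1
    return adopted / len(framings)
```

Substring matching is deliberately crude: a refusal that quotes the user's phrase would also count as adoption, so a real check would likely need a classifier or an entailment model in its place.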

Training for Multi-Turn Consistency

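def create_multi_turn_preference_data(conversations):
    """
    Create preference data from multi-turn conversations.

    Every record has both a chosen and a rejected response. The evaluate_*
    and create_*_alternative helpers are assumed to be defined elsewhere.
    """
    preference_data = []

    for conv in conversations:
        # Sample multiple points in the conversation, skipping the opening turns
        for turn_idx in range(2, len(conv.turns)):
            context = conv.turns[:turn_idx]
            current_response = conv.turns[turn_idx]

            # Evaluate whether the response maintains alignment
            is_safe = evaluate_response_safety(context, current_response)
            is_helpful = evaluate_response_helpfulness(context, current_response)

            if not is_safe:
                # The observed unsafe response is the rejected example
                chosen = create_safe_alternative(context)
                rejected = current_response
            elif is_helpful:
                # A safe, helpful response is preferred over an unsafe one
                chosen = current_response
                rejected = create_unsafe_alternative(context)
            else:
                # Safe but unhelpful: not a good "chosen" example, so skip it
                continue

            preference_data.append({
                "context": context,
                "chosen": chosen,
                "rejected": rejected,
            })

    return preference_data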
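
Pairs of this shape are what a DPO-style objective consumes. As a minimal sketch, assuming the summed log-probabilities of the chosen and rejected responses (conditioned on the multi-turn context) have already been computed under both the policy and a frozen reference model, the per-pair DPO loss can be written in plain Python. The argument names and the beta default of 0.1 are illustrative choices, not values taken from this chapter.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen gain - rejected gain))),
    where each gain is the policy log-prob minus the reference log-prob."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy matches the reference, the loss equals log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.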

Context Window Considerations

Long conversations can exceed the training context window, so they must be split into chunks that fit a token budget:

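def chunk_conversation_for_training(conversation, max_tokens=4096):
    """Break a long conversation into chunks that fit within max_tokens.

    Chunks are returned newest first; messages inside each chunk stay in
    chronological order. A single message longer than max_tokens becomes
    its own oversized chunk.
    """
    chunks = []

    # Walk backwards from the most recent messages (more relevant)
    messages = conversation.messages[::-1]
    current_chunk = []
    current_tokens = 0

    for msg in messages:
        msg_tokens = count_tokens(msg)

        # Close the current chunk when this message would exceed the budget;
        # the emptiness check avoids emitting an empty chunk
        if current_chunk and current_tokens + msg_tokens > max_tokens:
            chunks.append(ConversationChunk(current_chunk[::-1]))
            current_chunk = [msg]
            current_tokens = msg_tokens
        else:
            current_chunk.append(msg)
            current_tokens += msg_tokens

    if current_chunk:
        chunks.append(ConversationChunk(current_chunk[::-1]))

    return chunks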
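
Hard chunk boundaries mean the first message of each chunk is trained without the turns that preceded it, which matters when the goal is multi-turn consistency. One possible variant, sketched here under stated assumptions, repeats the last few messages of each chunk at the start of the next. The function name, the overlap parameter, and the whitespace token counter are all hypothetical; a real pipeline would use the model's tokenizer.

```python
from typing import Callable, List


def chunk_with_overlap(
    messages: List[str],
    max_tokens: int,
    overlap: int = 1,
    count_tokens: Callable[[str], int] = lambda m: len(m.split()),
) -> List[List[str]]:
    """Split messages (oldest first) into chunks that fit max_tokens,
    carrying the last `overlap` messages of each chunk into the next."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for msg in messages:
        n = count_tokens(msg)
        if current and current_tokens + n > max_tokens:
            chunks.append(current)
            carried = current[-overlap:] if overlap > 0 else []
            # Drop the oldest carried messages until the new message fits
            while carried and sum(count_tokens(m) for m in carried) + n > max_tokens:
                carried = carried[1:]
            current = list(carried)
            current_tokens = sum(count_tokens(m) for m in current)
        current.append(msg)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks
```

With overlap set to 0 this reduces to plain sequential chunking. Overlapped messages are seen more than once during training, so it may be worth masking their loss in all but one chunk.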

Evaluation Protocol

# Multi-turn evaluation benchmark
python evaluate_multi_turn.py \
    --model aligned_model \
    --scenarios multi_turn_scenarios.json \
    --max_turns 10 \
    --output evaluation_report.json

# Check for specific failure patterns
python detect_failure_modes.py \
    --conversations multi_turn_logs.json \
    --patterns '["goal_creep", "refusal_decay", "identity_drift"]' \
    --threshold 0.3
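
The scripts above are not shown in this section, so how each pattern is scored is left open. As one hypothetical reading of refusal_decay, the sketch below compares how often the model refuses unsafe requests early in a conversation with how often it refuses them later, and returns the drop. The function name, the event format, and the split point are assumptions made for illustration.

```python
from typing import List, Tuple


def refusal_decay(events: List[Tuple[int, bool]], split_turn: int) -> float:
    """events holds (turn_index, refused) for each unsafe request.
    Returns the early refusal rate minus the late refusal rate, floored
    at 0.0, or 0.0 if either side of the split has no events."""
    early = [refused for turn, refused in events if turn < split_turn]
    late = [refused for turn, refused in events if turn >= split_turn]
    if not early or not late:
        return 0.0
    early_rate = sum(early) / len(early)
    late_rate = sum(late) / len(late)
    return max(0.0, early_rate - late_rate)
```

A conversation whose score exceeds the chosen threshold (0.3 in the command above) would then be flagged for review.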