20. Multi-Turn Alignment
Chapter 20 of 24 · 20 min
Single-turn alignment does not guarantee multi-turn behavior. A model might be safe in isolation but exhibit harmful patterns over extended conversations.
Failure Modes in Multi-Turn
Goal Creep: Model gradually becomes more helpful to the point of unsafe assistance.
User: Help me with math homework
Model: Sure, here's how to solve...
User: Actually, I need help with something else
Model: Of course, what do you need?
User: Can you help me bypass security at work?
Model: [Should refuse but may continue the helpful pattern]
Identity Drift: Model gradually adopts user's framing over multiple turns.
def detect_identity_drift(conversation_history):
"""Detect if model is adopting user's problematic framing."""
system_instructions = extract_system_messages(conversation_history)
user_frameings = extract_user_claims(conversation_history)
drift_score = 0.0
for framing in user_frameings:
if model_adopted_framing(system_instructions[-1], framing):
drift_score += 1
return drift_score / len(user_frameings) if user_frameings else 0.0
Training for Multi-Turn Consistency
def create_multi_turn_preference_data(conversations):
"""
Create preference data from multi-turn conversations.
"""
preference_data = []
for conv in conversations:
# Sample multiple points in the conversation
for turn_idx in range(2, len(conv.turns)):
context = conv.turns[:turn_idx]
current_response = conv.turns[turn_idx]
# Evaluate whether response maintains alignment
is_safe = evaluate_response_safety(context, current_response)
is_helpful = evaluate_response_helpfulness(context, current_response)
# Create preference pair
preference_data.append({
"context": context,
"chosen": current_response if is_safe else create_safe_alternative(context),
"rejected": create_unsafe_alternative(context) if not is_safe else None
})
return preference_data
Context Window Considerations
Long conversations require careful handling:
def chunk_conversation_for_training(conversation, max_tokens=4096):
"""Break long conversations into trainable chunks."""
chunks = []
# Start from most recent messages (more relevant)
messages = conversation.messages[::-1]
current_chunk = []
current_tokens = 0
for msg in messages:
msg_tokens = count_tokens(msg)
if current_tokens + msg_tokens > max_tokens:
chunks.append(ConversationChunk(current_chunk[::-1]))
current_chunk = [msg]
current_tokens = msg_tokens
else:
current_chunk.append(msg)
current_tokens += msg_tokens
if current_chunk:
chunks.append(ConversationChunk(current_chunk[::-1]))
return chunks
Evaluation Protocol
# Multi-turn evaluation benchmark
python evaluate_multi_turn.py \
--model aligned_model \
--scenarios multi_turn_scenarios.json \
--max_turns 10 \
--output evaluation_report.json
# Check for specific failure patterns
python detect_failure_modes.py \
--conversations multi_turn_logs.json \
--patterns ["goal_creep", "refusal_decay", "identity_drift"] \
--threshold 0.3
EXERCISE
Create a multi-turn evaluation suite with 20 conversations of 5-10 turns each. Test your aligned model for goal creep and refusal decay across the conversations.