Reward Model Training — RLHF, DPO, and PPO (Chapter 6)

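Reward models learn to predict human preferences. The architecture is typically a pretrained language model with a scalar head that outputs a single value per input (a prompt together with a response). This value represents the "quality" of the response as judged by humans. Because training uses pairwise comparisons, only differences between scores are meaningful, not their absolute values.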

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model_name):
        super().__init__()
        # AutoModel loads the backbone without a language modeling head,
        # so its output exposes last_hidden_state directly
        self.base_model = AutoModel.from_pretrained(base_model_name)
        # Add reward head
        hidden_size = self.base_model.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
        # Use the last non-padding token's hidden state for the reward.
        # Assumes right padding; position -1 would read a pad token
        # for the shorter sequences in a batch.
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]
        reward = self.reward_head(last_hidden).squeeze(-1)
        return reward

# Pairwise preference loss, computed per batch inside the training loop
def reward_model_loss(chosen_ids, rejected_ids, attention_mask_c, attention_mask_r, reward_model):
    chosen_reward = reward_model(chosen_ids, attention_mask_c)
    rejected_reward = reward_model(rejected_ids, attention_mask_r)

    # Bradley-Terry model: prefer chosen over rejected
    # Loss = -log(sigmoid(chosen - rejected))
    # logsigmoid is numerically stable; log(sigmoid(x) + 1e-8) saturates
    # (and loses its gradient) for large negative margins
    loss = -nn.functional.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()
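
To make the Bradley-Terry loss concrete, here is a minimal pure-Python sketch that evaluates it on scalar rewards. The function names bt_preference_probability and bt_loss are illustrative and not part of any library. The loss is computed as softplus(-margin), which is algebraically equal to -log(sigmoid(margin)) but avoids taking the log of a value that underflows to zero.

```python
import math


def bt_preference_probability(chosen_reward: float, rejected_reward: float) -> float:
    """P(chosen is preferred) = sigmoid(chosen - rejected) under Bradley-Terry."""
    margin = chosen_reward - rejected_reward
    # Branch on the sign so that exp() never receives a large positive argument.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def bt_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(margin)), computed as softplus(-margin)."""
    margin = chosen_reward - rejected_reward
    # softplus(x) = max(x, 0) + log1p(exp(-|x|)), applied here with x = -margin.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss depends only on the margin between the two rewards.
for margin in (-2.0, 0.0, 2.0):
    p = bt_preference_probability(margin, 0.0)
    print(f"margin={margin:+.1f}  P(chosen)={p:.4f}  loss={bt_loss(margin, 0.0):.4f}")
```

A margin of zero gives a loss of log(2), correctly ordered pairs are pushed toward zero loss, and misordered pairs are penalized roughly linearly in the size of the margin.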

Critical design choice: which token to use for the reward signal. The standard approach uses the last token's hidden state (the last non-padding token when batches are padded). In a causal model that position has attended to every earlier token, so its representation is assumed to capture overall quality. Alternatives include mean pooling over all non-padding tokens (which may work better for shorter sequences) or using a dedicated token like [CLS], if your tokenizer and backbone provide one. Decoder-only models usually do not.
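
The two pooling strategies can be compared on a toy tensor. This is a sketch under stated assumptions: the helper names last_token_pool and mean_pool are hypothetical, sequences are right-padded, and attention_mask uses 1 for real tokens and 0 for padding.

```python
import torch


def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Hidden state of the last non-padding token (assumes right padding)."""
    last_idx = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[batch_idx, last_idx]


def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average of the hidden states over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against all-padding rows
    return summed / counts


# Toy batch: 2 sequences, 3 positions, hidden size 2.
# The second sequence has one padding position at the end.
hidden = torch.arange(12, dtype=torch.float32).reshape(2, 3, 2)
attention_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

print(last_token_pool(hidden, attention_mask))
print(mean_pool(hidden, attention_mask))
```

Either function can feed the reward head in place of a fixed position index. In both cases the padded position of the second sequence is excluded from the pooled representation.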

Failure mode: reward model overfitting to annotation artifacts. Human annotators have consistent biases: they might prefer longer responses, responses with certain words, or responses in specific formats. The reward model can learn these artifacts instead of genuine quality, because they are often easier to predict from the text than quality itself. Watch for:

# Symptom: reward model assigns high scores to responses with specific patterns
# that weren't in the training data distribution
# Check: do high-reward responses share formatting patterns?

# Mitigation: add controls like response length bucketing during evaluation
from collections import defaultdict

# eval_data is assumed to yield (prompt, response_tokens, reward) triples,
# where response_tokens is the tokenized response (not the raw string)
metrics_by_length = defaultdict(list)
for prompt, response_tokens, reward in eval_data:
    bucket = len(response_tokens) // 100  # Buckets of 100 tokens
    metrics_by_length[bucket].append(reward)

# If the mean reward rises steadily with the bucket index, you have a length bias problem
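
One way to quantify the length check is to report the mean reward per bucket alongside a correlation coefficient between response length and reward. The following is a minimal standard-library sketch. The helper names and the toy lengths and rewards are made up for illustration and are not results from any real model. Pearson correlation only captures linear association, so the per-bucket means are worth inspecting as well.

```python
from collections import defaultdict
from statistics import mean


def mean_reward_by_bucket(lengths, rewards, bucket_size=100):
    """Mean reward per length bucket, with buckets sorted by index."""
    buckets = defaultdict(list)
    for n_tokens, reward in zip(lengths, rewards):
        buckets[n_tokens // bucket_size].append(reward)
    return {bucket: mean(values) for bucket, values in sorted(buckets.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


# Illustrative data only: token counts and the rewards assigned to them.
lengths = [50, 150, 250, 350]
rewards = [0.1, 0.4, 0.7, 1.0]

print(mean_reward_by_bucket(lengths, rewards))
print(pearson(lengths, rewards))
```

A high correlation is a warning sign rather than proof of bias, since longer responses can be genuinely better. Comparing rewards for responses of similar length, or for pairs where annotators preferred the shorter response, helps separate the two explanations.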