06. Reward Model Training
Reward models learn to predict human preferences. The architecture is typically a language model with a scalar head that outputs a single value per input. This value represents the "quality" of the response as judged by humans.
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model_name):
        super().__init__()
        # AutoModel loads the transformer without a language modeling head,
        # so the forward pass exposes last_hidden_state directly
        self.base_model = AutoModel.from_pretrained(base_model_name)
        # Add reward head: one scalar per sequence
        hidden_size = self.base_model.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # Use the hidden state of the last non-padding token for the reward.
        # Assumes right padding: with padded batches, position -1 would be
        # a pad token for the shorter sequences.
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_hidden = outputs.last_hidden_state[batch_idx, last_idx, :]
        reward = self.reward_head(last_hidden).squeeze(-1)
        return reward
# Pairwise preference loss (called once per batch inside the training loop)
def reward_model_loss(chosen_ids, rejected_ids, attention_mask_c, attention_mask_r, reward_model):
    chosen_reward = reward_model(chosen_ids, attention_mask_c)
    rejected_reward = reward_model(rejected_ids, attention_mask_r)
    # Bradley-Terry model: prefer chosen over rejected
    # Loss = -log(sigmoid(chosen - rejected))
    # logsigmoid is numerically stable, so no epsilon is needed inside the log
    loss = -nn.functional.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()
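To get a feel for how the pairwise loss behaves, it helps to evaluate it at a few reward margins. The sketch below is a plain-Python illustration for a single pair, not part of the training code; the helper name `bradley_terry_loss` is made up here, and it relies on the identity -log(sigmoid(d)) = log(1 + exp(-d)).

```python
import math

def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Pairwise loss -log(sigmoid(chosen - rejected)) for a single pair.

    Hypothetical helper for illustration. Uses the equivalent form
    log(1 + exp(-d)), arranged so that exp() never overflows.
    """
    d = chosen_reward - rejected_reward
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# Margin of zero: the model cannot tell the pair apart, loss is log(2)
print(bradley_terry_loss(1.0, 1.0))

# Chosen scored well above rejected: loss approaches zero
print(bradley_terry_loss(4.0, -1.0))

# Ranking inverted: loss grows roughly linearly with the margin
print(bradley_terry_loss(-1.0, 4.0))
```

Only the difference between the two rewards enters the loss, so the absolute scores are determined only up to an additive constant.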
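Critical design choice: which token to use for the reward signal. The standard approach uses the hidden state of the last token. In a causal language model this is the only position that attends to the entire sequence, so its representation is the natural place to read off overall quality. With padded batches, "last" must mean the last non-padding token. Alternatives include mean pooling over all non-padding tokens (which can work better for shorter sequences) or using a specific token like [CLS] (if your tokenizer has one, which is typical of encoder-style models).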
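The pooling options differ mainly in how they treat padding. A minimal sketch on dummy tensors, assuming right-padded batches and a 0/1 attention mask; the function names `last_token_pool` and `mean_pool` are illustrative and do not come from any library:

```python
import torch

def last_token_pool(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token (right padding assumed)."""
    last_idx = attention_mask.sum(dim=1) - 1               # (batch,)
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[batch_idx, last_idx]                     # (batch, hidden_size)

def mean_pool(hidden, attention_mask):
    """Average hidden states over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)                  # avoid division by zero
    return summed / counts

# Dummy batch: 2 sequences, 4 positions, hidden size 3
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])  # second sequence ends with 2 pad tokens

print(last_token_pool(hidden, attention_mask))
print(mean_pool(hidden, attention_mask))
```

Either function can feed a linear reward head of shape (hidden_size, 1); which one works better is an empirical question for your data.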
Failure mode: reward model overfitting to annotation artifacts. Human annotators have consistent biases: they might prefer longer responses, responses with certain words, or responses in specific formats. The reward model can learn these artifacts rather than genuine quality. Watch for:
from collections import defaultdict

# Symptom: reward model assigns high scores to responses with specific patterns
# that weren't in the training data distribution
# Check: do high-reward responses share formatting patterns?
# Mitigation: add controls like response length bucketing during evaluation
# eval_data is assumed to yield (prompt, response_tokens, reward) triples
metrics_by_length = defaultdict(list)
for prompt, response_tokens, reward in eval_data:
    bucket = len(response_tokens) // 100  # Buckets of 100 tokens
    metrics_by_length[bucket].append(reward)
# If reward correlates strongly with bucket, you have a length bias problem
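One way to turn the bucket check into a single number is to correlate response length with reward. The sketch below is plain Python under assumed inputs: `lengths` and `rewards` are hypothetical parallel lists, the toy values are invented for illustration, and Pearson correlation is only one reasonable choice of statistic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; report 0
    return cov / math.sqrt(var_x * var_y)

# Invented toy values: token counts and the rewards assigned to them
lengths = [40, 120, 260, 310, 480]
rewards = [0.1, 0.4, 0.9, 1.1, 1.6]

r = pearson(lengths, rewards)
print(f"length/reward correlation: {r:.3f}")
```

A high value is a warning sign rather than proof, since longer responses may sometimes be better; comparing against the same correlation in the human preference labels helps separate the two cases.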
Train a reward model on a subset of preference data. Then create a test set with "adversarial" pairs: a good response with an obvious flaw (factual error, rude tone) versus a mediocre but technically correct response. Evaluate your reward model's ability to identify the genuinely better response. This tests whether the model learned actual quality or just surface statistics.
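The exercise reduces to a pairwise accuracy over the adversarial set. A minimal sketch, assuming a hypothetical `score_fn(prompt, response)` that wraps your reward model and returns a float. The stand-in scorer below simply rewards length, to show what a model that learned only surface statistics would look like; the example pairs are invented.

```python
from typing import Callable, List, Tuple

# (prompt, better_response, worse_response); labels come from your own judgment
Pair = Tuple[str, str, str]

def pairwise_accuracy(score_fn: Callable[[str, str], float], pairs: List[Pair]) -> float:
    """Fraction of pairs where the better response gets the strictly higher score."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        1 for prompt, better, worse in pairs
        if score_fn(prompt, better) > score_fn(prompt, worse)
    )
    return correct / len(pairs)

def length_only_scorer(prompt: str, response: str) -> float:
    """Stand-in for a biased reward model: longer is always better."""
    return float(len(response))

# Invented examples: the worse response is longer and polished but factually wrong
adversarial_pairs = [
    ("What is 2 + 2?", "4.", "Great question! After careful thought, 2 + 2 equals 5."),
    ("Capital of France?", "Paris.", "The capital of France is the beautiful city of Lyon."),
]

print(pairwise_accuracy(length_only_scorer, adversarial_pairs))  # 0.0
```

Accuracy near or below 0.5 on such a set suggests the model is tracking surface features rather than the flaws you planted.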