RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RLHF, DPO, and PPO

06. Reward Model Training

Chapter 6 of 24 · 20 min
KEY INSIGHT

Reward model quality determines the ceiling for your alignment procedure. A poor reward model cannot guide the policy to good outputs regardless of how you optimize. Invest in reward model evaluation before moving to policy optimization—you can catch most problems with targeted tests before wasting compute on RL training.

Reward models learn to predict human preferences. The architecture is typically a language model with a scalar head that outputs a single value per input. This value represents the "quality" of the response as judged by humans.

from transformers import AutoModel
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model_name):
        super().__init__()
        # AutoModel loads the transformer without a language modeling head,
        # so the forward pass exposes last_hidden_state directly
        self.base_model = AutoModel.from_pretrained(base_model_name)
        # Add reward head
        hidden_size = self.base_model.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # Use the hidden state of the last non-padding token for the reward.
        # Indexing with [:, -1, :] would read a padding position whenever
        # sequences in the batch are right-padded.
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        last_index = (attention_mask * positions).argmax(dim=1)
        batch_index = torch.arange(input_ids.size(0), device=input_ids.device)
        last_hidden = outputs.last_hidden_state[batch_index, last_index]
        reward = self.reward_head(last_hidden).squeeze(-1)
        return reward

# Pairwise preference loss, called once per batch inside the training loop
def reward_model_loss(chosen_ids, rejected_ids, attention_mask_c, attention_mask_r, reward_model):
    chosen_reward = reward_model(chosen_ids, attention_mask_c)
    rejected_reward = reward_model(rejected_ids, attention_mask_r)

    # Bradley-Terry model: prefer chosen over rejected
    # Loss = -log(sigmoid(chosen - rejected))
    # logsigmoid is numerically stable, so no epsilon is needed inside the log
    loss = -nn.functional.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()
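
To see how the Bradley-Terry loss above behaves, the following minimal sketch evaluates it on plain floats instead of model outputs. The function names `preference_probability` and `pairwise_loss` are illustrative, not part of any library, and the reward values are made up for the example.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # For x >= 0: -log(1 + exp(-x)); for x < 0: x - log(1 + exp(x))
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_probability(chosen_reward: float, rejected_reward: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return math.exp(log_sigmoid(chosen_reward - rejected_reward))

def pairwise_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -log_sigmoid(chosen_reward - rejected_reward)

# Hypothetical (chosen, rejected) reward pairs
pairs = [(2.0, -1.0), (0.5, 0.5), (-1.0, 2.0)]
for chosen, rejected in pairs:
    p = preference_probability(chosen, rejected)
    loss = pairwise_loss(chosen, rejected)
    print(f"margin={chosen - rejected:+.1f}  P(chosen)={p:.3f}  loss={loss:.3f}")
```

Two properties are worth noting. Equal rewards give a loss of log 2 (about 0.693), which is the value an untrained reward head should start near. The loss also depends only on the difference between the two rewards, so the absolute scale of the reward is not pinned down by this objective.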

Critical design choice: which token to use for the reward signal. The standard approach for decoder-only models uses the hidden state of the last non-padding token. Under causal attention this is the only position that has attended to the entire sequence, so its representation is the one most likely to capture overall quality. Alternatives include mean pooling over all non-padding tokens (which may work better for shorter sequences) or using a dedicated token like [CLS] (if your tokenizer has one).
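
The difference between these pooling strategies can be sketched without a framework. This example uses plain Python lists in place of tensors so the indexing is explicit. The function names and the toy hidden states are hypothetical, and a real implementation would apply the same masking logic to batched tensors.

```python
from typing import List

Vector = List[float]

def last_token_pool(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Return the hidden state of the last non-padding token."""
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    return hidden_states[real_positions[-1]]

def mean_pool(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the hidden states of non-padding tokens only."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    return [sum(column) / len(kept) for column in zip(*kept)]

# Toy sequence: three real tokens followed by two padding positions
hidden_states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0, 0]

print(last_token_pool(hidden_states, attention_mask))  # state at index 2
print(mean_pool(hidden_states, attention_mask))        # mean over indices 0..2
print(hidden_states[-1])                               # naive [-1] reads padding
```

Both functions consult the attention mask. Without it, last-token pooling returns a padding vector for right-padded batches, and mean pooling is diluted by padding positions.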

Failure mode: reward model overfitting to annotation artifacts. Human annotators have consistent biases—they might prefer longer responses, responses with certain words, or responses in specific formats. The reward model learns these artifacts rather than genuine quality. Watch for:

# Symptom: reward model assigns high scores to responses with specific patterns
# that weren't in the training data distribution
# Check: do high-reward responses share formatting patterns?

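# Mitigation: add controls like response length bucketing during evaluation
from collections import defaultdict

# eval_data is assumed to yield (prompt, response_tokens, reward) tuples
metrics_by_length = defaultdict(list)
for prompt, response_tokens, reward in eval_data:
    bucket = len(response_tokens) // 100  # Buckets of 100 tokens
    metrics_by_length[bucket].append(reward)

# If reward correlates strongly with bucket, you have a length bias problem
for bucket, rewards in sorted(metrics_by_length.items()):
    print(f"{bucket * 100}-{bucket * 100 + 99} tokens: mean reward {sum(rewards) / len(rewards):.3f}")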
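
One way to put a number on the length check above is to compute the correlation between response length and reward directly. The sketch below implements Pearson correlation by hand so it has no dependencies. `length_reward_correlation` and the sample data are hypothetical, and the tuple layout `(prompt, response_tokens, reward)` is an assumption about how your evaluation set is stored.

```python
import math
from typing import List, Sequence, Tuple

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined here; report no signal
    return cov / math.sqrt(var_x * var_y)

def length_reward_correlation(
    eval_data: List[Tuple[str, List[int], float]]
) -> float:
    """Correlate token count with reward across an evaluation set."""
    lengths = [float(len(tokens)) for _, tokens, _ in eval_data]
    rewards = [reward for _, _, reward in eval_data]
    return pearson(lengths, rewards)

# Made-up evaluation rows: reward rises with token count
sample = [
    ("p1", list(range(50)), 0.1),
    ("p2", list(range(150)), 0.4),
    ("p3", list(range(300)), 0.9),
    ("p4", list(range(600)), 1.7),
]
print(f"length/reward correlation: {length_reward_correlation(sample):.3f}")
```

A high value is not proof of a problem on its own, since longer answers can be legitimately better. It is most informative when compared against the same correlation computed on the human preference labels.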
EXERCISE

Train a reward model on a subset of preference data. Then create a test set of "adversarial" pairs: a polished, well-formatted response that contains an obvious flaw (factual error, rude tone) versus a plain but correct response. Measure how often your reward model scores the genuinely better response higher. This tests whether the model learned actual quality or just surface statistics.
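
A minimal harness for this exercise might look like the following. It assumes a hypothetical `score_fn(prompt, response)` that wraps your trained reward model and returns a float. The `length_biased_score` stand-in and the two example pairs are invented purely to show how a length-biased scorer fails the test.

```python
from typing import Callable, List, Tuple

# Each pair: (prompt, genuinely_better_response, flawed_response)
AdversarialPair = Tuple[str, str, str]

def pairwise_accuracy(
    score_fn: Callable[[str, str], float],
    pairs: List[AdversarialPair],
) -> float:
    """Fraction of pairs where the better response scores strictly higher."""
    correct = sum(
        1
        for prompt, better, flawed in pairs
        if score_fn(prompt, better) > score_fn(prompt, flawed)  # ties count as wrong
    )
    return correct / len(pairs)

def length_biased_score(prompt: str, response: str) -> float:
    """Stand-in scorer that rewards word count only (not a real reward model)."""
    return float(len(response.split()))

pairs: List[AdversarialPair] = [
    (
        "What is the boiling point of water at sea level?",
        "100 degrees Celsius.",
        "Great question! Water famously boils at exactly 50 degrees Celsius "
        "at sea level, as every chemistry student learns.",
    ),
    (
        "How do I reverse a list in Python?",
        "Use my_list.reverse() or my_list[::-1].",
        "Honestly, anyone should know this already, but fine: you can call "
        "my_list.reverse() if you really must ask.",
    ),
]

print(pairwise_accuracy(length_biased_score, pairs))
```

Because the flawed responses are longer in both pairs, the length-only scorer picks the wrong response every time. Replacing `length_biased_score` with a wrapper around your own model gives the number the exercise asks for.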
