RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
RLHF, DPO, and PPO

13. Data Quality Filtering

Chapter 13 of 24 · 20 min
KEY INSIGHT

Data quality filtering is not a one-time preprocessing step; it is an ongoing pipeline concern. Thresholds that worked for the initial training run may be inappropriate for later iterations, and adversarial data requires constant monitoring and adaptation.

Training data quality is a primary determinant of aligned model behavior. Even sophisticated alignment techniques such as RLHF can fail when trained on noisy, inconsistent, or adversarial examples.

Preference Scoring Heuristics

The simplest filtering approach assigns a quality score to each response based on heuristic signals such as length, repetition, user feedback, and specificity:

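import re
from collections import Counter

# The three helpers below are simple stand-ins; adapt them to your data.
def extract_ngrams(text, n=3):
    # Word-level n-grams from whitespace-split, lowercased text
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def has_code_block(text):
    # Looks for a triple-backtick fence
    return ("`" * 3) in text

def has_citations(text):
    # Bracketed numeric references such as [1], or bare URLs
    return bool(re.search(r"\[\d+\]|https?://", text))

def compute_quality_score(prompt, response, metadata):
    score = 0.0

    # Length heuristics (character counts)
    if len(response) < 50:
        score -= 0.5  # Likely too short
    if len(response) > 4000:
        score -= 0.3  # Possibly rambling

    # Repetition penalty: share of all trigrams taken by the most common one
    counts = Counter(extract_ngrams(response, n=3))
    total = sum(counts.values())
    if total >= 10:  # With very few trigrams the ratio is high by construction
        repetition_ratio = max(counts.values()) / total
        if repetition_ratio > 0.3:
            score -= 0.4

    # Human feedback signals (from metadata)
    if metadata.get("thumbs_up", 0) > metadata.get("thumbs_down", 0):
        score += 0.3

    # Specificity rewards
    if has_code_block(response):
        score += 0.2
    if has_citations(response):
        score += 0.15

    return score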
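
To make the repetition penalty concrete, here is a minimal sketch of the ratio on two sample strings. The helper name `trigram_repetition_ratio` and the word-level tokenization are assumptions for illustration; a production pipeline would more likely count tokenizer n-grams.

```python
from collections import Counter

def trigram_repetition_ratio(text: str) -> float:
    """Share of all word trigrams accounted for by the most frequent one."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0  # fewer than three words: nothing to measure
    counts = Counter(trigrams)
    return max(counts.values()) / len(trigrams)

# Hypothetical sample responses
varied = "the quick brown fox jumps over the lazy dog near the quiet river bank"
looping = "I am sorry " * 6

print(trigram_repetition_ratio(varied))   # every trigram is unique
print(trigram_repetition_ratio(looping))  # one trigram covers 6 of 16 positions
```

The looping string yields 6/16 = 0.375, which crosses the 0.3 cutoff used in the scoring function, while the varied sentence stays well below it.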

Automated Quality Detection

Many pipelines train a classifier on human-rated samples, then keep only the responses that score above a threshold:

# Train a quality classifier on human-rated samples
python train_quality_classifier.py \
    --training-data human_ratings.parquet \
    --model bert-base \
    --output quality_classifier.pt

# Filter dataset with threshold
python filter_dataset.py \
    --input raw_preferences.jsonl \
    --classifier quality_classifier.pt \
    --threshold 0.75 \
    --output filtered_preferences.jsonl
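
The thresholding step can be sketched in a few lines of Python. This is not the contents of `filter_dataset.py`, which the chapter does not show; the record fields (`prompt`, `chosen`, `rejected`), the `filter_by_score` name, and the toy scorer standing in for a trained classifier are all assumptions.

```python
import json
from typing import Callable, Iterable, Iterator

def filter_by_score(
    lines: Iterable[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.75,
) -> Iterator[dict]:
    """Yield JSONL records whose chosen response scores at or above threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        record = json.loads(line)
        score = score_fn(record["prompt"], record["chosen"])
        if score >= threshold:
            record["quality_score"] = score  # keep the score for later audits
            yield record

# Toy scorer standing in for a trained classifier: longer answers score higher.
def toy_score(prompt: str, response: str) -> float:
    return min(len(response) / 100, 1.0)

raw = [
    json.dumps({"prompt": "Explain DPO", "chosen": "x" * 90, "rejected": "no"}),
    json.dumps({"prompt": "Explain PPO", "chosen": "short", "rejected": "no"}),
]
kept = list(filter_by_score(raw, toy_score))
print(len(kept), kept[0]["prompt"])
```

Storing the score on each kept record makes it cheap to re-filter later with a different threshold without re-running the classifier.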

Pairs vs. Singles Filtering

Filtering individual responses differs from filtering preference pairs:

Aspect       | Single Response       | Preference Pair
Criterion    | Response quality      | Relative quality
Signal       | Classifier score      | Consistency of preference
Failure mode | Misses good responses | Requires both responses valid
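
One way to read the pair column is as a two-part check: both responses must clear a validity floor, and the preference itself must be consistent. The sketch below measures consistency as annotator agreement, which is an assumption; the `PreferencePair` fields and both thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    chosen_score: float    # assumed per-response quality score in [0, 1]
    rejected_score: float
    votes_for_chosen: int  # annotators who preferred `chosen`
    total_votes: int

def keep_pair(pair: PreferencePair,
              min_quality: float = 0.2,
              min_agreement: float = 0.7) -> bool:
    """Keep a pair only if both responses are valid and annotators agree."""
    # Both sides must clear a low validity floor, even the rejected one.
    if min(pair.chosen_score, pair.rejected_score) < min_quality:
        return False
    if pair.total_votes == 0:
        return False  # no preference signal at all
    agreement = pair.votes_for_chosen / pair.total_votes
    return agreement >= min_agreement

clear = PreferencePair("q", "good", "weaker", 0.9, 0.4, 4, 5)
split = PreferencePair("q", "good", "weaker", 0.9, 0.4, 3, 5)
broken = PreferencePair("q", "good", "garbage", 0.9, 0.05, 5, 5)
print(keep_pair(clear), keep_pair(split), keep_pair(broken))
```

Note that the floor for the rejected response is deliberately low: it only has to be a valid response, not a good one, since the pair's value lies in the contrast.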

Adversarial Data Handling

Attackers can deliberately craft preference pairs to corrupt training. A first-pass detector flags pairs that match known attack patterns:

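import re
from difflib import SequenceMatcher

# Illustrative patterns only; real pipelines need far broader coverage
INJECTION_RE = re.compile(r"ignore (all )?(previous|prior) instructions", re.I)
JAILBREAK_RE = re.compile(r"jailbreak|developer mode|no (rules|restrictions)", re.I)

def contains_injection_pattern(prompt):
    return bool(INJECTION_RE.search(prompt))

def is_jailbreak_prompt(prompt):
    return bool(JAILBREAK_RE.search(prompt))

def compute_embedding_similarity(a, b):
    # Stand-in: character-level ratio; swap in cosine similarity of embeddings
    return SequenceMatcher(None, a, b).ratio()

def detect_adversarial_pair(prompt, chosen, rejected):
    # Check for prompt injection patterns
    if contains_injection_pattern(prompt):
        return True

    # Near-identical responses carry no usable preference signal
    if compute_embedding_similarity(chosen, rejected) > 0.95:
        return True

    # Reversed quality: a jailbreak prompt whose chosen response complies
    if is_jailbreak_prompt(prompt):
        # Crude refusal check: an apology suggests the model declined
        return "sorry" not in chosen.lower()

    return False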
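
Because adversarial data needs ongoing monitoring rather than a one-off check, it can help to track how often a detector fires over time. The following is a minimal sketch under stated assumptions: the `FlagRateMonitor` class, the baseline rate, and the alert rule (recent average above a multiple of the baseline) are all hypothetical choices, not part of any particular library.

```python
from collections import deque

class FlagRateMonitor:
    """Track the share of flagged pairs over a sliding window of batches."""

    def __init__(self, baseline_rate, tolerance=2.0, window=5):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance         # alert at tolerance x baseline
        self.rates = deque(maxlen=window)  # most recent per-batch flag rates

    def update(self, flags):
        """Record one batch of boolean detector outputs; return True to alert."""
        flags = list(flags)
        if not flags:
            return False  # empty batch: nothing to record
        self.rates.append(sum(flags) / len(flags))
        recent = sum(self.rates) / len(self.rates)
        return recent > self.baseline_rate * self.tolerance

monitor = FlagRateMonitor(baseline_rate=0.02)
quiet = monitor.update([False] * 98 + [True] * 2)   # 2% of the batch flagged
spike = monitor.update([False] * 80 + [True] * 20)  # 20% of the batch flagged
print(quiet, spike)
```

A sudden rise in the flag rate suggests either a coordinated attack or a detector whose patterns no longer fit the incoming data; both cases call for a manual review of the thresholds.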
EXERCISE

Implement a quality scoring function that combines length, repetition, and classifier signals. Test it on a small dataset and compare filtered vs. unfiltered training curves.
