13. Data Quality Filtering
Chapter 13 of 24 · 20 min
Training data quality is a primary determinant of aligned model behavior. Even sophisticated alignment techniques such as RLHF can fail when applied to noisy, inconsistent, or adversarial examples.
Preference Scoring Heuristics
The simplest filtering approach assigns a quality score to each training example based on heuristic signals such as length, repetition, user feedback, and specificity:
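from collections import Counter

# extract_ngrams, has_code_block, and has_citations are helpers
# assumed to be defined elsewhere in the pipeline.

def compute_quality_score(prompt, response, metadata):
    score = 0.0

    # Length heuristics
    if len(response) < 50:
        score -= 0.5  # Likely too short
    if len(response) > 4000:
        score -= 0.3  # Possibly rambling

    # Repetition penalty: share of all trigrams taken by the most common one.
    # The n-grams must be hashable (tuples or strings) to be counted.
    counts = Counter(extract_ngrams(response, n=3))
    if counts:  # Guard against responses too short to yield any trigram
        repetition_ratio = max(counts.values()) / sum(counts.values())
        if repetition_ratio > 0.3:
            score -= 0.4

    # User feedback signals (from metadata)
    if metadata.get("thumbs_up", 0) > metadata.get("thumbs_down", 0):
        score += 0.3

    # Specificity rewards
    if has_code_block(response):
        score += 0.2
    if has_citations(response):
        score += 0.15

    return score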
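The scoring function above depends on three helpers that the chapter does not define. The sketch below shows one possible version of each, under stated assumptions: n-grams are word-level and split on whitespace, a code block is delimited by triple backticks, and a citation is either a bracketed number or a URL. The `repetition_ratio` function is a hypothetical convenience wrapper that isolates the repetition signal.

```python
import re
from collections import Counter

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def extract_ngrams(text, n=3):
    """Return word-level n-grams as tuples (assumption: whitespace tokenization)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def has_code_block(text):
    """Assumption: a code block is opened and closed by triple backticks."""
    return text.count(FENCE) >= 2


def has_citations(text):
    """Assumption: a citation looks like [3] or like an http(s) URL."""
    return bool(re.search(r"\[\d+\]|https?://\S+", text))


def repetition_ratio(text, n=3):
    """Share of all n-grams taken by the most frequent one (0.0 if there are none)."""
    counts = Counter(extract_ngrams(text, n))
    if not counts:
        return 0.0
    return max(counts.values()) / sum(counts.values())
```

With these definitions, a response that loops over the same few words, such as "the cat sat the cat sat the cat sat", exceeds the 0.3 repetition threshold, while ordinary prose of any real length stays well below it.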
Automated Quality Detection
Many pipelines go beyond heuristics: they train a classifier on human-rated samples, then keep only the examples it scores above a threshold:
# Train a quality classifier on human-rated samples
python train_quality_classifier.py \
    --training-data human_ratings.parquet \
    --model bert-base \
    --output quality_classifier.pt

# Filter dataset with threshold
python filter_dataset.py \
    --input raw_preferences.jsonl \
    --classifier quality_classifier.pt \
    --threshold 0.75 \
    --output filtered_preferences.jsonl
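The internals of the filtering script are not shown, so the following is only a sketch of what threshold filtering over JSONL records could look like. The record field `chosen`, the function names, and the `toy_score` stand-in for a trained classifier are all assumptions made for illustration.

```python
import json


def filter_records(lines, score_fn, threshold=0.75):
    """Yield parsed JSONL records whose score meets or exceeds the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if score_fn(record) >= threshold:
            yield record


def toy_score(record):
    """Hypothetical stand-in for a classifier: longer chosen responses score higher."""
    return min(len(record["chosen"]) / 100.0, 1.0)


raw_lines = [
    json.dumps({"prompt": "p1", "chosen": "x" * 90, "rejected": "y"}),
    json.dumps({"prompt": "p2", "chosen": "short", "rejected": "y"}),
]
kept = list(filter_records(raw_lines, toy_score))
```

Passing the scorer in as a function keeps the filtering loop independent of the model, so the same loop works with a heuristic score or a real classifier.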
Pairs vs. Singles Filtering
Filtering individual responses differs from filtering preference pairs:
| Aspect | Single Response | Preference Pair |
|---|---|---|
| Criterion | Response quality | Relative quality |
| Signal | Classifier score | Consistency of preference |
| Failure mode | Misses good responses | Pair is unusable if either response is invalid |
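To make the preference-pair column concrete, here is a minimal sketch of a pair filter. It assumes a hypothetical annotation format in which each pair carries a list of annotator votes labeled "a" or "b", plus a validity flag per response. Neither the format nor the 0.75 agreement threshold comes from the chapter.

```python
def preference_agreement(votes):
    """Fraction of votes that agree with the majority label (0.0 if no votes)."""
    if not votes:
        return 0.0
    a_votes = sum(1 for v in votes if v == "a")
    return max(a_votes, len(votes) - a_votes) / len(votes)


def keep_pair(valid_a, valid_b, votes, min_agreement=0.75):
    """Keep a pair only if both responses are valid and the preference is consistent."""
    if not (valid_a and valid_b):
        return False  # one invalid response makes the whole pair unusable
    return preference_agreement(votes) >= min_agreement
```

Note that the filter never looks at absolute quality: a pair with two mediocre responses is kept as long as annotators consistently prefer one over the other.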
Adversarial Data Handling
Attackers may intentionally craft preference pairs designed to corrupt training. A simple heuristic detector flags pairs that show common warning signs:
# contains_injection_pattern, compute_embedding_similarity, and
# is_jailbreak_prompt are helpers assumed to be defined elsewhere.

def detect_adversarial_pair(prompt, chosen, rejected):
    """Return True if the preference pair looks adversarial."""
    # Check for prompt injection patterns
    if contains_injection_pattern(prompt):
        return True

    # Near-identical responses are hard to distinguish and carry no usable preference
    similarity = compute_embedding_similarity(chosen, rejected)
    if similarity > 0.95:
        return True

    # Check for reversed quality: a jailbreak prompt whose chosen response is not a refusal
    if is_jailbreak_prompt(prompt):
        # Crude check: the word "sorry" is only a rough proxy for a refusal
        return "sorry" not in chosen.lower()

    return False
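The similarity check above depends on `compute_embedding_similarity`, which the chapter leaves undefined. The sketch below is a stand-in that uses bag-of-words counts in place of learned embeddings, so it is self-contained and dependency-free. A real pipeline would more likely encode both texts with a sentence-embedding model and apply the same cosine formula to the resulting vectors.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_u * norm_v)


def compute_embedding_similarity(text_a, text_b):
    """Stand-in (assumption): word counts instead of learned embeddings."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    return cosine_similarity(vec_a, vec_b)
```

Because word counts ignore order and meaning, this version flags only near-verbatim duplicates. Paraphrased duplicates would need real embeddings to be caught by the 0.95 threshold.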
EXERCISE
Implement a quality scoring function that combines length, repetition, and classifier signals. Test it on a small dataset and compare filtered vs. unfiltered training curves.