16. Helpfulness vs Harmlessness
Chapter 16 of 24 · 20 min
The core tension in alignment training is between helpfulness (giving the most useful response possible) and harmlessness (avoiding dangerous or inappropriate outputs). The two goals conflict in edge cases, such as requests that could serve either a benign or a harmful purpose.
The Tradeoff Landscape
[Figure: helpfulness (vertical axis, low to high) plotted against harmlessness (horizontal axis, low to high). The optimal zone (balanced assistance) sits at high helpfulness. The low-helpfulness region is labeled as tending toward refusal and excessive caution.]
Explicit Tradeoff Weighting
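A reward model trained on preference data encodes the helpfulness-harmlessness tradeoff only implicitly. The tradeoff can be made explicit by combining a helpfulness score with a separate safety score under a configurable weight:
def compute_combined_reward(prompt, response, reward_model, safety_classifier, lambda_param=0.3):
    # Helpfulness: reward model score (assumed to be scaled to [0, 1])
    helpfulness = reward_model.score(prompt, response)
    # Assumption: predict() returns the probability that the response is
    # harmful, so harmlessness is its complement
    harm_probability = safety_classifier.predict(response)
    harmlessness = 1 - harm_probability
    # lambda_param controls the tradeoff:
    # 0 counts helpfulness only, 1 counts harmlessness only
    combined = (1 - lambda_param) * helpfulness + lambda_param * harmlessness
    return combined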
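To see how the weight changes which response wins, the sketch below applies the same weighted sum to two hypothetical candidate responses. The scores are made-up illustrative numbers assumed to lie in [0, 1], the classifier output is assumed to be a probability of harm, and the names `combined_reward`, `CANDIDATES`, and `preferred` are introduced here for illustration only.

```python
def combined_reward(helpfulness: float, harm_probability: float, lam: float) -> float:
    """Weighted sum of helpfulness and harmlessness (1 - harm probability)."""
    return (1 - lam) * helpfulness + lam * (1 - harm_probability)


# Hypothetical scores for two candidate responses to the same prompt.
CANDIDATES = {
    "detailed": {"helpfulness": 0.9, "harm_probability": 0.40},  # more useful, riskier
    "cautious": {"helpfulness": 0.5, "harm_probability": 0.05},  # safer, less useful
}


def preferred(lam: float) -> str:
    """Return the name of the candidate with the higher combined reward."""
    return max(CANDIDATES, key=lambda name: combined_reward(lam=lam, **CANDIDATES[name]))


for lam in (0.1, 0.3, 0.6):
    print(f"lambda={lam}: preferred response is {preferred(lam)!r}")
```

With these numbers the preference flips from the detailed response to the cautious one once lambda exceeds roughly 0.53, which is why the choice of weight matters most for borderline responses.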
Calibration for Edge Cases
Different request types warrant different weightings. A larger lambda puts more weight on harmlessness:
def get_adaptive_lambda(request_type):
    """Adjust helpfulness-harmlessness tradeoff per request type."""
    if request_type in ["creative_writing", "general_knowledge"]:
        return 0.1  # Emphasize helpfulness
    elif request_type in ["medical_advice", "legal_advice"]:
        return 0.6  # Emphasize caution
    elif request_type == "code_generation":
        return 0.2  # Slight helpfulness emphasis
    else:
        return 0.3  # Default weighting
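The same mapping can also be kept as data rather than branching logic, which may make the weights easier to audit and tune. This is a minimal sketch that reuses the values above. The table name, the `lookup_lambda` helper, and the range check are illustrative additions, not part of the original function.

```python
# Per-request-type weights, copied from the branching version above.
LAMBDA_BY_REQUEST_TYPE = {
    "creative_writing": 0.1,
    "general_knowledge": 0.1,
    "code_generation": 0.2,
    "medical_advice": 0.6,
    "legal_advice": 0.6,
}
DEFAULT_LAMBDA = 0.3


def lookup_lambda(request_type: str) -> float:
    """Return the tradeoff weight for a request type, falling back to the default."""
    lam = LAMBDA_BY_REQUEST_TYPE.get(request_type, DEFAULT_LAMBDA)
    # Guard against a mistyped table entry: the weight must stay in [0, 1]
    # for the weighted sum to remain a convex combination.
    if not 0.0 <= lam <= 1.0:
        raise ValueError(f"lambda for {request_type!r} is out of range: {lam}")
    return lam


print(lookup_lambda("medical_advice"))  # listed type
print(lookup_lambda("translation"))     # unlisted type, uses the default
```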
Refusal Calibration
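Models often over-refuse on ambiguous requests. Measuring the refusal rate on a set of ambiguous but benign prompts shows whether the refusal threshold is too strict:
def calibrate_refusal_threshold(model, threshold=0.5, max_refusal_rate=0.15):
    """Measure the refusal rate on ambiguous prompts and flag over-refusal."""
    # load_ambiguous_prompts() is assumed to return a non-empty list of prompts
    test_prompts = load_ambiguous_prompts()
    refusals = 0
    for prompt in test_prompts:
        # Assumes the model API returns a refusal score with each generation
        response = model.generate(prompt, return_scores=True)
        if response.refusal_score > threshold:
            refusals += 1
    refusal_rate = refusals / len(test_prompts)
    print(f"Refusal rate on ambiguous: {refusal_rate:.1%}")
    # A response counts as a refusal when its score exceeds the threshold,
    # so over-refusal is reduced by raising the threshold, not lowering it
    if refusal_rate > max_refusal_rate:
        print("Warning: Over-refusing on benign prompts")
    return refusal_rate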
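The function above only reports the refusal rate. One way to act on that number is to sweep candidate thresholds over precomputed refusal scores and pick the lowest threshold whose refusal rate meets a target. The sketch below assumes, as in the code above, that a prompt is refused when its score exceeds the threshold. The scores are hypothetical, and `refusal_rate` and `pick_threshold` are helper names introduced here.

```python
def refusal_rate(scores, threshold):
    """Fraction of prompts whose refusal score exceeds the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s > threshold for s in scores) / len(scores)


def pick_threshold(scores, target_rate=0.15, candidates=None):
    """Return the lowest candidate threshold with refusal rate <= target_rate.

    Returns None if no candidate meets the target.
    """
    if candidates is None:
        candidates = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    for threshold in sorted(candidates):
        if refusal_rate(scores, threshold) <= target_rate:
            return threshold
    return None


# Hypothetical refusal scores for ten ambiguous but benign prompts.
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.62, 0.70, 0.80, 0.90]

print(f"Refusal rate at 0.5: {refusal_rate(scores, 0.5):.0%}")
print(f"Lowest threshold meeting the 15% target: {pick_threshold(scores)}")
```

Choosing the lowest qualifying threshold keeps the model as cautious as the target allows. A threshold picked this way should also be checked against a set of genuinely harmful prompts, since raising it reduces refusals there too.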
User Intent Disambiguation
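Some requests could be either harmful or benign. One approach is to estimate the user's intent and respond according to how confident that estimate is:
def handle_ambiguous_request(prompt):
    """Respond appropriately to ambiguous requests."""
    # classify_user_intent, pick_safe_portion, and the response classes
    # are assumed to be defined elsewhere
    interpretation = classify_user_intent(prompt)
    if interpretation.malicious_probability > 0.7:
        # Likely malicious: refuse
        return RefusalResponse("I'm not able to help with that.")
    elif interpretation.benign_probability > 0.8:
        # Likely benign: help in full
        return HelpfulResponse(prompt)
    else:
        # Ambiguous case: help with the safe portion and ask for clarification
        return PartialResponse(
            "I can help with part of this request. Could you clarify...",
            safe_portion=pick_safe_portion(prompt),
        )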
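The routing logic above can be tried in isolation by separating the decision from the classifier and the response objects. The following is a minimal sketch under that assumption. `IntentEstimate` and `route` are hypothetical names, the thresholds mirror the 0.7 and 0.8 values above, and the probabilities are supplied by hand instead of coming from a real intent classifier.

```python
from dataclasses import dataclass


@dataclass
class IntentEstimate:
    """Hand-supplied stand-in for an intent classifier's output."""
    malicious_probability: float
    benign_probability: float  # need not sum to 1 with the malicious probability


def route(intent: IntentEstimate, refuse_above: float = 0.7, help_above: float = 0.8) -> str:
    """Map an intent estimate to one of three actions."""
    if intent.malicious_probability > refuse_above:
        return "refuse"
    if intent.benign_probability > help_above:
        return "help"
    # Neither signal is strong enough: partial help plus a clarifying question.
    return "partial"


print(route(IntentEstimate(malicious_probability=0.90, benign_probability=0.05)))
print(route(IntentEstimate(malicious_probability=0.05, benign_probability=0.90)))
print(route(IntentEstimate(malicious_probability=0.40, benign_probability=0.50)))
```

Checking the malicious branch first means a request scored high on both signals is refused, which is the more conservative ordering.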
EXERCISE
Create a dataset of 30 ambiguous prompts (requests that could be benign or harmful). Evaluate your model on each one and identify where it is miscalibrated in either direction: refusing requests it should help with, or helping with requests it should refuse.
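One possible way to tally the results is sketched below. It assumes you record, for each prompt, your own judgment of whether a refusal was appropriate and whether the model actually refused. The `Result` record, the `miscalibration_report` helper, and the placeholder entries are illustrative only and contain no real evaluation data.

```python
from dataclasses import dataclass


@dataclass
class Result:
    prompt: str
    should_refuse: bool  # your judgment of the appropriate behavior
    did_refuse: bool     # what the model actually did


def miscalibration_report(results):
    """Compute the rate of errors in each direction over all prompts."""
    if not results:
        raise ValueError("results must be non-empty")
    over_refusals = [r for r in results if r.did_refuse and not r.should_refuse]
    over_compliances = [r for r in results if not r.did_refuse and r.should_refuse]
    return {
        "over_refusal_rate": len(over_refusals) / len(results),
        "over_compliance_rate": len(over_compliances) / len(results),
    }


# Placeholder records; replace them with your 30 prompts and observations.
results = [
    Result("placeholder prompt 1", should_refuse=False, did_refuse=True),
    Result("placeholder prompt 2", should_refuse=True, did_refuse=False),
    Result("placeholder prompt 3", should_refuse=False, did_refuse=False),
    Result("placeholder prompt 4", should_refuse=True, did_refuse=True),
]
print(miscalibration_report(results))
```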