16. Helpfulness vs Harmlessness

Chapter 16 of 24 · 20 min

The core tension in alignment training is balancing helpful responses (being maximally useful) against harmlessness (avoiding dangerous or inappropriate outputs). These goals conflict in edge cases.

The Tradeoff Landscape

Helpfulness  ▲
            │
     HIGH   │     OPTIMAL ZONE
            │   (balanced assistance)
            │
            ├─────────────────────► Harmlessness
      LOW   │   LOW          HIGH
            │
     Tends  │     Tends to
     toward │     excessive
     refuse │     caution

Explicit Tradeoff Weighting

The reward model learns implicit weights for helpfulness vs. harmlessness:

def compute_combined_reward(prompt, response, reward_model, safety_classifier):
    # Helpfulness: reward model score
    helpfulness = reward_model.score(prompt, response)
    
    # Harmlessness: safety classifier score
    safety_score = safety_classifier.predict(response)
    
    # Combine with configurable weights
    # lambda controls the helpfulness-harmlessness tradeoff
    lambda_param = 0.3
    
    combined = (1 - lambda_param) * helpfulness + lambda_param * (1 - safety_score)
    
    return combined

Calibration for Edge Cases

Different request types require different weightings:

def get_adaptive_lambda(request_type):
    """Adjust helpfulness-harmlessness tradeoff per request type."""
    if request_type in ["creative_writing", "general_knowledge"]:
        return 0.1  # Emphasize helpfulness
    elif request_type in ["medical_advice", "legal_advice"]:
        return 0.6  # Emphasize caution
    elif request_type in ["code_generation"]:
        return 0.2  # Slight helpfulness emphasis
    else:
        return 0.3  # Balanced default

Refusal Calibration

Models often over-refuse on ambiguous requests:

def calibrate_refusal_threshold(model, threshold=0.5):
    """Adjust refusal behavior to match desired helpfulness level."""
    # Evaluate on ambiguous prompts
    test_prompts = load_ambiguous_prompts()
    
    refusals = 0
    for prompt in test_prompts:
        response = model.generate(prompt, return_scores=True)
        if response.refusal_score > threshold:
            refusals += 1
    
    refusal_rate = refusals / len(test_prompts)
    print(f"Refusal rate on ambiguous: {refusal_rate:.1%}")
    
    # If over-refusing, lower threshold
    if refusal_rate > 0.15:
        print("Warning: Over-refusing on benign prompts")

User Intent Disambiguation

Handling requests that could be harmful or benign:

def handle_ambiguous_request(prompt):
    """Respond appropriately to ambiguous requests."""
    interpretation = classify_user_intent(prompt)
    
    if interpretation.malicious_probability > 0.7:
        return RefusalResponse("I'm not able to help with that.")
    elif interpretation.benign_probability > 0.8:
        return HelpfulResponse(prompt)
    else:
        # Ambiguous case: provide partial assistance
        return PartialResponse(
            "I can help with part of this request. Could you clarify...",
            safe_portion=pick_safe_portion(prompt)
        )
EXERCISE

Create a dataset of 30 ambiguous prompts (requests that could be benign or harmful). Evaluate your model on each and identify where it miscalibrates either direction.