Evaluation of Reasoning — DeepSeek R1 and Reasoning Models (Chapter 11)

Evaluating reasoning models requires more than accuracy metrics. A model that achieves 95% accuracy on arithmetic but produces invalid reasoning chains is not reasoning; it is pattern-matching. Operators need systematic evaluation frameworks that catch this distinction.

The Correctness Fallacy

A common failure mode is conflating output correctness with reasoning quality. A model might arrive at correct answers through buggy intermediate steps that happen to cancel out, or through lucky guesswork. This distinction matters operationally: a model with flawed reasoning will fail on slightly modified problems.

Consider a model solving a geometry problem. It might state "angle A equals angle B because they're both 90 degrees" when the diagram shows angle A as 85 degrees and angle B as 90 degrees. The final answer might coincidentally be correct, but the reasoning violates the problem's constraints.
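The same effect is easy to reproduce with arithmetic. The sketch below is illustrative only: it assumes each reasoning step can be reduced to an (expression, claimed value) pair, and the names eval_arithmetic and invalid_steps are hypothetical helpers, not part of any evaluation library. It recomputes each step independently and shows a chain whose two errors cancel, so the final answer is correct while every step is wrong.

```python
import ast
import operator

# Supported binary operators for the restricted arithmetic evaluator.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_arithmetic(expr):
    """Evaluate a +, -, *, / expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)

def invalid_steps(steps, tol=1e-9):
    """Return indices of steps whose claimed value differs from the recomputed one."""
    return [i for i, (expr, claimed) in enumerate(steps)
            if abs(eval_arithmetic(expr) - claimed) > tol]

# Problem: 12 * 4 + 10. Two errors cancel: 12 * 4 is 48 (not 50),
# and 50 + 10 is 60 (not 58), yet the final answer 58 is correct.
chain = [("12 * 4", 50), ("50 + 10", 58)]
final_answer, ground_truth = 58, 58

answer_correct = final_answer == ground_truth
bad = invalid_steps(chain)
print(answer_correct, bad)
```

An accuracy-only metric would score this chain as a success; the step-level check flags both steps.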

Evaluation Dimensions

Effective reasoning evaluation examines five dimensions:

Correctness: Does the final answer match the problem's answer? This is necessary but insufficient.

Completeness: Are all necessary steps present? A reasoning chain missing a crucial transformation step is incomplete even if the answer is correct.

Consistency: Do intermediate claims hold given the problem constraints? Check each step against stated facts and prior steps.

Generalization: Does the reasoning apply to structurally similar problems? Test with problem variants.

Efficiency: Is the reasoning path minimal? Excessive steps often indicate confusion rather than thoroughness.

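def evaluate_reasoning(problem, problem_type, reasoning_chain, final_answer, ground_truth):
    """Multi-dimensional reasoning evaluation.

    check_completeness, check_consistency, test_generalization, and
    expected_length are assumed to be defined elsewhere; the three checks
    are assumed to return scores in [0, 1].
    """
    # Chains at or below the expected length score 1.0; longer chains are penalized.
    efficiency = min(1.0, expected_length(problem_type) / max(len(reasoning_chain), 1))

    results = {
        'correctness': float(final_answer == ground_truth),
        'completeness': check_completeness(reasoning_chain, problem),
        'consistency': check_consistency(reasoning_chain, problem),
        'generalization': test_generalization(reasoning_chain, problem_type),
        'efficiency': efficiency,
    }

    # Weighted composite score (weights sum to 1.0)
    weights = {'correctness': 0.3, 'completeness': 0.2,
               'consistency': 0.25, 'generalization': 0.15, 'efficiency': 0.1}

    composite = sum(results[k] * weights[k] for k in weights)
    return results, composite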

Consistency Testing

For production reasoning systems, consistency testing catches regressions before they impact users. This involves:

Identify problem categories where the model consistently fails
Create test variants that share underlying logic
Run consistency checks every deployment cycle
Track consistency drift over time
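The last two steps can be sketched as a small drift check. This is a minimal illustration under stated assumptions: results are grouped by problem category as lists of pass/fail booleans, one rate is recorded per deployment cycle, and the function names, window size, and threshold are hypothetical choices rather than recommended values.

```python
from statistics import mean

def consistency_rate(results):
    """Fraction of variant groups in which every variant passed.

    `results` maps a category name to a list of pass/fail booleans.
    """
    return mean(1.0 if all(group) else 0.0 for group in results.values())

def detect_drift(history, window=3, threshold=0.05):
    """Flag drift when the latest rate falls more than `threshold`
    below the mean of the previous `window` rates."""
    if len(history) <= window:
        return False  # not enough history to form a baseline
    baseline = mean(history[-window - 1:-1])
    return baseline - history[-1] > threshold

# One rate per deployment cycle, oldest first (made-up numbers).
history = [0.92, 0.93, 0.91, 0.84]
drifted = detect_drift(history)
print(drifted)
```

Counting a group as passing only when every variant passes is a deliberately strict choice; a per-variant average would hide exactly the partial failures this section is about.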

A model that correctly solves "2x + 5 = 15, find x" but fails "3x + 7 = 16, find x" has a generalization failure: both problems require the same two algebraic steps, so only the surface numbers differ. The failure indicates the model learned surface features rather than algebraic manipulation.
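A variant test for this family of equations might look like the sketch below. Everything here is illustrative: make_variants, pass_rate, and the two stand-in solvers are hypothetical, and a real harness would call the model under test instead of a Python function.

```python
import random
import re

def make_variants(n, seed=0):
    """Generate n problems of the form 'ax + b = c, find x' with integer solutions."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        a, x, b = rng.randint(2, 9), rng.randint(1, 12), rng.randint(1, 20)
        variants.append((f"{a}x + {b} = {a * x + b}, find x", x))
    return variants

def pass_rate(solve, variants):
    """Fraction of (prompt, answer) pairs that `solve` answers correctly."""
    return sum(solve(prompt) == answer for prompt, answer in variants) / len(variants)

def memorizing_solver(prompt):
    """Stand-in for a model that has only memorized one surface form."""
    return {"2x + 5 = 15, find x": 5}.get(prompt)

def algebraic_solver(prompt):
    """Stand-in for a model that performs the manipulation: x = (c - b) / a."""
    a, b, c = map(int, re.match(r"(\d+)x \+ (\d+) = (\d+)", prompt).groups())
    return (c - b) // a

pair = [("2x + 5 = 15, find x", 5), ("3x + 7 = 16, find x", 3)]
print(pass_rate(memorizing_solver, pair))                   # 0.5
print(pass_rate(algebraic_solver, pair + make_variants(20)))  # 1.0
```

Building each problem from a chosen solution x guarantees an integer answer, so the ground truth is known by construction rather than looked up.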

Ground Truth Generation Failure

Ground truth evaluation depends on correct ground truths, and on test problems the model has not already seen. Math benchmark contamination is well-documented: models trained on internet data may have seen test problems. GSM8K contamination is particularly problematic because the benchmark is public and many of its problems are simple enough to appear frequently in training data.

For reliable evaluation, use held-out benchmarks, synthetic test generation, or private problem banks. Assume any public benchmark result is an upper bound on true performance.
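One rough way to act on this is to run the same model on a public benchmark and on a private problem bank, then compare the two accuracies. The sketch below assumes per-problem pass/fail outcomes are already available; accuracy and contamination_gap are hypothetical names, and the outcome lists are made-up placeholders rather than measured results.

```python
def accuracy(outcomes):
    """Fraction of problems answered correctly."""
    return sum(outcomes) / len(outcomes)

def contamination_gap(public_outcomes, private_outcomes):
    """Public accuracy minus private accuracy.

    A large positive gap is consistent with contamination, but it can also
    reflect a difficulty mismatch between the two sets, so treat it as a
    signal to investigate rather than as proof.
    """
    return accuracy(public_outcomes) - accuracy(private_outcomes)

# Placeholder outcomes: True means the model answered correctly.
public = [True, True, True, False]
private = [True, False, True, False]
gap = contamination_gap(public, private)
print(gap)  # 0.25
```

The comparison is only meaningful when the private bank is matched to the public benchmark in topic and difficulty.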