14. Verification Loops

Chapter 14 of 18 · 20 min

A reasoning model's answer is only as reliable as the process that verifies it. Verification loops, in which the model checks its own reasoning, can catch errors before they reach users.

The Self-Verification Principle

Humans verify their own work by re-checking constraints and re-deriving results from different angles. Reasoning models can implement similar loops:

  1. Generate candidate reasoning chain
  2. Verify each step against problem constraints
  3. Re-derive final answer from verified steps
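  4. If verification fails, regenerate or expand reasoning

def verify_reasoning_chain(problem, reasoning_chain, answer):
    """Verify a reasoning chain against problem constraints.

    check_step_syntax, check_step_consistency, check_constraints and
    rederive_from_steps are domain-specific helpers that are not defined
    here and must be supplied for your own problem format.
    """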
    verification_results = []
    
    for step_idx, step in enumerate(reasoning_chain):
        # Check each step
        step_valid = check_step_syntax(step)
        step_consistent = check_step_consistency(step, reasoning_chain[:step_idx])
        step_constraints_ok = check_constraints(step, problem.constraints)
        
        verification_results.append({
            'step': step_idx,
            'valid': step_valid,
            'consistent': step_consistent,
            'constraints_satisfied': step_constraints_ok
        })
        
        if not (step_valid and step_consistent and step_constraints_ok):
            return {
                'verified': False,
                'failed_step': step_idx,
                'failure_reasons': verification_results[-1]
            }
    
    # Re-derive from steps to catch hidden errors
    rederived = rederive_from_steps(reasoning_chain)
    if rederived != answer:
        return {'verified': False, 'mismatch': 'rederivation_failed'}
    
    return {'verified': True, 'steps': verification_results}

Outcome-Based vs. Step-Based Verification

Two verification approaches offer different tradeoffs:

Outcome-based verification checks only the final answer without examining reasoning quality. This catches obvious failures but misses reasoning errors that happen to produce correct answers.

Step-based verification examines each inference step. This catches logical failures but requires the model to generate verification metadata and adds latency.
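
The difference between the two approaches is easiest to see on a chain whose errors cancel out. The sketch below is a toy illustration, not part of any real verifier: it assumes each reasoning step can be written as a hypothetical `(a, op, b, claimed)` tuple of integer arithmetic, and the names `outcome_check` and `step_check` are made up for this example.

```python
import operator

# Supported operations for the toy arithmetic steps (illustrative only).
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def outcome_check(answer, reference):
    """Outcome-based: compare only the final answer to a known reference."""
    return answer == reference

def step_check(steps):
    """Step-based: return the indices of steps whose claimed result is wrong."""
    return [
        idx
        for idx, (a, op, b, claimed) in enumerate(steps)
        if OPS[op](a, b) != claimed
    ]

# Problem: 10 + 5 - 1, reference answer 14.
# Step 0 miscalculates 10 + 5 as 16; step 1 then subtracts 2 instead of 1,
# so the two mistakes cancel and the final answer is still 14.
chain = [(10, '+', 5, 16), (16, '-', 2, 14)]
answer = 14

print(outcome_check(answer, 14))  # True: the outcome check passes
print(step_check(chain))          # [0]: the step check flags the first step
```

Note that the step check here only catches the arithmetic slip in step 0. The wrong operand in step 1 is a constraint violation, which would need a separate check against the problem statement.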

Use step-based verification for high-stakes decisions (legal, medical, financial) and outcome-based verification for high-volume, low-stakes queries (customer service, content recommendations).

# Decision table for verification depth
VERIFICATION_STRATEGY = {
    'high_stakes': {
        'mode': 'step_based',
        'auto_retry_on_failure': True,
        'max_generations': 3
    },
    'medium_stakes': {
        'mode': 'outcome_based_with_sampling',
        'sample_rate': 0.2,
        'step_check_on_sample_failure': True
    },
    'low_stakes': {
        'mode': 'outcome_based',
        'auto_retry_on_failure': False  # same flag name as high_stakes, so callers read one key
    }
}
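
One way such a table might be consumed is sketched below. This is a sketch under stated assumptions rather than a prescribed dispatcher: `plan_verification` is a hypothetical helper, the table is a trimmed copy holding only the keys the sketch reads, and the medium-stakes behaviour (a random fraction of responses is escalated to a step check when its outcome check fails) is one possible reading of `sample_rate` and `step_check_on_sample_failure`.

```python
import random

# Trimmed copy of the decision table, keeping only the keys this sketch reads.
VERIFICATION_STRATEGY = {
    'high_stakes': {'mode': 'step_based', 'max_generations': 3},
    'medium_stakes': {'mode': 'outcome_based_with_sampling', 'sample_rate': 0.2},
    'low_stakes': {'mode': 'outcome_based'},
}

def plan_verification(stakes, rng=random):
    """Decide which checks to run for one response (hypothetical helper).

    rng needs a .random() method returning a float in [0, 1).
    """
    strategy = VERIFICATION_STRATEGY[stakes]
    mode = strategy['mode']
    if mode == 'step_based':
        # Every response gets the full step check, with a bounded retry budget.
        return {'step_check': 'always',
                'max_generations': strategy['max_generations']}
    if mode == 'outcome_based_with_sampling':
        # Assumed reading: a random fraction of responses is escalated to a
        # step check whenever its outcome check fails.
        sampled = rng.random() < strategy['sample_rate']
        return {'step_check': 'on_outcome_failure' if sampled else 'never',
                'max_generations': 1}
    # Plain outcome-based mode: final answer only, single generation.
    return {'step_check': 'never', 'max_generations': 1}
```

Passing a seeded `random.Random` instance as `rng` makes the sampling decision reproducible in tests.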

Sampling-Based Verification

For high-volume applications, verify by sampling: select a fraction of responses for detailed step-based verification, then extrapolate the measured pass rate to the full population.

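import random

def sample_verify_responses(responses, sample_rate=0.1):
    """Verify a sample of responses and estimate population quality.

    compute_ci is assumed to be defined elsewhere.
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    # Verify at least one response (and at most all of them) so the
    # verified rate below never divides by zero.
    sample_size = min(len(responses), max(1, int(len(responses) * sample_rate)))
    indices = random.sample(range(len(responses)), sample_size)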
    
    sample_results = []
    for idx in indices:
        result = verify_reasoning_chain(
            responses[idx]['problem'],
            responses[idx]['reasoning'],
            responses[idx]['answer']
        )
        sample_results.append(result)
    
    verified_rate = sum(1 for r in sample_results if r['verified']) / len(sample_results)
    
    return {
        'estimated_population_verified_rate': verified_rate,
        'sample_results': sample_results,
        'confidence_interval': compute_ci(sample_results, sample_size, len(responses))
    }
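
The function above calls `compute_ci` without defining it. A minimal sketch of one possible implementation follows, assuming a normal-approximation (Wald) interval with a finite population correction. That choice is an assumption, not something the chapter specifies, and the `z` parameter is added here for illustration.

```python
import math

def compute_ci(sample_results, sample_size, population_size, z=1.96):
    """Approximate confidence interval for the verified rate.

    Assumes a simple random sample drawn without replacement.
    z=1.96 corresponds to roughly 95% coverage.
    """
    if sample_size == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = sum(1 for r in sample_results if r['verified']) / sample_size
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    if population_size > 1:
        # Finite population correction: the interval shrinks as the sample
        # approaches the whole population.
        remaining = max(population_size - sample_size, 0)
        standard_error *= math.sqrt(remaining / (population_size - 1))
    margin = z * standard_error
    return (max(0.0, p - margin), min(1.0, p + margin))
```

The normal approximation is unreliable for small samples or rates close to 0 or 1. A Wilson score interval is a common alternative in those cases.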

Failure mode: Sampling verification assumes your sample is representative. If you sample during low-traffic periods or from specific problem types, your estimate may not generalize.
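
One common mitigation is stratified sampling, where each problem type contributes to the sample in proportion to its share of traffic. The sketch below assumes each response carries a `problem_type` field. That field is hypothetical and does not appear in the response format used earlier, and `stratified_sample_indices` is an illustrative name.

```python
import random
from collections import defaultdict

def stratified_sample_indices(responses, sample_rate=0.1,
                              key='problem_type', seed=None):
    """Pick indices so every problem type is sampled at about sample_rate."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, response in enumerate(responses):
        # Responses missing the field are grouped under 'unknown'.
        strata[response.get(key, 'unknown')].append(idx)

    chosen = []
    for indices in strata.values():
        # At least one response per stratum, never more than the stratum holds.
        k = min(len(indices), max(1, round(len(indices) * sample_rate)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)
```

Because every stratum contributes at least one response, rare problem types are slightly over-represented. Any population-level estimate should weight per-type rates by each type's true share.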

The Infinite Loop Risk

Verification loops can fail to terminate if:

  • The model keeps generating incorrect steps
  • Verification criteria are too strict, so correct steps keep being rejected (false alarms)
  • The model enters a regeneration cycle without improving

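Set hard limits on verification iterations and monitor regeneration rates. A regeneration rate above 50% suggests the model frequently fails on its first attempt, which usually points to a training or prompting problem rather than something more verification can fix. If the rejected steps turn out to be correct on inspection, the verification criteria are the likelier cause.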
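
A bounded loop along these lines might look like the sketch below. It assumes the caller supplies two callables, `generate(attempt, failures)` and `verify(candidate)`, where `verify` returns a dict with a `'verified'` key as in the earlier examples. All function names and the `'stopped'` field are illustrative, not an established API.

```python
def generate_with_verification(generate, verify, max_attempts=3):
    """Regenerate until verification passes, with a hard attempt limit."""
    failures = []
    seen = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt, failures)
        if candidate in seen:
            # The model is cycling through the same output: stop early.
            return {'verified': False, 'answer': None,
                    'attempts': attempt, 'stopped': 'repeated_candidate'}
        seen.append(candidate)
        result = verify(candidate)
        if result['verified']:
            return {'verified': True, 'answer': candidate,
                    'attempts': attempt, 'stopped': 'verified'}
        failures.append(result)  # fed back into the next generate() call
    return {'verified': False, 'answer': None,
            'attempts': max_attempts, 'stopped': 'budget_exhausted'}

def regeneration_rate(outcomes):
    """Fraction of requests that needed more than one attempt."""
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o['attempts'] > 1) / len(outcomes)
```

Each exit path maps to one of the failure modes listed above: the attempt budget bounds a model that keeps producing incorrect steps, and the repeated-candidate check stops a regeneration cycle that is not improving.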

EXERCISE

Implement a verification function for your reasoning outputs. Run it on 50 recent responses and calculate your verification failure rate. If failures exceed 10%, examine whether failures cluster by problem type or reasoning pattern.
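
As a possible starting point for the clustering step, the sketch below summarizes failures by problem type. It assumes each record is a dict with a `'verified'` flag and a `'problem_type'` label. Both field names and the `failure_report` helper are hypothetical and should be adapted to your own logging format.

```python
from collections import Counter

def failure_report(records):
    """Overall verification failure rate plus a per-problem-type breakdown."""
    total = len(records)
    failures = [r for r in records if not r['verified']]
    totals_by_type = Counter(r['problem_type'] for r in records)
    failed_by_type = Counter(r['problem_type'] for r in failures)
    return {
        'failure_rate': len(failures) / total if total else 0.0,
        # A Counter returns 0 for types that never failed.
        'failure_rate_by_type': {
            problem_type: failed_by_type[problem_type] / count
            for problem_type, count in totals_by_type.items()
        },
    }
```

With only 50 responses, per-type rates rest on very few examples, so treat a cluster as a prompt for manual inspection rather than a firm finding.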