14. Verification Loops
A reasoning model that produces an answer is only as reliable as its verification process. Verification loops—where the model checks its own reasoning—catch errors before they reach users.
The Self-Verification Principle
Humans verify their own work by re-checking constraints and re-deriving results from different angles. Reasoning models can implement similar loops:
- Generate candidate reasoning chain
- Verify each step against problem constraints
- Re-derive final answer from verified steps
- If verification fails, regenerate or expand reasoning
def verify_reasoning_chain(problem, reasoning_chain, answer):
    """Verify a reasoning chain against problem constraints."""
    # check_step_syntax, check_step_consistency, check_constraints, and
    # rederive_from_steps are application-specific helpers not defined here.
    verification_results = []
    for step_idx, step in enumerate(reasoning_chain):
        # Check the step on its own, against earlier steps, and against the problem
        step_valid = check_step_syntax(step)
        step_consistent = check_step_consistency(step, reasoning_chain[:step_idx])
        step_constraints_ok = check_constraints(step, problem.constraints)
        verification_results.append({
            'step': step_idx,
            'valid': step_valid,
            'consistent': step_consistent,
            'constraints_satisfied': step_constraints_ok
        })
        # Stop at the first failing step and report which checks it failed
        if not (step_valid and step_consistent and step_constraints_ok):
            return {
                'verified': False,
                'failed_step': step_idx,
                'failure_reasons': verification_results[-1]
            }
    # Re-derive from steps to catch hidden errors
    rederived = rederive_from_steps(reasoning_chain)
    if rederived != answer:
        return {'verified': False, 'mismatch': 'rederivation_failed'}
    return {'verified': True, 'steps': verification_results}
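The function above covers steps 2 and 3 of the list; step 4 (regenerate on failure) needs an outer loop with a hard cap. A minimal sketch of that loop follows. The names `run_verification_loop`, `generate`, and `verify` are hypothetical: `generate` stands in for whatever call produces a (reasoning chain, answer) pair, and `verify` for a wrapper around a verifier such as `verify_reasoning_chain`.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Candidate = Tuple[List[str], Any]  # (reasoning_chain, answer)


def run_verification_loop(
    generate: Callable[[Optional[Dict[str, Any]]], Candidate],
    verify: Callable[[List[str], Any], Dict[str, Any]],
    max_generations: int = 3,
) -> Dict[str, Any]:
    """Generate, verify, and regenerate until verified or the budget runs out."""
    feedback: Optional[Dict[str, Any]] = None  # no feedback on the first attempt
    history: List[Dict[str, Any]] = []
    for attempt in range(1, max_generations + 1):
        chain, answer = generate(feedback)
        result = verify(chain, answer)
        history.append(result)
        if result.get('verified'):
            return {'verified': True, 'answer': answer,
                    'attempts': attempt, 'history': history}
        # Hand the failure details to the next attempt so it can target them.
        feedback = result
    # Budget exhausted: report failure rather than returning an unverified answer.
    return {'verified': False, 'answer': None,
            'attempts': max_generations, 'history': history}
```

With the verifier from this section, `verify` could be `lambda chain, answer: verify_reasoning_chain(problem, chain, answer)`. Returning `None` as the answer on exhaustion is a design choice of this sketch; an application might instead escalate to a human reviewer.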
Outcome-Based vs. Step-Based Verification
Two verification approaches offer different tradeoffs:
Outcome-based verification checks only the final answer without examining reasoning quality. This catches obvious failures but misses reasoning errors that happen to produce correct answers.
Step-based verification examines each inference step. This catches logical failures but requires the model to generate verification metadata and adds latency.
Use step-based verification for high-stakes decisions (legal, medical, financial) and outcome-based verification for high-volume, low-stakes queries (customer service, content recommendations).
# Decision table for verification depth
VERIFICATION_STRATEGY = {
    'high_stakes': {
        'mode': 'step_based',
        'auto_retry_on_failure': True,
        'max_generations': 3
    },
    'medium_stakes': {
        'mode': 'outcome_based_with_sampling',
        'sample_rate': 0.2,
        'step_check_on_sample_failure': True
    },
    'low_stakes': {
        'mode': 'outcome_based',
        'no_retry': True
    }
}
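To show how a table like the one above might drive per-response behavior, here is a hedged sketch of a dispatcher. The function `plan_verification`, the returned keys, and the compact `STRATEGIES` copy are all hypothetical. The medium tier is read one particular way: every response gets an outcome check, and a random 20% sample is marked so that a failed outcome check escalates to a step check. The table itself does not settle that reading.

```python
import random
from typing import Any, Dict, Optional

# Compact copy of the decision table so this sketch runs on its own.
STRATEGIES: Dict[str, Dict[str, Any]] = {
    'high_stakes': {'mode': 'step_based', 'max_generations': 3},
    'medium_stakes': {'mode': 'outcome_based_with_sampling', 'sample_rate': 0.2},
    'low_stakes': {'mode': 'outcome_based'},
}


def plan_verification(stakes: str,
                      rng: Optional[random.Random] = None) -> Dict[str, Any]:
    """Return which checks to run for one response at the given stakes level."""
    rng = rng or random.Random()
    config = STRATEGIES[stakes]  # an unknown level raises KeyError on purpose
    mode = config['mode']
    if mode == 'step_based':
        return {'check_outcome': True, 'check_steps': 'always',
                'max_generations': config.get('max_generations', 1)}
    if mode == 'outcome_based_with_sampling':
        # Assumed reading: only sampled responses escalate to a step check.
        sampled = rng.random() < config['sample_rate']
        return {'check_outcome': True,
                'check_steps': 'on_outcome_failure' if sampled else 'never',
                'max_generations': 1}
    return {'check_outcome': True, 'check_steps': 'never', 'max_generations': 1}
```

Passing a seeded `random.Random` makes the sampling decision reproducible in tests and audits.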
Sampling-Based Verification
For high-volume applications, verify by sampling: select a random fraction of responses for detailed step-based verification, then extrapolate the observed pass rate to the full population.
import random

def sample_verify_responses(responses, sample_rate=0.1):
    """Verify a sample of responses and estimate population quality."""
    if not responses:
        raise ValueError("responses must not be empty")
    # Verify at least one response (so the rate below is defined) and never
    # more than the population, since random.sample draws without replacement
    sample_size = min(len(responses), max(1, int(len(responses) * sample_rate)))
    indices = random.sample(range(len(responses)), sample_size)
    sample_results = []
    for idx in indices:
        result = verify_reasoning_chain(
            responses[idx]['problem'],
            responses[idx]['reasoning'],
            responses[idx]['answer']
        )
        sample_results.append(result)
    verified_rate = sum(1 for r in sample_results if r['verified']) / len(sample_results)
    return {
        'estimated_population_verified_rate': verified_rate,
        'sample_results': sample_results,
        # compute_ci is an external helper (not defined in this function)
        'confidence_interval': compute_ci(sample_results, sample_size, len(responses))
    }
Failure mode: Sampling verification assumes your sample is representative. If you sample during low-traffic periods or from specific problem types, your estimate may not generalize.
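The sampling function calls `compute_ci` without defining it. One possible sketch, assuming a 95% interval from the normal approximation with a finite population correction (sampling is without replacement), is shown here. The signature mirrors the call in the sampling code; the `z` parameter and the choice of method are assumptions of this sketch, not something the text prescribes.

```python
import math
from typing import Any, Dict, List, Tuple


def compute_ci(sample_results: List[Dict[str, Any]], sample_size: int,
               population_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate CI for the verified rate (normal approximation with FPC)."""
    if sample_size <= 0 or population_size <= 0:
        raise ValueError('sample_size and population_size must be positive')
    if sample_size > population_size:
        raise ValueError('sample_size cannot exceed population_size')
    p = sum(1 for r in sample_results if r['verified']) / sample_size
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    # Finite population correction: the interval shrinks as the sample
    # approaches the whole population and vanishes for a full census.
    if population_size > 1:
        fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    else:
        fpc = 0.0
    margin = z * standard_error * fpc
    # Clamp to [0, 1] because a rate cannot fall outside that range.
    return (max(0.0, p - margin), min(1.0, p + margin))
```

The normal approximation collapses to a zero-width interval when every sampled response passes or every one fails, so for small samples or rates near 0 or 1 a Wilson score interval is usually a safer choice. Neither method corrects for an unrepresentative sample.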
The Infinite Loop Risk
Verification loops can fail to terminate if:
- The model keeps generating incorrect steps
- Verification criteria are too strict, so correct steps are rejected as errors (false positives)
- The model enters a regeneration cycle without improving
Set hard limits on verification iterations and monitor regeneration rates. A regeneration rate above 50% indicates the model frequently fails first-attempt reasoning—a training or prompting problem, not a verification problem.
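Monitoring the regeneration rate can be as simple as a rolling window over recent requests. The sketch below is one way to do it; the class name `RegenerationMonitor`, the window size, and the choice to count a request as "regenerated" when it used more than one generation are assumptions, while the 50% threshold comes from the text above.

```python
from collections import deque
from typing import Deque


class RegenerationMonitor:
    """Track the share of recent requests that needed at least one regeneration."""

    def __init__(self, window: int = 200, alert_threshold: float = 0.5) -> None:
        self.alert_threshold = alert_threshold
        # A bounded deque drops the oldest entry once the window is full.
        self._needed_regen: Deque[bool] = deque(maxlen=window)

    def record(self, attempts: int) -> None:
        """Record how many generations a request used (1 = first try passed)."""
        self._needed_regen.append(attempts > 1)

    @property
    def regeneration_rate(self) -> float:
        """Fraction of requests in the window that were regenerated."""
        if not self._needed_regen:
            return 0.0
        return sum(self._needed_regen) / len(self._needed_regen)

    def should_alert(self) -> bool:
        """True when the rate is strictly above the alert threshold."""
        return self.regeneration_rate > self.alert_threshold
```

Calling `record(attempts)` after each request and checking `should_alert()` periodically is enough to surface a sustained first-attempt failure problem.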
Implement a verification function for your reasoning outputs. Run it on 50 recent responses and calculate your verification failure rate. If failures exceed 10%, examine whether failures cluster by problem type or reasoning pattern.
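For the clustering check, a small helper that groups verification results by problem type may be enough to start. This is a sketch under assumptions: `failure_breakdown` is a hypothetical name, and each record is assumed to be a `(problem_type, result)` pair where `result` carries the `'verified'` flag used throughout this section.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple


def failure_breakdown(
    records: Iterable[Tuple[str, Dict[str, Any]]]
) -> Dict[str, Any]:
    """Compute the overall and per-problem-type verification failure rates."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for problem_type, result in records:
        totals[problem_type] += 1
        if not result['verified']:
            failures[problem_type] += 1
    n = sum(totals.values())
    return {
        # Overall share of responses that failed verification.
        'failure_rate': (sum(failures.values()) / n) if n else 0.0,
        # Per-type rates make clusters visible at a glance.
        'by_type': {t: failures[t] / totals[t] for t in totals},
    }
```

A type whose rate sits far above the overall rate is a candidate cluster, though with only 50 responses the per-type counts are small and should be read cautiously.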