16. Self-Consistency

Chapter 16 of 25 · 15 min

KEY INSIGHT

The mechanism relies on answer agreement across independent reasoning paths, not on confidence calibration or metadata.

```python
from collections import Counter


def self_consistency_query(model, problem, n_samples=5):
    """Generate multiple independent solutions and vote on the answer."""
    # The prompt is identical for every sample; diversity comes from
    # sampling at a nonzero temperature with a different seed per call.
    prompt = f"""Problem: {problem}

Reason through this step by step. Show your reasoning.
Your final answer should be clearly marked as:
ANSWER: [your answer]"""

    samples = []
    for i in range(n_samples):
        # model.generate stands in for whatever client API you use
        response = model.generate(prompt, temperature=0.8, seed=i * 42)
        samples.append(response)

    # Extract answers (simplified parsing)
    answers = [extract_final_answer(sample) for sample in samples]

    # Majority vote
    vote_counts = Counter(answers)
    consensus_answer, top_votes = vote_counts.most_common(1)[0]
    confidence = top_votes / n_samples  # share of samples that agree
    return consensus_answer, confidence, vote_counts
```

The temperature parameter controls how much the samples vary. Values below 0.3 produce near-identical samples, which defeats the purpose. Values above 1.0 produce increasingly random output whose reasoning stops being valid. A range of 0.6–0.9 works well for most models.

**Failure mode:** Voting on answers without first normalizing them to a canonical format produces false disagreements. The same mathematical answer may appear as "3", "three", "③", or "=3", and the voting mechanism counts each of these as a distinct answer.
```python
# Normalization step required before voting
import re
import unicodedata

NUM_WORDS = {
    'one': '1', 'two': '2', 'three': '3',
    'first': '1', 'second': '2', 'third': '3',
}


def normalize_answer(text):
    """Canonicalize answer formats before voting."""
    # Fold compatibility characters such as "③" into plain digits
    text = unicodedata.normalize('NFKC', text).lower()
    # Remove punctuation, keeping "." and "-" so decimals and negatives survive
    text = re.sub(r'[^\w\s.-]', '', text)
    # Convert words to numbers where applicable
    for word, num in NUM_WORDS.items():
        text = re.sub(rf'\b{word}\b', num, text)
    # Drop surrounding whitespace and any trailing sentence period
    return text.strip().rstrip('.').strip()
```

Self-consistency with 20 samples improved accuracy on reasoning benchmarks by 4–9% over single-sample chain reasoning. The gain diminishes above 15 samples: each additional sample costs the same compute but yields a progressively smaller accuracy improvement.
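The `extract_final_answer` helper called in `self_consistency_query` is not defined in this chapter. The following is one possible sketch, assuming the model follows the prompt and writes its result on a line beginning with `ANSWER:`. The regular expression and the choice to take the last marker are illustrative assumptions, not part of the original method.

```python
import re

# Assumption: the final answer sits on a single line after "ANSWER:".
ANSWER_PATTERN = re.compile(r"ANSWER:\s*(.+)", re.IGNORECASE)


def extract_final_answer(response):
    """Return the text after the last 'ANSWER:' marker, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return None  # the sample never produced a marked answer
    # Take the last marker in case the model revised its answer mid-response,
    # then drop the square brackets suggested by the prompt template.
    return matches[-1].strip().strip("[]").strip()
```

Samples for which this returns `None` are probably best dropped before voting; otherwise `None` competes as a candidate answer in the `Counter`.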

Self-consistency prompting generates multiple solution paths for a single problem, then selects the most frequently occurring answer. The insight is that correct reasoning paths tend to converge on the same answer, while incorrect ones scatter across different answers, even when each sounds equally confident.
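The converge-versus-diverge intuition can be checked with a small simulation. The sketch below is a toy model, not a measurement of any real system. It assumes that samples are independent, that each one is correct with a fixed probability, and that wrong answers scatter uniformly over a handful of distinct values. All parameter values are hypothetical.

```python
import random
from collections import Counter


def simulate(p_correct=0.4, n_samples=9, n_wrong=10, trials=2000, seed=0):
    """Compare single-sample accuracy with plurality-vote accuracy."""
    rng = random.Random(seed)
    single_hits = 0
    vote_hits = 0
    for _ in range(trials):
        # Each sample is right with probability p_correct; otherwise it lands
        # on one of n_wrong distinct wrong answers chosen uniformly.
        samples = [
            "correct" if rng.random() < p_correct
            else f"wrong_{rng.randrange(n_wrong)}"
            for _ in range(n_samples)
        ]
        single_hits += samples[0] == "correct"  # baseline: first sample only
        winner = Counter(samples).most_common(1)[0][0]
        vote_hits += winner == "correct"
    return single_hits / trials, vote_hits / trials


single_acc, vote_acc = simulate()
print(f"single-sample accuracy:  {single_acc:.2f}")
print(f"plurality-vote accuracy: {vote_acc:.2f}")
```

Under these assumptions the vote wins far more often than a single sample does, because the wrong answers split their votes. If errors are correlated, so that many samples share the same wrong answer, the advantage shrinks or disappears.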

EXERCISE

Implement self-consistency querying for a code generation task. Generate 5 samples for each of 10 test cases, record the consensus answer and vote margin, then compare consensus accuracy against single-sample baseline accuracy.
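As a possible starting point for the exercise, the scoring side can be sketched as below. It assumes each sample has already been reduced to a comparable, normalized answer string (for code generation, for example, the output of running the generated program on the test input, since raw code strings rarely match exactly). The function names, the margin definition, and the placeholder data are all hypothetical.

```python
from collections import Counter


def vote(answers):
    """Return (consensus, margin); margin = top share minus runner-up share."""
    counts = Counter(answers).most_common()
    top_answer, top_count = counts[0]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    return top_answer, (top_count - runner_up) / len(answers)


def evaluate(cases):
    """cases: list of (samples, expected) pairs of normalized answer strings."""
    rows = []
    for samples, expected in cases:
        consensus, margin = vote(samples)
        rows.append({
            "consensus": consensus,
            "margin": margin,
            "consensus_correct": consensus == expected,
            # Single-sample baseline: score only the first sample
            "baseline_correct": samples[0] == expected,
        })
    n = len(rows)
    consensus_acc = sum(r["consensus_correct"] for r in rows) / n
    baseline_acc = sum(r["baseline_correct"] for r in rows) / n
    return rows, consensus_acc, baseline_acc


# Placeholder data for two test cases; replace with your 10 x 5 samples.
cases = [
    (["4", "4", "5", "4", "4"], "4"),
    (["7", "9", "9", "9", "8"], "9"),
]
rows, consensus_acc, baseline_acc = evaluate(cases)
print(consensus_acc, baseline_acc)
```

A low vote margin flags test cases where the consensus is fragile and worth inspecting by hand.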