04. Self-Consistency

Chapter 4 of 18 · 20 min

Chain-of-thought reasoning generates a single reasoning path. Self-consistency samples multiple reasoning paths and selects the answer that appears most frequently. The intuition: correct reasoning tends to converge on the same answer through different paths, while incorrect reasoning produces more varied answers.

The procedure is disarmingly simple: generate N responses with chain-of-thought prompts, extract the final answer from each, and count how often each answer occurs. The most common answer is treated as the "most consistent" one and is selected as the final output. This requires answers that are discrete and countable, such as a multiple-choice letter, a diagnosis chosen from a fixed list, or a classification label. Free-form outputs like a code implementation rarely match exactly across samples, so they cannot be voted on directly.

Self-consistency works best when the model genuinely can solve the problem but might make reasoning errors on single attempts. Problems the model cannot solve at all don't benefit from multiple reasoning paths: if most samples are wrong in the same way, the vote just returns that wrong answer with high apparent confidence.
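To make that intuition concrete, here is a minimal sketch under simplifying assumptions that the chapter itself does not make: a question with only two possible answers, and samples that are independent and each correct with the same probability p. The function name is hypothetical. Under those assumptions the chance that the correct answer wins a strict majority is a binomial tail sum.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """Probability that the correct answer wins a strict majority of n votes.

    Assumes a two-answer question and independent samples, each correct
    with probability p. Ties (possible when n is even) count as failures.
    """
    needed = n // 2 + 1  # smallest strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


# Compare a model that is usually right (p=0.6) with one that is usually wrong (p=0.4)
for p in (0.6, 0.4):
    for n in (1, 5, 15):
        print(f"p={p}, n={n}: {majority_correct_probability(p, n):.3f}")
```

When p is above 0.5 the probability rises with n; when p is below 0.5 it falls, which is the "confident wrong answer" effect. Real samples from one model are correlated, so treat these numbers as an idealized picture.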

from collections import Counter
import re

import ollama  # Ollama Python client; needs a running Ollama server

def extract_final_answer(response: str, answer_format: str = "letter") -> str:
    """Extract the final answer from a chain-of-thought response.
    
    Supports formats:
    - 'letter': extracts a single letter like 'A', 'B', 'C'
    - 'number': extracts a number (the last one if none is bracketed)
    - 'yesno': maps to 'yes' or 'no'
    """
    response = response.strip()
    
    if answer_format == "letter":
        # Prefer the last bracketed letter, e.g. "[B]", as the prompt requests
        bracketed = re.findall(r'\[([A-Z])\]', response)
        if bracketed:
            return bracketed[-1]
        
        # Fall back to the last standalone capital letter, scanning from the
        # final line upward (may misfire on words such as "I" or "A")
        lines = [ln.strip() for ln in response.split('\n') if ln.strip()]
        for line in reversed(lines):
            letters = re.findall(r'\b([A-Z])\b', line)
            if letters:
                return letters[-1]
    
    elif answer_format == "number":
        # Prefer the last bracketed number, then the last number anywhere;
        # the first number is usually part of the reasoning, not the answer
        numbers = (re.findall(r'\[(\d+(?:\.\d+)?)\]', response)
                   or re.findall(r'\d+(?:\.\d+)?', response))
        if numbers:
            return numbers[-1]
    
    elif answer_format == "yesno":
        # Match whole words so "know" or "not" do not count as "no";
        # the final answer comes after the reasoning, so take the last match
        matches = re.findall(r'\b(yes|no)\b', response.lower())
        if matches:
            return matches[-1]
    
    return response[-20:]  # Fallback to last 20 chars


def self_consistency_sample(
    question: str,
    num_samples: int = 10,
    model: str = "llama3.2",
    answer_format: str = "letter"
) -> tuple[str, dict]:
    """Sample N chain-of-thought responses; return the most common answer and vote statistics."""
    
    cot_template = """Think through this problem step by step.
Show your reasoning clearly, then state your final answer in brackets, e.g. [A] or [42].

Question: {question}

Reasoning:"""
    
    answers = []
    reasoning_samples = []
    
    for _ in range(num_samples):
        # temperature > 0 so the samples follow different reasoning paths
        response = ollama.generate(
            model=model,
            prompt=cot_template.format(question=question),
            options={'temperature': 0.7}
        )
        
        reasoning = response['response']
        reasoning_samples.append(reasoning)
        
        # Use the caller's format so numeric or yes/no questions are not parsed as letters
        answer = extract_final_answer(reasoning, answer_format=answer_format)
        answers.append(answer)
    
    # Count answer frequencies
    answer_counts = Counter(answers)
    most_common_answer, count = answer_counts.most_common(1)[0]
    
    consistency_ratio = count / num_samples
    
    return most_common_answer, {
        'total_samples': num_samples,
        'answer_counts': dict(answer_counts),
        'consistency_ratio': consistency_ratio,
        'winning_answer': most_common_answer
    }

The function also returns confidence metadata: how many of the N samples produced the winning answer. Low consistency (e.g., only 30% of samples agree) indicates that the reasoning paths are not converging, so the answer should be treated with caution or the question reformulated.
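One way to act on that metadata is to flag weak results explicitly. The sketch below is an illustration only: the function name, the `min_ratio` threshold of 0.5, and the tie check are assumptions, not part of the listing above.

```python
from collections import Counter


def vote_with_confidence(answers: list[str], min_ratio: float = 0.5) -> dict:
    """Majority vote that also reports whether the result looks trustworthy.

    min_ratio is a hypothetical cutoff; tune it on your own data.
    Expects a non-empty list of already-extracted answers.
    """
    ranked = Counter(answers).most_common()
    winner, top = ranked[0]
    # A tie for first place means there is no real majority answer
    tied = len(ranked) > 1 and ranked[1][1] == top
    ratio = top / len(answers)
    return {
        "answer": winner,
        "consistency_ratio": ratio,
        "tied": tied,
        "low_confidence": tied or ratio < min_ratio,
    }


# Ten extracted answers where the winner has only three votes
result = vote_with_confidence(["B", "A", "B", "C", "D", "A", "E", "B", "C", "D"])
print(result)
```

A caller can then route low-confidence questions to a human, a stronger model, or a reworded prompt instead of returning the winner blindly.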

A failure mode worth noting: when the same answer is phrased differently across samples, a naive extractor may not recognize the variants as identical. "The answer is A" and "A is the correct answer" both mean A, but extraction logic that handles only one phrasing will miss the other and split the vote. Building reliable extractors for specific answer formats is a prerequisite for effective self-consistency.
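A small normalization step before counting can reduce vote splitting. The helper below is a hypothetical sketch for letter answers only; it assumes the answer has already been cut down to a short string and simply canonicalizes common decorations.

```python
import re
from collections import Counter


def normalize_letter(raw: str) -> str:
    """Reduce variants such as "(a)", "a.", or "[A]" to a canonical "A"."""
    cleaned = raw.strip().upper()
    match = re.fullmatch(r'[\[\(]?\s*([A-Z])\s*[\]\)]?[.:]?', cleaned)
    # Leave anything unrecognized untouched so it stays visible in the counts
    return match.group(1) if match else cleaned


raw_answers = ["A", "(a)", "a.", "[A]", "B", " b "]
counts = Counter(normalize_letter(a) for a in raw_answers)
print(counts)
```

Keeping unrecognized strings in the tally, rather than dropping them, makes extraction failures easy to spot when you inspect the answer counts.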

Self-consistency typically shows diminishing returns past 10 to 20 samples for most domains. If 10 samples show 70% agreement, adding more rarely changes the winning answer, while cost grows linearly with the sample count. Additional samples are most useful when the consistency ratio is borderline (40-60%) and you need to tell whether the variation comes from reasoning noise or from an ambiguous question.
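If cost matters, one option is to stop sampling once the leading answer can no longer be overtaken within the remaining budget. This early-stopping rule is an assumption added here for illustration, not part of the method as described above, and the function name is hypothetical.

```python
from collections import Counter


def lead_is_safe(answers: list[str], max_samples: int) -> bool:
    """True if no other answer can catch the leader within the sample budget."""
    top_two = Counter(answers).most_common(2)
    leader = top_two[0][1]
    runner_up = top_two[1][1] if len(top_two) > 1 else 0
    remaining = max_samples - len(answers)
    # Worst case: every remaining sample goes to the runner-up
    return leader > runner_up + remaining


votes = ["A"] * 7 + ["B"] * 3
print(lead_is_safe(votes, max_samples=20))  # 10 samples left in the budget
print(lead_is_safe(votes, max_samples=13))  # 3 samples left in the budget
```

The check is deliberately conservative: it only guarantees the winner, and says nothing about how the final consistency ratio will look.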

EXERCISE

Take a question-and-answer dataset from your domain (or create one with 20 questions). Run self-consistency with 5, 10, and 20 samples for each question. Plot consistency ratio against accuracy on the true answers. Document the point at which additional samples stop improving accuracy.
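As a possible starting point for the analysis step, the sketch below groups results into consistency-ratio bins and reports accuracy per bin. The function name, the bin count, and the sample records are all made up for illustration; substitute the ratios and correctness flags from your own runs.

```python
from collections import defaultdict


def accuracy_by_consistency(records: list[tuple[float, bool]], bins: int = 5) -> dict:
    """Group (consistency_ratio, is_correct) pairs into equal-width bins."""
    buckets = defaultdict(list)
    for ratio, correct in records:
        index = min(int(ratio * bins), bins - 1)  # ratio 1.0 joins the top bin
        buckets[index].append(correct)
    # Map each (low, high) bin range to the fraction of correct answers in it
    return {
        (i / bins, (i + 1) / bins): sum(hits) / len(hits)
        for i, hits in sorted(buckets.items())
    }


# Made-up records purely for illustration
records = [(0.3, False), (0.45, False), (0.55, True), (0.9, True), (1.0, True)]
table = accuracy_by_consistency(records)
for (low, high), accuracy in table.items():
    print(f"{low:.1f}-{high:.1f}: {accuracy:.2f}")
```

Running this once per sample count (5, 10, 20) gives three tables you can plot side by side.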