RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Advanced Prompt Engineering
  6. /Ch. 3
Advanced Prompt Engineering

03. Tree-of-Thought

Chapter 3 of 18 · 20 min
KEY INSIGHT

Tree-of-thought converts the reasoning problem from a single-path generation to a best-of-n exploration, with the tradeoff of multiplied compute and evaluation bias.

Chain-of-thought reasoning follows a single path: from input through intermediate steps to output. This is sufficient when reasoning steps are linearly dependent—when each step requires the previous. But many problems have branching structure: multiple possible approaches, and the best approach depends on which path the reasoning follows.

Tree-of-thought (ToT) prompting explicitly explores multiple reasoning paths rather than committing to the first plausible approach. The model generates multiple candidate next steps, evaluates each, prunes unlikely branches, and continues from promising nodes. This transforms single-answer generation into best-of-n search.

The practical implementation requires two distinct prompt types: a "generator" that proposes candidate next steps and a "evaluator" that scores candidates on viability. The generator prompts for a fixed number of alternatives (typically 3-5), and the evaluator assigns each a quality rating or viability score. The process then recurses from the top-rated candidates.

Consider a creative writing task where the user wants a plot twist. Basic prompting makes one choice and commits to it. ToT prompting would generate 4-5 possible twist directions, evaluate each for narrative coherence and surprise factor, select the two highest-scoring, and develop those more fully before making a final selection.

def generate_candidates(state: str, num_candidates: int = 4) -> list[str]:
    """Generate multiple candidate next steps."""
    prompt = f"""Given the current state, generate {num_candidates} distinct 
possible next steps or directions. Each should be a different approach.
Return them as a numbered list.

Current state: {state}

Possible next steps:"""

    response = ollama.generate(
        model='llama3.2',
        prompt=prompt,
        options={'temperature': 0.7}  # Higher temperature for diversity
    )
    
    candidates = []
    for line in response['response'].split('\n'):
        if line.strip() and line[0].isdigit():
            # Extract candidate after number
            candidate = line.split('.', 1)[-1].strip()
            candidates.append(candidate)
    
    return candidates[:num_candidates]


def evaluate_candidate(candidate: str, goal: str) -> float:
    """Score a candidate's viability for the goal."""
    prompt = f"""Rate this candidate approach on a scale of 0-10 for how well 
it moves toward the stated goal. Return only the number.

Candidate: {candidate}
Goal: {goal}

Score:"""

    response = ollama.generate(
        model='llama3.2',
        prompt=prompt,
        options={'temperature': 0.1}  # Low temperature for consistent scoring
    )
    
    try:
        score = float(response['response'].strip())
        return max(0.0, min(10.0, score))  # Clamp to valid range
    except ValueError:
        return 5.0  # Default mid-score on parse failure


def tree_of_thought(initial_state: str, goal: str, max_depth: int = 3) -> str:
    """Perform tree-of-thought exploration."""
    current_level = [initial_state]
    
    for depth in range(max_depth):
        next_level = []
        
        for state in current_level:
            candidates = generate_candidates(state)
            
            # Evaluate and sort candidates
            scored = [(c, evaluate_candidate(c, goal)) for c in candidates]
            scored.sort(key=lambda x: x[1], reverse=True)
            
            # Keep top candidates for next level
            top_candidates = [c for c, s in scored[:2]]
            next_level.extend(top_candidates)
        
        current_level = list(set(next_level))  # Deduplicate
    
    # Final selection from best scoring final candidates
    final_scores = [(s, evaluate_candidate(s, goal)) for s in current_level]
    final_scores.sort(key=lambda x: x[1], reverse=True)
    
    return final_scores[0][0]

# Example usage
story_hook = "A detective finds a watch with an inscription that suggests the victim was their birth parent"
goal = "A surprising but internally consistent plot twist"

final_twist = tree_of_thought(story_hook, goal)
print(final_twist)

A critical limitation: tree-of-thought multiplies API calls and context usage. Each depth level requires generating candidates and evaluating them. For simple tasks, this overhead exceeds the benefit. The method provides value primarily when the space of valid approaches is large and the cost of exploring wrong approaches is high.

One common failure: the evaluation step uses the same model being used for generation. The model evaluating its own candidates has the same biases as the model generating them. This leads to favoring paths that the model finds "comfortable" rather than objectively better paths. External validation or multi-model evaluation can partially address this.

EXERCISE

Identify a task where multiple reasonable approaches exist and the wrong approach wastes significant time. Implement tree-of-thought with depth 2 and 3 candidates. Measure whether the best ToT path outperforms the path a single CoT prompt would have taken. Document cases where ToT does not help.

← Chapter 2
Advanced Chain-of-Thought
Chapter 4 →
Self-Consistency