03. Tree-of-Thought
Chain-of-thought reasoning follows a single path: from input through intermediate steps to output. This is sufficient when reasoning steps are linearly dependent—when each step requires the previous. But many problems have branching structure: multiple possible approaches, and the best approach depends on which path the reasoning follows.
Tree-of-thought (ToT) prompting explicitly explores multiple reasoning paths rather than committing to the first plausible approach. The model generates multiple candidate next steps, evaluates each, prunes unlikely branches, and continues from promising nodes. This transforms single-answer generation into best-of-n search.
The practical implementation requires two distinct prompt types: a "generator" that proposes candidate next steps and a "evaluator" that scores candidates on viability. The generator prompts for a fixed number of alternatives (typically 3-5), and the evaluator assigns each a quality rating or viability score. The process then recurses from the top-rated candidates.
Consider a creative writing task where the user wants a plot twist. Basic prompting makes one choice and commits to it. ToT prompting would generate 4-5 possible twist directions, evaluate each for narrative coherence and surprise factor, select the two highest-scoring, and develop those more fully before making a final selection.
def generate_candidates(state: str, num_candidates: int = 4) -> list[str]:
"""Generate multiple candidate next steps."""
prompt = f"""Given the current state, generate {num_candidates} distinct
possible next steps or directions. Each should be a different approach.
Return them as a numbered list.
Current state: {state}
Possible next steps:"""
response = ollama.generate(
model='llama3.2',
prompt=prompt,
options={'temperature': 0.7} # Higher temperature for diversity
)
candidates = []
for line in response['response'].split('\n'):
if line.strip() and line[0].isdigit():
# Extract candidate after number
candidate = line.split('.', 1)[-1].strip()
candidates.append(candidate)
return candidates[:num_candidates]
def evaluate_candidate(candidate: str, goal: str) -> float:
"""Score a candidate's viability for the goal."""
prompt = f"""Rate this candidate approach on a scale of 0-10 for how well
it moves toward the stated goal. Return only the number.
Candidate: {candidate}
Goal: {goal}
Score:"""
response = ollama.generate(
model='llama3.2',
prompt=prompt,
options={'temperature': 0.1} # Low temperature for consistent scoring
)
try:
score = float(response['response'].strip())
return max(0.0, min(10.0, score)) # Clamp to valid range
except ValueError:
return 5.0 # Default mid-score on parse failure
def tree_of_thought(initial_state: str, goal: str, max_depth: int = 3) -> str:
"""Perform tree-of-thought exploration."""
current_level = [initial_state]
for depth in range(max_depth):
next_level = []
for state in current_level:
candidates = generate_candidates(state)
# Evaluate and sort candidates
scored = [(c, evaluate_candidate(c, goal)) for c in candidates]
scored.sort(key=lambda x: x[1], reverse=True)
# Keep top candidates for next level
top_candidates = [c for c, s in scored[:2]]
next_level.extend(top_candidates)
current_level = list(set(next_level)) # Deduplicate
# Final selection from best scoring final candidates
final_scores = [(s, evaluate_candidate(s, goal)) for s in current_level]
final_scores.sort(key=lambda x: x[1], reverse=True)
return final_scores[0][0]
# Example usage
story_hook = "A detective finds a watch with an inscription that suggests the victim was their birth parent"
goal = "A surprising but internally consistent plot twist"
final_twist = tree_of_thought(story_hook, goal)
print(final_twist)
A critical limitation: tree-of-thought multiplies API calls and context usage. Each depth level requires generating candidates and evaluating them. For simple tasks, this overhead exceeds the benefit. The method provides value primarily when the space of valid approaches is large and the cost of exploring wrong approaches is high.
One common failure: the evaluation step uses the same model being used for generation. The model evaluating its own candidates has the same biases as the model generating them. This leads to favoring paths that the model finds "comfortable" rather than objectively better paths. External validation or multi-model evaluation can partially address this.
Identify a task where multiple reasonable approaches exist and the wrong approach wastes significant time. Implement tree-of-thought with depth 2 and 3 candidates. Measure whether the best ToT path outperforms the path a single CoT prompt would have taken. Document cases where ToT does not help.