Large language models

Tree of Thoughts

Tree of Thoughts (ToT) is a prompting strategy that expands a single chain of reasoning into a tree of multiple reasoning paths, evaluating each step with a heuristic (e.g., a model self-score) and pruning low-probability branches. Operators encounter ToT when they want to solve multi-step problems (e.g., math puzzles, planning) where a single chain-of-thought often fails. ToT requires multiple model calls per step, increasing latency and token usage, but can improve accuracy on tasks that benefit from exploration and backtracking.

Deeper dive

Tree of Thoughts generalizes Chain-of-Thought (CoT) by allowing the model to explore multiple reasoning paths at each step. At each step, the model generates several candidate 'thoughts' (e.g., via sampling with temperature > 0). A heuristic evaluator (often the same model, prompted to score) assesses each candidate's promise. The top-k candidates are kept, and the process repeats. This breadth-first search continues until a solution is found or a depth limit is reached. ToT can be implemented with a loop in Python using an LLM API, but it is not natively supported in most local inference servers like llama.cpp or Ollama. Operators typically implement ToT via custom scripts using Hugging Face Transformers or an OpenAI-compatible endpoint. The cost is high: a single ToT run may consume 10–100x more tokens than a simple CoT.

Practical example

An operator wants to solve a 24-game puzzle (use 4 numbers to make 24). With CoT, the model might guess an expression and fail. With ToT, the operator writes a script that at each step asks the model to propose 3 candidate next operations (e.g., 'add 3 and 5'), scores each with a prompt like 'Is this step promising?', keeps the top 2, and recurses. On an RTX 4090, each step may take ~1 second, and a full tree of depth 4 with branching factor 3 might take 40+ seconds and 10k tokens.

Workflow example

In practice, an operator using Ollama would not run ToT directly; instead they'd write a Python script that calls ollama run llama3.1:8b in a loop. The script maintains a list of partial solutions, generates next thoughts via the model, evaluates them, and prunes. For example, using the OpenAI-compatible endpoint at http://localhost:11434/v1, the operator sends multiple requests per step. The workflow is custom and not built into any local inference server as of 2025.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work