RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
Glossary / Large language models / Tree of Thoughts
Large language models

Tree of Thoughts

Tree of Thoughts (ToT) is a prompting strategy that expands a single chain of reasoning into a tree of multiple reasoning paths, evaluating each step with a heuristic (e.g., a model self-score) and pruning low-probability branches. Operators encounter ToT when they want to solve multi-step problems (e.g., math puzzles, planning) where a single chain-of-thought often fails. ToT requires multiple model calls per step, increasing latency and token usage, but can improve accuracy on tasks that benefit from exploration and backtracking.

Deeper dive

Tree of Thoughts generalizes Chain-of-Thought (CoT) by allowing the model to explore multiple reasoning paths at each step. At each step, the model generates several candidate 'thoughts' (e.g., via sampling with temperature > 0). A heuristic evaluator (often the same model, prompted to score) assesses each candidate's promise. The top-k candidates are kept, and the process repeats. This breadth-first search continues until a solution is found or a depth limit is reached. ToT can be implemented with a loop in Python using an LLM API, but it is not natively supported in most local inference servers like llama.cpp or Ollama. Operators typically implement ToT via custom scripts using Hugging Face Transformers or an OpenAI-compatible endpoint. The cost is high: a single ToT run may consume 10–100x more tokens than a simple CoT.

Practical example

An operator wants to solve a 24-game puzzle (use 4 numbers to make 24). With CoT, the model might guess an expression and fail. With ToT, the operator writes a script that at each step asks the model to propose 3 candidate next operations (e.g., 'add 3 and 5'), scores each with a prompt like 'Is this step promising?', keeps the top 2, and recurses. On an RTX 4090, each step may take ~1 second, and a full tree of depth 4 with branching factor 3 might take 40+ seconds and 10k tokens.

Workflow example

In practice, an operator using Ollama would not run ToT directly; instead they'd write a Python script that calls ollama run llama3.1:8b in a loop. The script maintains a list of partial solutions, generates next thoughts via the model, evaluates them, and prunes. For example, using the OpenAI-compatible endpoint at http://localhost:11434/v1, the operator sends multiple requests per step. The workflow is custom and not built into any local inference server as of 2025.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides
  • Best GPU for local AI →
  • Best laptop for local AI →
  • Best Mac for local AI →
When it doesn't work
  • CUDA out of memory →
  • Ollama running slowly →
  • ROCm not detected →