RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
Tasks/Scientific/Theorem Proving
Scientific
formal verification ai
lean ai
proof assistant ai

Theorem Proving

AI-assisted formal theorem proving in Lean, Coq, Isabelle. DeepSeek-Prover, Lean Copilot, AlphaProof-lineage.

Capability notes

AI-assisted theorem proving in 2026 operates primarily through **Lean 4** — the dominant interactive theorem prover with the largest open-source math library (mathlib4, 1.5M+ theorems). **Coq/Rocq** has deeper formalization history but weaker LLM tooling. The AI integration story centers on lean-copilot, a VS Code extension connecting proof context to LLM backends. **What LLMs can do.** The capability ceiling is **proof completion**: given a theorem statement and a partial proof skeleton, an LLM fills in remaining tactics. On standard library proofs (algebra, number theory, linear algebra), [DeepSeek V3](/models/deepseek-v3) and [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) achieve 40-55% completion rates on LeanDojo, with correct proofs requiring 2-4 LLM attempts per lemma. Completion drops to 15-25% on novel proofs requiring synthesis of multiple mathlib lemmas without local syntactic overlap. **What LLMs cannot do.** Autonomous proof generation from plain-English statements fails 90%+. LLMs cannot detect circular reasoning — generated proofs that assume the theorem being proved will type-check but are logically vacuous. They struggle with dependent type manipulations, universe level constraints, and termination proofs for recursive functions — these are architecture-agnostic failures, not model-scale limitations. LLMs systematically generate the longer, more fragile proof path when multiple valid proofs exist. **lean-copilot** feeds proof context (theorem statement, hypotheses, goals, open namespaces) from the Lean 4 LSP server to an LLM backend. Two modes: auto-complete (fills the next tactic, 50-60% acceptance) and full-proof (attempts to close all goals, 15-25% acceptance). Full-proof mode is an exploration tool, not a trusted proof generator. **Landscape.** Lean 4 is winning — mathlib4 is the fastest-growing formal math library. Coq/Rocq has stronger extraction-to-code capabilities (verified algorithms to OCaml/Haskell) but sparse AI tooling. Isabelle and HOL have niche formalization communities with negligible AI integration. For AI-assisted proving, Lean 4 is the practical choice.

If you just want to try this

Lowest-friction path to a working setup.

Install Lean 4 and lean-copilot for VS Code. Budget 2-3 weeks to learn Lean syntax through the Natural Number Game and "Theorem Proving in Lean 4" chapters 1-5 before AI assistance becomes productive. Step 1: Install Lean 4 via `elan` (Lean's version manager). On macOS/Linux: `curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh`. Windows: use the official installer. Verify: `lean --version` — need 4.7+. Step 2: Install VS Code with the `lean4` extension for syntax highlighting and LSP. Open a `.lean` file; the infoview panel shows proof goals. Step 3: Install lean-copilot from the VS Code marketplace. Configure the backend: [vLLM](/tools/vllm) for local, or OpenAI-compatible API. For local inference, serve [DeepSeek V3](/models/deepseek-v3) or [Llama 3.3 70B](/models/llama-3-3-70b) via vLLM. Model quality matters decisively — 7B and 13B models cannot produce useful Lean proofs. 70B+ is the practical floor. Step 4: Write a theorem statement in Lean syntax. lean-copilot reads the goal state from LSP. Invoke "Generate proof." The LLM returns a proof block; Lean's kernel type-checks it. If it fails, lean-copilot retries with the error as feedback. What you get: an interactive assistant where the LLM suggests tactics and the human evaluates correctness, logical coherence, and proof quality. The human-in-the-loop is essential — you must verify the LLM proved the intended theorem, not a modified version that happens to type-check. Critical: you need a 70B+ model served locally or API access to a frontier model. lean-copilot with a 7B model generates syntactically valid Lean that type-checks but proves vacuous or incorrect statements.

For production deployment

Operator-grade recommendation.

For operators evaluating AI theorem proving in a verification pipeline, the central question: does AI assistance provide net productivity gain or net verification overhead? **When it's productive.** Taxonomic proofs (proving a new type class instance satisfies its parent's axioms) follow rigid structural patterns that LLMs handle at 60-80% success. Lemma variants of existing mathlib lemmas. Mechanical steps: `simp` chains, `ring` simplifications, `linarith` arithmetic — tedious for humans, LLMs complete at 60-80%. A human formalizer who spends 30 minutes on mechanical lemma plumbing instead spends 5 minutes prompting and 10 minutes reviewing. The productivity gain is real for these categories. **When it's counterproductive.** Novel research-level proofs — LLMs cannot generate proofs for theorems absent from training data. They hallucinate plausible Lean code that type-checks but proves a weaker or different statement. Complex termination proofs — LLMs systematically fail at proving recursive termination because they cannot represent recursion structure. Universe-polymorphic reasoning and category-theoretic constructions push against Lean's inference and LLM capabilities. **Verification of generated proofs.** A Lean proof that type-checks is mathematically correct — the kernel guarantees it. The failure: the LLM rewrites the goal into something it can prove, proves that, and presents it as the original. Lean's kernel cannot detect this because the redefinition is valid Lean — it just doesn't correspond to the intended theorem. The human's essential role: verify the *proven statement* is semantically identical to the *intended theorem*. This is the most important step in an AI-augmented proof pipeline. **Pipeline architecture.** lean-copilot + [vLLM](/tools/vllm) serving a 70B+ model. For batch automation: create a Lean file with multiple lemma statements, run lean-copilot on each, collect type-checked proofs, flag proofs needing human semantic review. For interactive development: human and LLM collaborate with the LLM suggesting tactics and the human steering proof structure. **Model selection.** [DeepSeek V3](/models/deepseek-v3) leads on LeanDojo (~55% completion). [Llama 3.3 70B](/models/llama-3-3-70b) is competitive (~48%) and runs on a single [RTX 4090](/hardware/rtx-4090) at Q4 with partial offload. 32B models are not useful — completion drops below 20% and errors exceed 50% on simple lemma plumbing. Frontier closed-source models (Claude 3.7, GPT-5) perform well but require cloud API access, conflicting with some verification security requirements.

What breaks

Failure modes operators see in the wild.

- **Proof that type-checks but proves the wrong theorem.** The LLM subtly modifies the goal statement — adding an unnecessary hypothesis, weakening the conclusion, or specializing the type. Lean accepts it because the modified statement is valid. Symptom: proof compiles but a human reviewer discovers the proven theorem is a trivial corollary, not the target. Mitigation: inspect the exact goal statement the LLM proved. Implement a goal-diff check in lean-copilot: compare initial and post-tactic goals; flag changes. Never auto-merge generated proofs without human statement verification. - **Hallucinated lemmas that don't exist in mathlib.** The LLM invents lemma names following mathlib conventions that reference non-existent theorems. Symptom: Lean reports "unknown identifier" errors. Mitigation: use `#check` to verify lemma existence before accepting LLM suggestions. Search mathlib index (loogle.lean-lang.org) for each suggested lemma. Approximately 20-30% of LLM-suggested lemma names are hallucinations. - **Infinite proof search loops.** On hard theorems, the LLM generates a tactic, Lean rejects it, the error feeds back, and the LLM tries again — looping indefinitely. Symptom: GPU utilization at 100% for minutes with no proof progress. Mitigation: cap at 10 LLM attempts per goal. If the LLM suggests the same tactic 3 times consecutively, terminate — the theorem exceeds LLM capability. - **Lean 3 and Lean 4 syntax confusion.** Training data includes both versions. The LLM mixes syntax: `rw` vs `rw []`, `begin...end` blocks in Lean 4. Symptom: compile errors that waste time. Mitigation: use models trained predominantly on Lean 4 corpora (DeepSeek V3, Llama 3.3). This failure mode is annoying but not dangerous — it produces errors, not incorrect proofs. - **Informal-to-formal translation failure.** The LLM generates a correct natural language proof but Lean code implementing a different logic. Symptom: the human reads the text, agrees it's right, then discovers the code proves something else. Mitigation: never trust natural language explanations. Only the type-checked Lean code is the proof. Review Lean code directly; treat natural language output as commentary.

Hardware guidance

**Hobbyist: Consumer GPU with 16GB+ VRAM** [RTX 4070 Ti 16GB](/hardware/rtx-4070-ti) or [RTX 4080 Super 16GB](/hardware/rtx-4080-super) runs Llama 3.3 70B at Q4 with partial offload at 10-15 tok/s. Proof suggestion latency: 3-8 seconds — acceptable for interactive use. [MacBook Pro 16 M4 Max](/hardware/macbook-pro-16-m4-max) with 64GB unified memory runs the same model at 15-20 tok/s. 12GB GPUs cannot fit 70B models for useful Lean proof generation. **SMB: 2-4 person formalization team** One [RTX 4090](/hardware/rtx-4090) (24GB) serving Llama 3.3 70B Q4 via [vLLM](/tools/vllm) to 2-4 VS Code instances at 20-30 tok/s shared. vLLM continuous batching multiplexes requests. Cost: ~$1,800 GPU + ~$500 system = $2,300 one-time. For teams formalizing textbooks or verifying cryptographic protocols, this is dramatically cheaper than API calls per proof step. **Enterprise: Formal verification lab** [NVIDIA L40S](/hardware/nvidia-l40s) (48GB) or [RTX 6000 Ada](/hardware/rtx-6000-ada) (48GB) serves DeepSeek V3 at FP8 or multiple concurrent Llama 3.3 instances handling 5-10 formalizers at sub-5-second latency. Colocate GPU with the team for sub-10ms network latency — cross-continent latency adds 100-200ms per round-trip, compounding across proof steps. Air-gapped deployment satisfies defense and fintech verification requirements. **Frontier: Dedicated verification cluster** [NVIDIA H100 PCIe](/hardware/nvidia-h100-pcie) (80GB) serves DeepSeek V3 at FP16 with 2 TB/s bandwidth, reducing latency to 1-2 seconds. A 4x H100 cluster serves 20+ formalizers. Justified only for critical infrastructure verification (compiler correctness, OS kernel properties, cryptographic protocol security) where a verification error has regulatory or safety consequences.

Runtime guidance

**Individual Lean user with a 70B-capable GPU? → lean-copilot + vLLM** lean-copilot communicates with any OpenAI-compatible API. Point it at a local [vLLM](/tools/vllm) instance serving [DeepSeek V3](/models/deepseek-v3) or [Llama 3.3 70B](/models/llama-3-3-70b). vLLM is preferred over [llama.cpp](/tools/llama-cpp) — continuous batching reduces latency for bursty proof queries. llama.cpp's sequential batch adds 1-3 seconds per query. Configure via VS Code settings: API endpoint to `localhost:8000/v1`, model name to your served model. Enable full-proof mode sparingly. **Team Lean AI deployment? → vLLM + shared GPU server** One GPU running vLLM with concurrent instances serves 2-10 lean-copilot clients. Set `max-model-len` to the model's full context window (32K for Llama 3.3, 128K for DeepSeek V3) — proofs pull in large mathlib contexts. Provision 5-8GB VRAM per concurrent user for KV cache. **Air-gapped or classified environment? → Air-gapped vLLM + local weights** vLLM runs offline. Download model weights once into the enclave, serve via vLLM, connect classified VS Code workstations. Mirror Lean 4 toolchain (kernel, mathlib cache) inside the enclave — `lake` fetches from a local mirror. Eliminates API-as-attack-surface risk. **Experimenting with Coq/Rocq? → Manual LSP integration** No lean-copilot equivalent for Coq. The Coq LSP exposes proof state through LSP protocol; build middleware to extract goals, format for LLM, receive tactics, insert into buffer. Plan 3-6 weeks of engineering for a minimum viable Coq AI assistant. Coq's LLM training data is sparser — 10-15% lower completion rates than Lean. For organizations not committed to Coq, Lean 4 is the more AI-viable system. **Specialized math models?** As of mid-2026, no open-weight model is fine-tuned specifically for theorem proving above baseline. DeepSeek V3 and Llama 3.3 are general-purpose models trained on enough mathlib/arXiv data to perform adequately. Monitor DeepSeek and Qwen families for future math-specific fine-tunes.

Setup walkthrough

  1. Install Lean 4: follow the installation guide at lean-lang.org (VS Code extension + elan toolchain manager). Takes ~10 minutes.
  2. git clone https://github.com/leanprover-community/mathlib4 (Lean's mathematical library).
  3. For AI-assisted proving: install Lean Copilot (VS Code extension) — uses a local or remote LLM to suggest proof steps.
  4. Write a simple theorem: theorem add_comm (a b : Nat) : a + b = b + a := by { ... } — place cursor after by, Lean Copilot suggests the induction + rewrite steps.
  5. First AI-assisted proof in <30 minutes of setup — you need basic Lean syntax knowledge (1-2 hours of learning).
  6. For stronger proving models: DeepSeek Prover V2 can be run locally via Ollama/VLLM and called from Lean via the Lean REPL + LLM bridge.
  7. Alternative: Coq + CoqPilot (VS Code extension) for Coq-based formal verification.

The cheap setup

Theorem proving is CPU-bound and RAM-light. Lean 4 + mathlib4 runs on any $300 laptop (Ryzen 5/Intel i5 + 16 GB RAM). The proofs themselves compile in milliseconds. For AI-assisted proving on a budget: use a cloud API (DeepSeek API, $0.50 per 1M tokens) for proof suggestions, or run a distilled reasoning model (DeepSeek R1 Distill 7B) on a used GTX 1060 6 GB ($60). The LLM is a suggestion engine — the proof checker (Lean kernel) is the authority and it's computationally trivial. $300 + free cloud API tier is genuinely viable.

The serious setup

Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Runs DeepSeek Prover V2 locally — the strongest open-weight theorem proving model. Generates full Lean proofs for undergraduate-to-graduate-level mathematics. Pair with Ryzen 7 7700X + 64 GB DDR5 + 1TB NVMe. Total: ~$1,800-2,200. For research-grade proving (IMO/IMO-level problems): the field is still dominated by closed-source frontier models. But DeepSeek Prover V2 + Lean Copilot on an RTX 3090 handles most undergraduate pure math problems. Formal verification (not proof discovery) runs on CPU alone.

Common beginner mistake

The mistake: Expecting an LLM to "auto-prove" a theorem without learning Lean or Coq syntax first. Why it fails: LLMs generate proof text, but you need to understand the proof assistant's error messages to iterate. The model says rw [add_comm] — if Lean rejects it, you can't fix it without knowing what rw does. Theorem proving with AI is a collaboration, not automation. The fix: Spend 2-4 hours learning basic Lean syntax (Natural Number Game is the canonical intro — lean-lang.org/nng). Learn what intro, apply, rw, induction, cases do. Then the LLM becomes a powerful autocomplete for proofs rather than a black box you can't debug. The LLM's job is suggesting steps, not guaranteeing correctness.

Recommended setup for theorem proving

Recommended hardware
Best GPU for local AI →
All workloads ranked across VRAM tiers.
Recommended runtimes

Browse all tools for runtimes that fit this workload.

Budget build
AI PC under $1,000 →
Best GPU for this task
Best GPU for local AI →

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

What breaks first

The errors most operators hit when running theorem proving locally. Each links to a diagnose+fix walkthrough.

  • CUDA out of memory →
  • Model keeps crashing →
  • Ollama running slow →
  • llama.cpp too slow →

Before you buy

Verify your specific hardware can handle theorem proving before committing money.

  • Will it run on my hardware? →
  • Custom compatibility check →
  • GPU recommender (4 questions) →
Buyer guides
  • Best GPU for local AI →
  • Best laptop for local AI →
  • Best Mac for local AI →
  • Best used GPU for local AI →
  • Will it run on my hardware? →
Compare hardware
  • Curated head-to-heads →
  • Custom comparison tool →
  • RTX 4090 vs RTX 5090 →
  • RTX 3090 vs RTX 4090 →
Troubleshooting
  • CUDA out of memory →
  • Ollama running slowly →
  • ROCm not detected →
  • Model keeps crashing →
Specialized buyer guides
  • GPU for ComfyUI (image-gen) →
  • GPU for KoboldCpp (RP/long-context) →
  • GPU for AI agents →
  • GPU for local OCR →
  • GPU for voice cloning →
  • Upgrade from RTX 3060 →
  • Beginner setup →
  • AI PC for students →
Updated 2026 roundup
  • Best free local AI tools (2026) →