RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Large language models

Speculative Decoding

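Speculative decoding speeds up LLM inference by using a small, fast "draft" model to propose the next several tokens, then verifying them all in parallel with the large "target" model. When the draft is right (which it often is for routine tokens), several tokens are accepted per target pass and speedups of 2-4× are possible in the best case. When the draft is wrong, the target keeps the tokens up to the first mismatch, substitutes its own token there, and drafting resumes from that point, so output quality stays that of the target model alone.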
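The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal greedy-decoding sketch, not any engine's actual implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and the target is called once per position here for clarity, whereas a real engine scores all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int = 3) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (generated tokens, target passes used)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. We count this as ONE
        #    pass, which is what a batched forward pass would cost in practice.
        target_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            # Every draft token matched, so the same pass yields one bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new], target_passes


def autoregressive_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out[len(prompt):]


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees with it everywhere except right after a 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(toy_target, toy_draft, [0], max_new=12, k=3)
baseline = autoregressive_decode(toy_target, [0], max_new=12)
print(tokens == baseline, passes)
```

The speculative output is identical to the baseline, but it needs far fewer target passes than the 12 the baseline uses. With sampling instead of greedy decoding, real implementations use a probabilistic accept/reject rule rather than the exact-match check shown here.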

The key insight: verifying N drafted tokens takes the target model a single forward pass, while generating them autoregressively takes N sequential passes. The draft model adds some compute, but as long as enough of its guesses are accepted, the saved target-model passes outweigh that cost.
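A rough cost model makes this trade-off concrete. The sketch below rests on simplifying assumptions that are not from this page: each drafted token is accepted independently with probability `alpha`, one draft step costs `draft_cost` times a target pass, and verifying `k` tokens costs the same as a single target pass. The parameter values in the example are illustrative, not measurements.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    alpha. The sum 1 + alpha + ... + alpha**k counts the accepted prefix
    plus the one token the target always contributes itself.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain autoregressive decoding under the model above.

    One round costs k draft steps plus one target pass, measured in units
    of a single target pass.
    """
    round_cost = k * draft_cost + 1.0
    return expected_tokens_per_pass(alpha, k) / round_cost


# Illustrative numbers only: 80% acceptance, 4 drafted tokens,
# draft model at 5% of the target's per-pass cost.
print(round(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05), 2))  # 2.8
```

Under this model a draft that is never accepted (`alpha = 0`) makes decoding slower than baseline, because every round still pays for the draft steps.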

For local AI: pair a roughly 1B draft model with a 7B-70B target model from the same family (same tokenizer, similar training). llama.cpp supports this via --model-draft (short form -md), and vLLM via --speculative-model in older releases or --speculative-config in newer ones. Real-world speedups are typically 1.5-3× depending on workload: code completion benefits most, creative writing least.
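As a hedged illustration of the pairing, the helper below assembles a llama.cpp server command line with a target and a draft model. It assumes the `llama-server` binary is on your PATH and uses `-m` for the target and `-md` (the short form of llama.cpp's draft-model option) for the draft. The function name and the `.gguf` file names are hypothetical placeholders, and flag names can change between releases, so check `llama-server --help` for your build.

```python
import shlex
from typing import List


def llama_server_command(target_gguf: str, draft_gguf: str) -> List[str]:
    """Build an argv list that loads a target model plus a draft model.

    Only two flags are used: -m (target model) and -md (draft model).
    The paths passed in are placeholders; point them at your own files.
    """
    return ["llama-server", "-m", target_gguf, "-md", draft_gguf]


# Hypothetical file names for a same-family pair (same tokenizer).
cmd = llama_server_command("target-70b.gguf", "draft-1b.gguf")
print(shlex.join(cmd))
```

The resulting list can be passed straight to `subprocess.run(cmd)` once the placeholder paths are replaced with real model files.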

Related terms

Inference