HOW-TO · INF
How to enable and configure speculative decoding for faster generation
PREREQUISITES
vLLM or llama.cpp with speculative decoding support
What this does
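Speculative decoding uses a small "draft" model to propose several tokens ahead, which the large "target" model then verifies in a single parallel forward pass. Matching tokens are accepted; at the first rejected token, the target's own prediction is used instead. Because the target checks every emitted token, the output follows the target model's distribution, so quality is preserved. Speedups can reach 1.5-3x, but the actual gain depends on how often draft tokens are accepted and on how cheap the draft model is relative to the target.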
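The propose-and-verify loop can be sketched in a few lines for the greedy case. This is a toy illustration, not how vLLM or llama.cpp implement it: `target` and `draft` are hypothetical callables that map a token list to the next token, and the verification loop stands in for what a real engine does in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical stand-ins: each "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]

def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one propose-and-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(tokens + proposal))

    # 2. The target checks each position in order. A real engine scores all
    #    k positions in one batched pass; this loop only mimics the result.
    accepted: List[int] = []
    for guess in proposal:
        expected = target(tokens + accepted)
        if guess != expected:
            accepted.append(expected)  # keep the target's token and stop
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the target contributes one extra token.
    accepted.append(target(tokens + accepted))
    return accepted

def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Generate max_new tokens by repeating speculative rounds."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(speculative_step(target, draft, tokens, k))
    return tokens[: len(prompt) + max_new]
```

Each round emits at least one token chosen by the target, so the result matches plain greedy decoding with the target alone; the draft only changes how many target passes are needed.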
Steps
Enable speculative decoding in vLLM with a draft model.
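    from vllm import LLM, SamplingParams

    # Note: these keyword arguments follow older vLLM releases. Newer releases
    # take the same settings through a single speculative_config dict, so check
    # the documentation for your installed version.
    llm = LLM(
        model="meta-llama/Llama-3.2-3B",              # target
        speculative_model="meta-llama/Llama-3.2-1B",  # draft
        num_speculative_tokens=5,
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate("Explain quantum computing", params)
    print(outputs[0].outputs[0].text)
Configure speculative decoding via vLLM API server.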
    # Same target/draft pair as the Python example
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --speculative-model meta-llama/Llama-3.2-1B \
        --num-speculative-tokens 5 \
        --speculative-draft-tensor-parallel-size 1
Use speculative decoding in llama.cpp.
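    # Draft model on GPU, target model on GPU
    # (-ngl / -ngld set how many layers of the target / draft model are offloaded)
    ./llama-speculative \
        -m target-model.gguf \
        -md draft-model.gguf \
        -ngl 99 -ngld 99 \
        -p "Write a Python function" \
        -n 256 \
        --draft 5
Benchmark speculative decoding against standard decoding.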
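    import time

    from vllm import SamplingParams

    # Assumes standard_llm and spec_llm are vllm.LLM instances created beforehand,
    # without and with the speculative settings from the first step.
    # Greedy sampling with a fixed token budget keeps the two runs comparable.
    params = SamplingParams(temperature=0.0, max_tokens=256)

    def benchmark(llm, prompt, runs=5):
        llm.generate(prompt, params)  # warm-up run, not timed
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            llm.generate(prompt, params)
            times.append(time.perf_counter() - start)
        return sum(times) / len(times)

    standard = benchmark(standard_llm, "Explain AI")  # standard decoding
    spec = benchmark(spec_llm, "Explain AI")          # speculative decoding
    print(f"Standard: {standard:.2f}s, Speculative: {spec:.2f}s, Speedup: {standard/spec:.2f}x")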
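Wall-clock time per request can mislead when the two runs generate different numbers of tokens, so it may help to compare throughput instead. The helper below is a minimal sketch that assumes you collect `(elapsed_seconds, generated_tokens)` pairs yourself; in vLLM the token count should be available as `len(outputs[0].outputs[0].token_ids)`. The function names are illustrative.

```python
from statistics import median
from typing import List, Tuple

Run = Tuple[float, int]  # (elapsed_seconds, generated_tokens)

def tokens_per_second(runs: List[Run]) -> float:
    """Median throughput across runs; the median damps one-off stalls."""
    return median(n_tokens / seconds for seconds, n_tokens in runs)

def throughput_speedup(standard_runs: List[Run], spec_runs: List[Run]) -> float:
    """How many times faster the speculative runs generate tokens."""
    return tokens_per_second(spec_runs) / tokens_per_second(standard_runs)
```

A result above 1.0 means speculative decoding generated tokens faster; a result below 1.0 means the draft model cost more than it saved.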
Verification
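# Expected: with greedy decoding (temperature 0), speculative decoding produces the same output as standard decoding,
# apart from rare differences caused by floating-point numerics.
# With sampling (temperature > 0), the output distribution is the same, but individual samples may differ even with the same seed.
# Latency should be 1.5-3x lower when the draft acceptance rate is high.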
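One way to check this is to generate with greedy decoding in both modes and compare the token IDs. The helper below is a sketch; `first_divergence` is an illustrative name, and the two sequences are assumed to be lists of token IDs taken from the standard and speculative runs.

```python
from typing import Optional, Sequence

def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

A return value of `None` confirms the outputs match. A divergence late in a long generation is more likely a numerics effect than a configuration error.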
Common failures
- Draft model too large: The draft model must be significantly smaller than the target (3-5x fewer parameters); otherwise drafting costs more time than verification saves.
- Low acceptance rate: If the draft model is too different from the target, few draft tokens are accepted. Use a draft model from the same model family.
- VRAM exhaustion: Both models are loaded at the same time. Use quantized versions or make sure enough VRAM is available (the draft model adds roughly 20-40% overhead).
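The first two failures can be estimated before benchmarking with a commonly used simplified model. It assumes each draft token is accepted independently with probability `alpha`, that `k` tokens are drafted per round, and that one draft forward pass costs `cost_ratio` times one target pass. All three names are assumptions for illustration, and real acceptance rates vary by prompt.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over standard decoding under the simplified model."""
    # Each round costs k draft passes plus one target pass, in target-pass units.
    round_cost = k * cost_ratio + 1
    return expected_tokens_per_round(alpha, k) / round_cost
```

For example, `estimated_speedup(0.8, 5, 0.1)` gives roughly 2.5, while a low acceptance rate or an expensive draft model pushes the estimate below 1.0, meaning speculative decoding would be slower than standard decoding.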