How to tune the speculation parameters for your hardware setup
Prerequisites
Speculative decoding enabled, benchmark tools
What this does
Speculative decoding has several knobs: number of draft tokens, acceptance threshold, and draft/target memory split. This guide systematically tunes these for maximum speedup on your hardware.
Steps
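Tune num_speculative_tokens (draft length). Start around 5 and test a range of values from 3 up to 20.

    import gc
    import time

    from vllm import LLM, SamplingParams

    # NOTE: speculative_model / num_speculative_tokens are the keyword arguments used in
    # this guide; newer vLLM releases may expect them inside a speculative_config dict.
    for draft_len in [3, 5, 7, 10, 15, 20]:
        llm = LLM(
            model="meta-llama/Llama-3.2-3B",
            speculative_model="meta-llama/Llama-3.2-1B",
            num_speculative_tokens=draft_len,
        )
        start = time.perf_counter()
        outputs = llm.generate("Write a story", SamplingParams(max_tokens=512))
        elapsed = time.perf_counter() - start
        # Count the tokens actually generated; generation can stop before max_tokens.
        n_tokens = len(outputs[0].outputs[0].token_ids)
        print(f"Draft={draft_len}: {n_tokens / elapsed:.0f} tok/s")
        # Release the engine before loading the next configuration.
        del llm
        gc.collect()

Expected: Throughput increases with draft length, then plateaus or drops off. The plateau point depends on the acceptance rate.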
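To see why throughput levels off, a rough model helps. The sketch below uses the standard simplifying assumption that each draft token is accepted independently with probability `alpha`; the function names and the `draft_cost` ratio (cost of one draft pass relative to one target pass) are illustrative assumptions, not vLLM settings.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost is an assumed ratio: the cost of one draft forward pass
    divided by the cost of one target forward pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + 1.0)


if __name__ == "__main__":
    # Print the estimate for a low and a high acceptance rate.
    for alpha in (0.5, 0.8):
        for k in (3, 5, 10, 20):
            s = estimated_speedup(alpha, k, draft_cost=0.1)
            print(f"alpha={alpha} k={k:2d} estimated speedup ~{s:.2f}x")
```

Under this model, a low acceptance rate makes long drafts counterproductive, because the cost of drafting grows linearly while the expected number of accepted tokens saturates. Treat it as a guide to which draft lengths are worth benchmarking, not as a replacement for measurement.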
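Measure acceptance rate to diagnose draft quality.

    # `llm` is an LLM instance created with speculative decoding, as in the previous step.
    # NOTE: the metrics.accepted_tokens / metrics.drafted_tokens field names are the ones
    # this guide uses; confirm them against your vLLM version, which may report
    # speculative-decoding statistics through its logs or metrics endpoint instead.
    output = llm.generate("Explain gravity", SamplingParams(max_tokens=128, temperature=0))
    stats = output[0].metrics
    accepted = stats.accepted_tokens
    drafted = stats.drafted_tokens
    print(f"Acceptance rate: {accepted}/{drafted} = {accepted / drafted * 100:.1f}%")

High acceptance (>70%): the draft model is well matched. Low acceptance (<40%): try a larger or better-aligned draft model.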
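The thresholds above can be wrapped in a small helper so that a sweep script labels each run automatically. This is a minimal sketch: the function names are hypothetical, and the "borderline" label for the 40% to 70% band is an assumption, since the guide only defines the two outer bands.

```python
def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of proposed draft tokens that the target model accepted."""
    if drafted <= 0:
        raise ValueError("no draft tokens were proposed")
    return accepted / drafted


def diagnose(rate: float) -> str:
    """Map an acceptance rate to the bands used in this guide."""
    if rate > 0.70:
        return "well matched"
    if rate < 0.40:
        return "poorly matched: try a larger or better-aligned draft model"
    # Assumption: the guide does not name the middle band.
    return "borderline"


if __name__ == "__main__":
    rate = acceptance_rate(accepted=96, drafted=128)
    print(f"{rate * 100:.1f}% -> {diagnose(rate)}")
```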
Adjust the draft/target memory split for limited VRAM. On a 24 GB GPU running both models:

    # Use a quantized draft model to save VRAM.
    # NOTE: the quantization value must be a method your vLLM build accepts (check --help);
    # q4_k_m is kept as written in this guide.
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-8B \
        --speculative-model unsloth/Llama-3.2-1B-GGUF \
        --speculative-model-quantization q4_k_m \
        --num-speculative-tokens 5

Set speculative-draft-tensor-parallel-size for multi-GPU.

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-70B \
        --tensor-parallel-size 4 \
        --speculative-model meta-llama/Llama-3.1-8B \
        --speculative-draft-tensor-parallel-size 1

Keeping the draft model on a single GPU avoids inter-GPU communication overhead during the draft passes.
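For the single-GPU memory split, a back-of-the-envelope estimate shows how much VRAM is left once both sets of weights are loaded. The sketch below counts weights only and ignores activations and runtime overhead; the helper names and the bits-per-parameter figures (16 for half precision, about 4 for a 4-bit quantized draft) are rough assumptions.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8.0


def headroom_fraction(total_gb: float, target_gb: float, draft_gb: float) -> float:
    """Fraction of VRAM left for KV cache and overhead after loading both models."""
    return (total_gb - target_gb - draft_gb) / total_gb


if __name__ == "__main__":
    total = 24.0                      # GPU memory in GB
    target = weights_gb(8, 16)        # 8B target in half precision
    for label, bits in (("fp16 draft", 16), ("4-bit draft", 4)):
        draft = weights_gb(1, bits)   # 1B draft model
        free = headroom_fraction(total, target, draft)
        print(f"{label}: draft={draft:.1f} GB, headroom={free * 100:.0f}%")
```

On these assumptions, quantizing a 1B draft saves about 1.5 GB, which goes straight to the KV cache. The saving matters more as the draft model gets larger.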
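Run a sweep across all parameter combinations.

    import gc
    import time

    from vllm import LLM, SamplingParams

    target_model = "meta-llama/Llama-3.2-3B"
    draft_model = "meta-llama/Llama-3.2-1B"

    def benchmark(llm, max_tokens=512):
        """Return generated tokens per second for one fixed prompt."""
        start = time.perf_counter()
        out = llm.generate("Write a story", SamplingParams(max_tokens=max_tokens, temperature=0))
        return len(out[0].outputs[0].token_ids) / (time.perf_counter() - start)

    results = []
    for draft_len in [3, 5, 7, 10]:
        # Quantization values as written in this guide; check which your vLLM build accepts.
        for draft_q in [None, "q4_k_m", "q8_0"]:
            llm = LLM(
                model=target_model,
                speculative_model=draft_model,
                num_speculative_tokens=draft_len,
                speculative_model_quantization=draft_q,
            )
            results.append((draft_len, draft_q, benchmark(llm)))
            # Release the engine before the next configuration; a fresh process per
            # configuration is more reliable if memory is not fully freed.
            del llm
            gc.collect()

    for draft_len, draft_q, tps in results:
        print(f"draft={draft_len}, q={draft_q}, tps={tps:.0f}")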
Verification
Expected output: one line per (draft length, quantization) combination, from which you can pick the best pair for your hardware.
Example (illustrative numbers): Best config: draft=7, q4_k_m, 85 tok/s (vs 35 tok/s without speculation)
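Picking the winner from the sweep can be automated. The sketch below assumes `results` is a list of `(draft_len, quantization, tokens_per_second)` tuples and that you have measured a baseline without speculation; apart from the example figures quoted above (85 and 35 tok/s), the numbers are placeholders, not measurements.

```python
def best_config(results):
    """Return the (draft_len, quantization, tps) tuple with the highest throughput."""
    return max(results, key=lambda r: r[2])


def speedup(tps: float, baseline_tps: float) -> float:
    """Throughput ratio relative to decoding without speculation."""
    return tps / baseline_tps


if __name__ == "__main__":
    # Placeholder sweep results; replace with the output of your own sweep.
    results = [
        (3, None, 52.0),
        (7, "q4_k_m", 85.0),
        (10, "q8_0", 71.0),
    ]
    baseline_tps = 35.0  # measured separately, with speculative decoding disabled

    draft_len, quant, tps = best_config(results)
    print(f"Best config: draft={draft_len}, {quant}, {tps:.0f} tok/s "
          f"({speedup(tps, baseline_tps):.2f}x vs {baseline_tps:.0f} tok/s without speculation)")
```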
Common failures
- No speedup at low draft lengths: fewer than 3 draft tokens often adds overhead without verifying enough tokens per step to pay for it. Start at 3 to 5 and confirm with the sweep.
- Draft model too large reduces memory for KV cache: The draft model shares VRAM with the target. Leave at least 20% VRAM headroom.
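- Determinism differs: With non-zero temperature, speculative decoding can produce different outputs than standard decoding even with the same seed, because it consumes random numbers differently. Compare runs at temperature=0, where outputs should match apart from small floating-point differences.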
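To check the determinism point in practice, compare the token IDs from a speculative run and a standard run of the same prompt at temperature=0 (for example, `outputs[0].outputs[0].token_ids` from each run). The helper below is a hypothetical utility that reports where two sequences first disagree; the token lists in the usage example are made up.

```python
def first_divergence(a, b):
    """Return the index of the first differing token, or None if the sequences match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


if __name__ == "__main__":
    standard = [101, 7, 42, 9, 3]       # made-up token IDs
    speculative = [101, 7, 42, 11, 3]   # made-up token IDs
    idx = first_divergence(standard, speculative)
    print("identical" if idx is None else f"outputs diverge at token {idx}")
```

A late divergence at temperature=0 usually points to numerical differences rather than a configuration problem.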