What this does

Speculative decoding has several knobs: the number of draft tokens, the choice of draft model (which drives the acceptance rate), and the draft/target memory split. This guide tunes them systematically to maximize speedup on your hardware.

Steps

Tune num_speculative_tokens (draft length). Sweep a range of values, for example 3 to 20, and measure throughput at each one.

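import gc
import time

from vllm import LLM, SamplingParams

# The speculative_model / num_speculative_tokens keyword arguments follow the API
# used in this guide; newer vLLM releases may expect a speculative_config dict instead.
llm = None
for draft_len in [3, 5, 7, 10, 15, 20]:
    # Drop the previous engine before loading the next one to help release GPU memory.
    del llm
    gc.collect()
    llm = LLM(
        model="meta-llama/Llama-3.2-3B",
        speculative_model="meta-llama/Llama-3.2-1B",
        num_speculative_tokens=draft_len,
    )
    start = time.perf_counter()
    outputs = llm.generate("Write a story", SamplingParams(max_tokens=512))
    elapsed = time.perf_counter() - start
    # Count the tokens actually generated; the model may stop before max_tokens.
    n_tokens = len(outputs[0].outputs[0].token_ids)
    print(f"Draft={draft_len}: {n_tokens / elapsed:.0f} tok/s")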

Expected: Throughput increases with draft length then plateaus. The plateau point depends on acceptance rate.
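
The plateau can be anticipated with a common back-of-envelope model: if each draft token is accepted independently with probability alpha, one target verification pass over k draft tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation. The sketch below is only an illustration under that independence assumption; `expected_tokens_per_step` is a hypothetical helper, not part of vLLM, and it ignores the cost of running the draft model.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha (a simplification; real acceptance is correlated).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # Every draft token is accepted, plus the one token the target adds.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.4, 0.7, 0.9):
    row = ", ".join(
        f"k={k}: {expected_tokens_per_step(alpha, k):.2f}" for k in (3, 5, 10, 20)
    )
    print(f"alpha={alpha}: {row}")
```

Under this model the value can never exceed 1 / (1 - alpha), so at low acceptance rates a longer draft quickly stops paying for itself, while at high acceptance rates longer drafts keep helping for longer.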

Measure acceptance rate to diagnose draft quality.

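# Assumes `llm` was created with a speculative model, as in the previous step.
# The accepted_tokens / drafted_tokens field names are assumptions and may not exist
# in your vLLM version; if they are missing, check the speculative decoding
# statistics the engine reports in its logs or metrics endpoint instead.
outputs = llm.generate("Explain gravity", SamplingParams(max_tokens=128, temperature=0))
stats = outputs[0].metrics
accepted = getattr(stats, "accepted_tokens", None)
drafted = getattr(stats, "drafted_tokens", None)
if accepted is None or not drafted:
    print("Acceptance statistics are not available on this output object")
else:
    print(f"Acceptance rate: {accepted}/{drafted} = {accepted/drafted*100:.1f}%")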

High acceptance (>70%): draft model is well-matched. Low acceptance (<40%): try a larger or better-aligned draft model.
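
As a small illustration, the thresholds above can be wrapped in a helper that turns raw counts into a recommendation. `diagnose_acceptance` is a hypothetical name, and the advice for the 40% to 70% band is an assumption added here, since the guide only covers the two extremes.

```python
def diagnose_acceptance(accepted: int, drafted: int) -> str:
    """Map accepted/drafted token counts onto the guide's rough thresholds."""
    if drafted <= 0:
        return "no draft tokens recorded"
    rate = accepted / drafted
    if rate > 0.70:
        return f"{rate:.1%}: draft model is well-matched"
    if rate < 0.40:
        return f"{rate:.1%}: try a larger or better-aligned draft model"
    # Middle band: not covered by the guide; this suggestion is an assumption.
    return f"{rate:.1%}: borderline, consider a shorter draft length"


print(diagnose_acceptance(90, 120))
print(diagnose_acceptance(30, 100))
```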

Adjust draft/target memory split for limited VRAM. On a 24 GB GPU running both models:

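# Use a quantized draft model to save VRAM.
# Flag names and accepted quantization values differ between vLLM releases;
# check `python -m vllm.entrypoints.openai.api_server --help` for your version.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B \
    --speculative-model unsloth/Llama-3.2-1B-GGUF \
    --speculative-model-quantization q4_k_m \
    --num-speculative-tokens 5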
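
To see how much the draft model's precision matters, a rough weights-only estimate is often enough. The sketch below is an approximation with illustrative numbers: a target of about 8 billion parameters in 16-bit, a draft of about 1 billion parameters, a 24 GiB card, and an assumed 4.5 bits per parameter for a 4-bit quantized draft. Both helper names are hypothetical, and activations and framework overhead are not modeled.

```python
def weights_gib(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB (weights only)."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30


def kv_cache_budget_gib(total_gib, target_gib, draft_gib, headroom=0.20):
    """Memory left for the KV cache after weights and a safety headroom."""
    return total_gib * (1 - headroom) - target_gib - draft_gib


target = weights_gib(8, 16)        # illustrative 8B target in 16-bit
draft_fp16 = weights_gib(1, 16)    # illustrative 1B draft in 16-bit
draft_q4 = weights_gib(1, 4.5)     # assumed ~4.5 bits/param for a 4-bit quant

for label, draft in (("16-bit draft", draft_fp16), ("4-bit draft", draft_q4)):
    budget = kv_cache_budget_gib(24, target, draft)
    print(f"{label}: weights {target + draft:.1f} GiB, KV cache budget {budget:.1f} GiB")
```

A negative budget means the configuration does not fit with the chosen headroom, which is the signal to quantize further or pick a smaller draft model.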

Set --speculative-draft-tensor-parallel-size for multi-GPU setups.

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B \
    --tensor-parallel-size 4 \
    --speculative-model meta-llama/Llama-3.1-8B \
    --num-speculative-tokens 5 \
    --speculative-draft-tensor-parallel-size 1

Keeping the draft model on a single GPU avoids inter-GPU communication during drafting, where the small per-step compute would otherwise be dominated by synchronization.

Run a sweep across all parameter combinations.

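import time

from vllm import LLM, SamplingParams

target_model = "meta-llama/Llama-3.2-3B"  # model pair from the first step
draft_model = "meta-llama/Llama-3.2-1B"

def benchmark(llm, prompt="Write a story", max_tokens=512):
    # Tokens per second for one request, counting tokens actually generated.
    start = time.perf_counter()
    out = llm.generate(prompt, SamplingParams(max_tokens=max_tokens))
    return len(out[0].outputs[0].token_ids) / (time.perf_counter() - start)

results = []
for draft_len in [3, 5, 7, 10]:
    for draft_q in [None, "q4_k_m", "q8_0"]:
        llm = LLM(
            model=target_model,
            speculative_model=draft_model,
            num_speculative_tokens=draft_len,
            speculative_model_quantization=draft_q,
        )
        tps = benchmark(llm)
        results.append((draft_len, draft_q, tps))
for r in results:
    print(f"draft={r[0]}, q={r[1]}, tps={r[2]:.1f}")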

Verification

# Expected output: A table showing the best (draft_length, quantization) pair for your hardware
# Example: Best config: draft=7, q4_k_m, 85 tok/s (vs 35 tok/s without speculation)
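
One way to produce that table is sketched below. It assumes a list of (draft_length, quantization, tokens_per_second) tuples like the one built by the sweep, plus a baseline throughput measured separately without a speculative model. `summarize` is a hypothetical helper, and the numbers in `sample` are placeholders for illustration, not measurements.

```python
def summarize(results, baseline_tps):
    """Rank sweep results and format a table with speedup over the baseline."""
    if not results:
        raise ValueError("results is empty")
    ranked = sorted(results, key=lambda r: r[2], reverse=True)
    lines = [f"{'draft':>5}  {'quant':<8}  {'tok/s':>7}  {'speedup':>7}"]
    for draft_len, quant, tps in ranked:
        lines.append(
            f"{draft_len:>5}  {str(quant):<8}  {tps:>7.1f}  {tps / baseline_tps:>6.2f}x"
        )
    best = ranked[0]
    lines.append(
        f"Best config: draft={best[0]}, q={best[1]}, {best[2]:.0f} tok/s "
        f"(vs {baseline_tps:.0f} tok/s without speculation)"
    )
    return best, "\n".join(lines)


# Placeholder values only; replace with the output of your own sweep.
sample = [(3, None, 61.0), (5, "q4_k_m", 78.5), (7, "q4_k_m", 85.0), (10, "q8_0", 72.3)]
best, table = summarize(sample, baseline_tps=35.0)
print(table)
```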

Common failures

No speedup at low draft lengths: Fewer than 3 draft tokens adds drafting overhead without giving the target model enough tokens to verify per pass. Start the sweep at 3 to 5 tokens.
Draft model too large, leaving less memory for the KV cache: The draft model shares VRAM with the target. Leave at least 20% VRAM headroom.
Determinism differs: With non-zero temperature, speculative decoding can produce different outputs than standard decoding even with the same seed, because the accept/reject step consumes random numbers differently. The output distribution is intended to match; individual samples are not.

How to tune the speculation parameters for your hardware setup
