What this does

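Speculative decoding uses a small "draft" model to propose several tokens ahead, which the large "target" model then verifies in a single parallel forward pass. Tokens the target agrees with are kept; at the first rejection the target's own token is used and the remaining draft tokens are discarded. With standard rejection-sampling verification, the output distribution matches that of the target model, so quality is preserved. Speedups of roughly 1.5-3x are achievable when the draft's proposals are accepted often; the actual gain depends on the model pair, hardware, and batch size.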
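
The draft-then-verify loop can be sketched with toy models. This is a minimal illustration of the greedy variant only, not how vLLM or llama.cpp implement it: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and a real engine verifies all positions in one batched forward pass and uses rejection sampling when temperature is above zero.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to the next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(tokens + proposed))

    # 2. The target checks every position. A real engine does this in one
    #    batched forward pass; the loop here only mimics the result.
    accepted: List[int] = []
    for guess in proposed:
        expected = target(tokens + accepted)
        if guess != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(guess)

    # 3. All k guesses matched, so the target contributes one bonus token.
    accepted.append(target(tokens + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    # Toy rule: the next token is the previous one plus 1, modulo 10.
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except after a 4, where it guesses wrong.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    # Three draft tokens are accepted, the fourth is corrected by the target.
    print(speculative_step(toy_target, toy_draft, [1], k=5))  # [2, 3, 4, 5]
```

In this toy run one target round yields four tokens instead of one, which is where the speedup comes from; the result is the same sequence the target would have produced on its own.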

Steps

Enable speculative decoding in vLLM with a draft model.

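from vllm import LLM, SamplingParams

# Note: speculative_model / num_speculative_tokens are the keyword arguments
# of older vLLM releases. Newer releases take a speculative_config dict
# instead, so check the documentation for your installed version.
llm = LLM(
    model="meta-llama/Llama-3.2-3B",              # target
    speculative_model="meta-llama/Llama-3.2-1B",  # draft (same family and tokenizer)
    num_speculative_tokens=5,                     # draft tokens proposed per step
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate("Explain quantum computing", params)
print(outputs[0].outputs[0].text)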

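Configure speculative decoding via the vLLM API server.

# Flags as used by older vLLM releases; newer releases group these
# settings under a single --speculative-config option.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B \
    --speculative-model meta-llama/Llama-3.2-1B \
    --num-speculative-tokens 5 \
    --speculative-draft-tensor-parallel-size 1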
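
Once the server is running, clients need no changes: speculative decoding happens entirely on the server side. The sketch below builds a request for the OpenAI-compatible completions route using only the standard library. It assumes the default address `http://localhost:8000`; the helper names `build_completion_request` and `complete` are illustrative, not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(prompt: str, model: str,
                             base_url: str = "http://localhost:8000",
                             max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, model: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Explain quantum computing", "meta-llama/Llama-3.2-3B")` would return the generated text; the `model` argument must match the name passed to `--model` when the server was started.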

Use speculative decoding in llama.cpp.

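# Offload both models to the GPU: -ngl sets the target's GPU layers,
# -ngld the draft's. Lower the layer counts if VRAM is tight.
./llama-speculative \
    -m target-model.gguf \
    -md draft-model.gguf \
    -ngl 99 \
    -ngld 99 \
    -p "Write a Python function" \
    -n 256 \
    --draft 5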

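Benchmark speculative decoding against standard decoding.

import time

from vllm import SamplingParams

# Greedy sampling and a fixed token budget so both runs do comparable work.
# Without explicit params, vLLM's default max_tokens is too small to measure.
params = SamplingParams(temperature=0.0, max_tokens=256)

def benchmark(llm, prompt, runs=5):
    llm.generate(prompt, params)  # warm-up run, not timed
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        llm.generate(prompt, params)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# standard_llm: an LLM(...) built without a draft model.
# spec_llm: an LLM(...) built with a draft model, as in the first step.
# If both do not fit in VRAM together, benchmark each in a separate process.
standard = benchmark(standard_llm, "Explain AI")
spec = benchmark(spec_llm, "Explain AI")
print(f"Standard: {standard:.2f}s, Speculative: {spec:.2f}s, Speedup: {standard/spec:.2f}x")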
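
Wall-clock time alone can mislead if the two runs generate different numbers of tokens, so it can help to report throughput as well. The helper below is a sketch that assumes each result object exposes `.outputs[0].token_ids`, as vLLM's `RequestOutput` does; `tokens_per_second` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins so the example runs without a GPU.

```python
from types import SimpleNamespace


def tokens_per_second(request_outputs, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock time.

    Assumes each item looks like a vLLM RequestOutput, i.e. it has
    .outputs[0].token_ids holding the generated token ids.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    generated = sum(len(r.outputs[0].token_ids) for r in request_outputs)
    return generated / elapsed_seconds


# Stand-in result: one request that produced 128 tokens.
fake = [SimpleNamespace(outputs=[SimpleNamespace(token_ids=list(range(128)))])]
print(tokens_per_second(fake, 2.0))  # 64.0
```

In a real benchmark, pass the list returned by `llm.generate(...)` together with the elapsed time measured around that call.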

Verification

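# Expected: with greedy decoding (temperature=0), speculative and standard decoding
# produce the same tokens, apart from rare differences caused by floating-point non-determinism.
# With temperature > 0, individual samples differ but follow the same distribution.
# Latency should be lower, by roughly 1.5-3x when most draft tokens are accepted.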
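
One way to check output equivalence under greedy decoding is to compare the generated token ids rather than the decoded text. The function below is a small sketch; `first_divergence` is a hypothetical helper, not a vLLM API.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


print(first_divergence([5, 6, 7], [5, 6, 7]))  # None
print(first_divergence([5, 6, 7], [5, 9, 7]))  # 1
```

With vLLM, the two inputs would be `outputs[0].outputs[0].token_ids` from the standard and the speculative run, both generated with `temperature=0`. A divergence late in a long output usually points to numerical noise rather than a configuration problem.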

Common failures

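Draft model too large: The draft runs once per proposed token, so it must be much cheaper than the target. As a rough guide, use a draft with at least 3-5x fewer parameters; otherwise drafting costs more than verification saves.
Low acceptance rate: If the draft's predictions differ too much from the target's, few draft tokens are accepted and most drafting work is wasted. Use a draft model from the same model family, with the same tokenizer.
VRAM exhaustion: Both models are resident in GPU memory at the same time. Use quantized versions or ensure adequate VRAM (the draft model typically adds ~20-40% on top of the target's weights).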
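
The first two failure modes can be estimated before running anything. The sketch below uses the common simplifying assumption that each draft token is accepted independently with probability `alpha`; `cost_ratio` is the time of one draft forward pass divided by one target forward pass. All function names are illustrative, and real acceptance rates vary by prompt, so treat the numbers as rough guidance only.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass with k draft tokens."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Tokens per step divided by the step's cost in target-pass units."""
    # One step costs k draft passes plus one target verification pass.
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)


def weight_memory_overhead(target_params_b: float, draft_params_b: float) -> float:
    """Draft weights as a fraction of target weights, assuming the same dtype."""
    return draft_params_b / target_params_b


# High acceptance, cheap draft: a clear win (about 2.46x).
print(round(estimated_speedup(alpha=0.8, k=5, cost_ratio=0.1), 2))
# Low acceptance: slower than standard decoding (about 0.95x).
print(round(estimated_speedup(alpha=0.3, k=5, cost_ratio=0.1), 2))
# A 1B draft next to a 3B target adds about a third more weight memory.
print(round(weight_memory_overhead(3.0, 1.0), 2))
```

A result below 1.0 means speculation is expected to slow generation down, which is the signal to pick a smaller or better-matched draft model or to reduce the number of speculative tokens.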

How to enable and configure speculative decoding for faster generation
