HOW-TO · INF

How to enable and configure speculative decoding for faster generation

Advanced · 20 min · By Fredoline Eruo
PREREQUISITES

vLLM or llama.cpp with speculative decoding support

What this does

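Speculative decoding uses a small "draft" model to propose several tokens ahead, which the large "target" model then verifies in a single parallel forward pass. Accepted tokens are kept, and the first rejected token is replaced by the target's own choice, so the output distribution matches that of the target model alone. Typical speedups are 1.5-3x, depending on how often the draft's proposals are accepted.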
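The draft-then-verify loop can be illustrated with a toy, greedy-only sketch. The names `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical and exist only for this illustration; real engines verify all draft positions in one batched forward pass and use rejection sampling when temperature is above zero.

```python
from typing import Callable, List

# A toy "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]

def toy_target(seq: List[int]) -> int:
    """Stand-in for the large model (arbitrary deterministic rule)."""
    return (sum(seq) * 31 + len(seq)) % 50

def toy_draft(seq: List[int]) -> int:
    """Stand-in for the small model: agrees with the target most of the time."""
    guess = toy_target(seq)
    return guess if len(seq) % 4 else (guess + 1) % 50

def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it appends."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(tokens + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(tokens + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(tokens + accepted))
    return accepted

def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int) -> List[int]:
    """Generate max_new tokens using repeated speculative rounds."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(speculative_step(target, draft, tokens, k))
    return tokens[len(prompt):len(prompt) + max_new]

def greedy(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, for comparison."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens[len(prompt):]
```

In this sketch every kept token is one the target would have chosen itself, which is why the speculative output matches plain greedy decoding with the target.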

Steps

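  1. Enable speculative decoding in vLLM with a draft model. The argument names below follow vLLM releases that accept speculative_model directly; newer releases group these settings under a speculative_config argument, so check the documentation for your installed version.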

    from vllm import LLM, SamplingParams
    
    llm = LLM(
        model="meta-llama/Llama-3.2-3B",      # target
        speculative_model="meta-llama/Llama-3.2-1B",  # draft
        num_speculative_tokens=5,
    )
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate("Explain quantum computing", params)
    print(outputs[0].outputs[0].text)
    
  2. Configure speculative decoding via vLLM API server.

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --speculative-model meta-llama/Llama-3.2-1B \
        --num-speculative-tokens 5 \
        --speculative-draft-tensor-parallel-size 1
    
  3. Use speculative decoding in llama.cpp.

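    # -m is the target model, -md is the draft model.
    # To keep both on the GPU, add -ngl (target) and -ngld (draft) if your build does not offload layers by default.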
    ./llama-speculative \
        -m target-model.gguf \
        -md draft-model.gguf \
        -p "Write a Python function" \
        -n 256 \
        --draft 5
    
  4. Benchmark spec decode vs. standard decode.

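    import time
    from vllm import SamplingParams

    # Fixed sampling settings so both runs do the same amount of work.
    params = SamplingParams(temperature=0.0, max_tokens=256)

    def benchmark(llm, prompt, runs=5):
        """Return the mean wall-clock seconds per generate() call."""
        llm.generate(prompt, params)  # warm-up run, not timed
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            llm.generate(prompt, params)
            times.append(time.perf_counter() - start)
        return sum(times) / len(times)

    # standard_llm: LLM(model=...) built without a draft model.
    # spec_llm: the speculative LLM from step 1, using the same target model.
    # If both do not fit in VRAM at once, benchmark each in its own process.
    standard = benchmark(standard_llm, "Explain AI")
    spec = benchmark(spec_llm, "Explain AI")
    print(f"Standard: {standard:.2f}s, Speculative: {spec:.2f}s, Speedup: {standard/spec:.2f}x")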
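Mean latency alone can mislead when runs generate different numbers of tokens or one run is an outlier. As a minimal sketch, the helpers below (hypothetical names `summarize` and `speedup`, not part of vLLM) turn per-run latencies and token counts into median latency and tokens per second.

```python
import statistics
from typing import Dict, Sequence

def summarize(latencies_s: Sequence[float],
              tokens_generated: Sequence[int]) -> Dict[str, float]:
    """Summarize benchmark runs as median latency and overall tokens/second."""
    if not latencies_s or len(latencies_s) != len(tokens_generated):
        raise ValueError("need one token count per timed run")
    return {
        "median_latency_s": statistics.median(latencies_s),
        "tokens_per_s": sum(tokens_generated) / sum(latencies_s),
    }

def speedup(standard: Dict[str, float], speculative: Dict[str, float]) -> float:
    """Throughput ratio: values above 1.0 mean speculative decoding is faster."""
    return speculative["tokens_per_s"] / standard["tokens_per_s"]
```

With vLLM, the token count for a run can usually be read from the length of `outputs[0].outputs[0].token_ids`; treat that as an assumption to verify against your installed version.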
    

Verification

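# Expected: with greedy decoding (temperature=0), speculative decoding produces the same text as standard decoding,
# apart from rare divergences caused by floating-point differences, with roughly 1.5-3x lower latency.
# With temperature > 0 the output distribution is preserved, but individual samples may differ even with a fixed seed.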
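One way to check the expectation above is to compare the token IDs from a standard run and a speculative run and report where they first differ. This is a minimal sketch; `first_divergence` is a hypothetical helper, and it assumes you have already collected both token sequences under greedy decoding.

```python
from typing import Sequence

def first_divergence(a: Sequence[int], b: Sequence[int]) -> int:
    """Return the index of the first differing token, or -1 if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return -1
```

A result of -1 means the two runs matched exactly. A late divergence in a long generation is more likely numerical noise than a configuration error, while a divergence in the first few tokens suggests the two runs are not using the same target model or sampling settings.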

Common failures

  • Draft model too large: The draft model must be significantly smaller (3-5x fewer parameters) to provide speedup.
  • Low acceptance rate: If the draft model is too different from the target, few draft tokens are accepted. Use a draft model from the same model family.
  • VRAM exhaustion: Both models are loaded at the same time. Use quantized versions or ensure adequate VRAM (the draft model adds ~20-40% overhead).
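The first two failures can be reasoned about with a back-of-the-envelope estimate. The sketch below assumes each draft token is accepted independently with probability `alpha`, that `k` draft tokens are proposed per round, and that one draft step costs `cost_ratio` times one target step. These are simplifying assumptions, and both function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    Uses the geometric-series result (1 - alpha**(k + 1)) / (1 - alpha),
    which assumes independent acceptance with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus one bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Rough speedup over standard decoding.

    Each round costs k draft steps plus one target pass, measured in
    units of one target pass.
    """
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1.0)
```

For example, with `alpha = 0.8` and `k = 5` the estimate is about 3.7 tokens per target pass. A low acceptance rate shrinks the numerator, and a draft model that is too large raises `cost_ratio`, so either one can push the estimated speedup below 1.0.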
