RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
HOW-TO · INF

How to tune the speculation parameters for your hardware setup

advanced · 20 min · By Fredoline Eruo
PREREQUISITES

Speculative decoding enabled, benchmark tools

What this does

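Speculative decoding has several knobs: the number of draft tokens, the acceptance threshold, and the draft/target memory split. This guide works through draft length, draft-model quantization, and draft parallelism one step at a time, using the measured acceptance rate as the diagnostic, to find the fastest configuration for your hardware.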

Steps

  1. Tune num_speculative_tokens (draft length). Start at 5, then sweep from 3 to 20 to find where throughput peaks.

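    import gc
    import time

    import torch
    from vllm import LLM, SamplingParams

    for draft_len in [3, 5, 7, 10, 15, 20]:
        # These keyword arguments match older vLLM releases; newer releases
        # take a speculative_config dict instead, so check your version.
        llm = LLM(
            model="meta-llama/Llama-3.2-3B",
            speculative_model="meta-llama/Llama-3.2-1B",
            num_speculative_tokens=draft_len,
        )
        start = time.perf_counter()
        outputs = llm.generate("Write a story", SamplingParams(max_tokens=512))
        elapsed = time.perf_counter() - start
        # Count the tokens actually generated; the model may stop before max_tokens.
        n_tokens = len(outputs[0].outputs[0].token_ids)
        print(f"Draft={draft_len}: {n_tokens / elapsed:.0f} tok/s")
        # Best-effort release of GPU memory before loading the next configuration.
        del llm
        gc.collect()
        torch.cuda.empty_cache()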
    

    Expected: Throughput rises with draft length, then plateaus or falls once most of the extra draft tokens are rejected. Where it levels off depends on the acceptance rate.
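
The shape of that curve can be sketched with a simple analytical model. This is a rough estimate, not a measurement: it assumes each draft token is accepted independently with a constant rate `alpha`, and `cost_ratio` (the time of one draft pass divided by the time of one target pass) is a hypothetical value you would measure yourself. The function names are illustrative.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target-model pass with gamma draft tokens.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Estimated speedup over plain decoding.

    cost_ratio is the time of one draft pass divided by one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


# Hypothetical inputs: 70% acceptance, draft pass costs 10% of a target pass.
for gamma in [3, 5, 7, 10, 15, 20]:
    print(f"draft={gamma}: estimated speedup {estimated_speedup(0.7, gamma, 0.1):.2f}x")
```

Under these assumptions the estimate peaks at a moderate draft length and then falls, because each extra draft token costs a draft pass but is increasingly unlikely to be accepted.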

  2. Measure acceptance rate to diagnose draft quality.

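    # `llm` is an LLM built with speculative decoding enabled, as in step 1.
    # Where acceptance statistics are exposed depends on the vLLM version. The
    # fields below may be missing, in which case read the speculative-decoding
    # counters from the engine log or metrics endpoint instead.
    output = llm.generate("Explain gravity", SamplingParams(max_tokens=128, temperature=0))
    stats = output[0].metrics
    accepted = getattr(stats, "accepted_tokens", None)
    drafted = getattr(stats, "drafted_tokens", None)
    if accepted is None or not drafted:
        print("Acceptance statistics are not available on this output")
    else:
        print(f"Acceptance rate: {accepted}/{drafted} = {accepted / drafted * 100:.1f}%")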
    

    High acceptance (>70%): draft model is well-matched. Low acceptance (<40%): try a larger or better-aligned draft model.
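
As a small sketch, the thresholds above can be turned into a helper that takes raw accepted and drafted counts from whatever source your setup exposes. The function names and the wording of the middle band (40% to 70%) are assumptions for illustration; the guide only defines the high and low bands.

```python
def acceptance_rate(accepted, drafted):
    """Return accepted/drafted as a fraction, or None if nothing was drafted."""
    if drafted <= 0:
        return None
    return accepted / drafted


def diagnose(rate):
    """Map an acceptance rate to the guidance from this step."""
    if rate is None:
        return "unknown: no draft tokens recorded"
    if rate > 0.70:
        return "high: draft model is well matched"
    if rate < 0.40:
        return "low: try a larger or better-aligned draft model"
    # The middle band is not defined in the guide; this label is an assumption.
    return "moderate: usable, compare draft lengths before changing models"


print(diagnose(acceptance_rate(accepted=312, drafted=400)))  # hypothetical counts
```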

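  3. Adjust the draft/target memory split for limited VRAM. On a 24 GB GPU running both models, a quantized draft model leaves more room for the KV cache:

    # Use a quantized draft model to save VRAM.
    # Llama 3.2 has no 7B checkpoint, so the target here is Llama-3.1-8B.
    # Note: vLLM's quantization flags expect a method name (for example awq,
    # gptq, or gguf) rather than a GGUF quant level such as q4_k_m, and GGUF
    # checkpoints are typically loaded from a local .gguf file. Confirm the
    # accepted values with --help for your version.
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-8B \
        --speculative-model unsloth/Llama-3.2-1B-GGUF \
        --speculative-model-quantization q4_k_m \
        --num-speculative-tokens 5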
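
To see how much the draft model's precision matters, here is a back-of-the-envelope VRAM budget. It is a sketch under stated assumptions: weights are estimated as parameter count times bytes per parameter, the parameter counts (8 billion target, 1 billion draft) are hypothetical round numbers, 4-bit weights are approximated as 0.5 bytes per parameter, and activations and runtime overhead are ignored apart from the reserved headroom.

```python
def weights_gb(params_billion, bytes_per_param):
    """Approximate weight size in GB: billions of params times bytes per param."""
    return params_billion * bytes_per_param


def kv_cache_budget_gb(total_gb, target_gb, draft_gb, headroom=0.20):
    """VRAM left for the KV cache after weights and a reserved headroom fraction."""
    return total_gb * (1.0 - headroom) - target_gb - draft_gb


target = weights_gb(8, 2.0)        # hypothetical 8B target in 16-bit
draft_fp16 = weights_gb(1, 2.0)    # hypothetical 1B draft in 16-bit
draft_4bit = weights_gb(1, 0.5)    # same draft at roughly 4 bits per weight

print(f"KV budget, 16-bit draft: {kv_cache_budget_gb(24, target, draft_fp16):.1f} GB")
print(f"KV budget, 4-bit draft:  {kv_cache_budget_gb(24, target, draft_4bit):.1f} GB")
```

A negative result means the pair does not fit with that headroom, and either the target or the draft needs to shrink.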
    
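  4. Set speculative-draft-tensor-parallel-size for multi-GPU setups.

    # Llama 3.2 has no 70B or 8B text checkpoints, so the Llama 3.1 models are used here.
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-70B \
        --tensor-parallel-size 4 \
        --speculative-model meta-llama/Llama-3.1-8B \
        --num-speculative-tokens 5 \
        --speculative-draft-tensor-parallel-size 1
    

    Running the draft model on a single GPU avoids inter-GPU communication overhead on the small model's forward passes, while the target stays sharded across all four GPUs.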
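
If you want to compare several server configurations, a small helper can assemble the command line for each one. This is an illustrative sketch: the function name is made up, the model ids in the usage line are placeholders, and the flags are the ones used in this guide, which may differ in your vLLM version.

```python
import shlex


def build_server_command(model, draft_model, num_speculative_tokens,
                         tensor_parallel_size=1, draft_tensor_parallel_size=1):
    """Assemble the API-server argv using the flags shown in this guide."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--speculative-model", draft_model,
        "--num-speculative-tokens", str(num_speculative_tokens),
        "--speculative-draft-tensor-parallel-size", str(draft_tensor_parallel_size),
    ]


# Placeholder model ids; substitute your own target and draft checkpoints.
cmd = build_server_command("target-model-id", "draft-model-id", 5,
                           tensor_parallel_size=4)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.Popen` directly, which avoids shell quoting issues.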

  5. Run a sweep across all parameter combinations.

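    import time
    from vllm import LLM, SamplingParams

    target_model = "meta-llama/Llama-3.2-3B"  # same pair as step 1
    draft_model = "meta-llama/Llama-3.2-1B"

    def benchmark(llm, max_tokens=512):
        # Tokens per second for one fixed prompt, counting tokens actually generated.
        start = time.perf_counter()
        out = llm.generate("Write a story", SamplingParams(max_tokens=max_tokens))
        return len(out[0].outputs[0].token_ids) / (time.perf_counter() - start)

    results = []
    for draft_len in [3, 5, 7, 10]:
        # Quantization labels are kept from this guide; use values your vLLM build accepts.
        for draft_q in [None, "q4_k_m", "q8_0"]:
            llm = LLM(
                model=target_model,
                speculative_model=draft_model,
                num_speculative_tokens=draft_len,
                speculative_model_quantization=draft_q,
            )
            results.append((draft_len, draft_q, benchmark(llm)))
            del llm  # best-effort release before loading the next configuration
    for draft_len, draft_q, tps in results:
        print(f"draft={draft_len}, q={draft_q}, tps={tps:.0f}")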
    

Verification

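# Expected output: a table of (draft_length, quantization, tok/s) rows, from which you pick the best pair for your hardware
# Example (illustrative numbers): Best config: draft=7, q4_k_m, 85 tok/s (vs 35 tok/s without speculation)
# Measure the no-speculation baseline with the same prompt and settings so the comparison is fair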
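
One way to produce that table from sweep results is sketched below. The function names are illustrative, and the numbers in the usage example are made up to mirror the example line above; they are not benchmark results.

```python
def summarize(results, baseline_tps):
    """Format (draft_len, quant, tok/s) rows as a table, fastest first."""
    lines = [f"{'draft':>5} {'quant':>8} {'tok/s':>7} {'speedup':>8}"]
    for draft_len, quant, tps in sorted(results, key=lambda r: r[2], reverse=True):
        lines.append(
            f"{draft_len:>5} {str(quant):>8} {tps:>7.1f} {tps / baseline_tps:>7.2f}x"
        )
    return "\n".join(lines)


def best_config(results):
    """Return the (draft_len, quant, tok/s) row with the highest throughput."""
    return max(results, key=lambda r: r[2])


# Illustrative numbers only, not measurements.
results = [(5, None, 70.0), (7, "q4_k_m", 85.0), (10, "q4_k_m", 78.0)]
print(summarize(results, baseline_tps=35.0))
draft_len, quant, tps = best_config(results)
print(f"Best config: draft={draft_len}, {quant}, {tps:.0f} tok/s")
```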

Common failures

  • No speedup at low draft lengths: Fewer than 3 draft tokens adds drafting overhead without verifying enough tokens per target pass to pay for it. A draft length of about 5 is a reasonable minimum starting point.
  • Draft model too large reduces memory for KV cache: The draft model shares VRAM with the target. Leave at least 20% VRAM headroom.
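  • Determinism differs: With non-zero temperature, speculative decoding can produce different outputs than standard decoding even with the same seed, because the accept/reject step consumes random numbers differently. The output distribution is intended to match; individual samples are not.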
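
The determinism point can be illustrated with a toy version of the standard speculative sampling accept/reject rule. This is a sketch of the published technique on a three-token vocabulary, not vLLM's implementation; the probabilities are made up, and `p` and `q` stand for the target and draft distributions.

```python
import random


def speculative_sample(p, q, rng):
    """Draw one token: the draft proposes from q, the target verifies against p."""
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]          # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):       # accept with prob min(1, p/q)
        return x
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0]


p = [0.7, 0.2, 0.1]   # hypothetical target distribution
q = [0.4, 0.4, 0.2]   # hypothetical draft distribution
rng = random.Random(0)
n = 20000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
print([round(c / n, 3) for c in counts])
```

The empirical frequencies track `p`, yet each sample uses two or three random draws instead of one, which is why a fixed seed does not reproduce the same token sequence as plain sampling.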

Related guides

  • How to enable and configure speculative decoding for faster generation
  • How to enable prompt caching to speed up repeated queries