RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Model Optimization for Local Inference

08. Draft Models

Chapter 8 of 18 · 20 min
KEY INSIGHT

Draft model quality is the primary determinant of speculative decoding speedup. The target model sets the acceptance bar: the more closely the draft's predictions match the target's, the more drafted tokens survive verification.

Draft model selection fundamentally determines speculative decoding performance. The relationship between draft and target models involves architecture compatibility, parameter size ratios, and capability overlap.

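Architecture matching simplifies deployment. When draft and target come from the same model family, they share a tokenizer and vocabulary, so drafted token IDs can be verified directly. An independently trained draft cannot hand its KV cache to the target: keys and values depend on each model's own weights and hidden size, so each model keeps its own cache, and the target computes keys and values for the drafted tokens during its single verification pass. Cache reuse applies only to self-speculative setups, where the draft is built from the target's own layers or hidden states.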

Parameter size ratios typically range from 1:10 to 1:5 (draft:target). A 70B target commonly uses a 7B draft. Larger drafts (e.g., 13B) sometimes outperform 7B drafts when their accuracy justifies the additional computation per speculation round.
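
The trade-off between draft size and acceptance can be made concrete with the commonly used expected-speedup estimate for speculative decoding. The sketch below assumes each drafted token is accepted independently with probability `alpha`, and that one draft forward pass costs a fixed fraction `cost_ratio` of a target forward pass. Both function names and all numeric inputs are illustrative assumptions, not measurements.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification round.

    alpha: probability that a drafted token is accepted (assumed i.i.d.)
    gamma: number of tokens drafted per round
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding with the target.

    cost_ratio: time of one draft pass divided by time of one target pass
    """
    round_cost = gamma * cost_ratio + 1.0  # gamma draft passes + 1 target pass
    return expected_tokens_per_round(alpha, gamma) / round_cost


# Illustrative numbers only: a small draft vs. a larger, more accurate draft.
small = expected_speedup(alpha=0.70, gamma=4, cost_ratio=0.10)
large = expected_speedup(alpha=0.80, gamma=4, cost_ratio=0.20)
print(f"small draft: {small:.2f}x, large draft: {large:.2f}x")
```

Under this model, a larger draft only wins when its gain in acceptance rate outweighs its higher per-token cost, so both quantities should be measured on your own hardware before choosing.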

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_ID = "codellama/CodeLlama-7b-Python-hf"
TARGET_ID = "codellama/CodeLlama-70b-Python-hf"

# Draft model configuration for coding tasks
draft = AutoModelForCausalLM.from_pretrained(
    DRAFT_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Target model for production inference
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Verify compatibility. hidden_size and num_attention_heads are expected to
# differ between a 7B and a 70B model, so they are not checked here. What must
# match is the token-ID mapping, because verification compares token IDs.
draft_vocab = AutoTokenizer.from_pretrained(DRAFT_ID).get_vocab()
target_vocab = AutoTokenizer.from_pretrained(TARGET_ID).get_vocab()
assert draft_vocab == target_vocab, "draft and target tokenizers differ"

A custom-trained draft model can improve acceptance rate by 5-15% over an off-the-shelf pre-trained draft. The training dataset should match the target's distribution: code drafts for code targets, prose drafts for language models.

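# Draft model training configuration
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./draft-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size: 8 * 4 = 32 per device
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
)

# Critical: match the target's training distribution.
# The dataset text field and the sequence-length limit are not
# TrainingArguments fields. They belong to TRL's SFTConfig (a subclass of
# TrainingArguments) or are applied when tokenizing the dataset.
DATASET_TEXT_FIELD = "text"
MAX_SEQ_LENGTH = 2048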

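Running a draft-target pair efficiently requires an engine with speculative decoding support, such as vLLM, or a custom implementation. The engine maintains a KV cache for each model: the draft proposes several tokens, the target scores them in one forward pass, and cache entries for rejected tokens are discarded.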

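# vLLM speculative decoding with a separate draft model
# Note: these keyword arguments match older vLLM releases; newer releases
# take a speculative_config dict instead. Check the docs for your version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    speculative_model="meta-llama/Llama-2-7b-hf",  # Draft model
    num_speculative_tokens=4,  # Draft tokens per round
    tensor_parallel_size=2,    # Multi-GPU
)

params = SamplingParams(temperature=0, max_tokens=256)  # greedy decoding
results = llm.generate("Write Python code for quicksort", params)
print(results[0].outputs[0].text)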
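
To show what happens inside one speculation round, here is a minimal sketch of greedy draft-and-verify with toy stand-in models. It ignores KV cache management and sampling entirely; `speculative_round`, `target_fn`, and `draft_fn` are hypothetical names for illustration, not part of vLLM or any other library.

```python
from typing import Callable, List

# A "model" here is a hypothetical stand-in: a function mapping a token-ID
# prefix to the single next token ID it would pick greedily.
NextToken = Callable[[List[int]], int]


def speculative_round(
    prefix: List[int], draft: NextToken, target: NextToken, k: int
) -> List[int]:
    """Run one greedy draft-and-verify round and return the tokens to append."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position. A real engine scores all k
    #    positions in one batched forward pass; this loop is only for clarity.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # 3. All k drafts accepted: the target contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def target_fn(seq: List[int]) -> int:
    """Toy target: always counts upward."""
    return seq[-1] + 1


def draft_fn(seq: List[int]) -> int:
    """Toy draft: agrees with the target except right after token 2."""
    return 0 if seq[-1] == 2 else seq[-1] + 1


out = speculative_round([0], draft_fn, target_fn, k=4)
print(out)  # prints [1, 2, 3]
```

Two of the four drafted tokens are accepted and the target supplies the third, so the output is exactly what greedy decoding with the target alone would have produced. That equivalence is what makes the technique a pure speed optimization.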

Failure modes to anticipate:

Draft divergence: A draft whose predictions diverge significantly from the target's produces low acceptance rates. Monitor accept_rate (accepted draft tokens divided by proposed draft tokens) during inference; a value below 0.5 indicates problematic divergence.

Context sensitivity: Some drafts perform well on short contexts but degrade on long contexts. Test acceptance rates across your expected context length range.

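Tokenizer mismatch: Standard speculative decoding compares draft and target token IDs directly, so the two models must share a vocabulary. Even small differences, such as extra special tokens, cause subtle acceptance failures. Always verify vocabulary compatibility before deployment.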
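
The first two failure modes can be checked together by tracking acceptance rate per context-length range. The sketch below assumes you can log, for each verification round, the context length, the number of drafted tokens, and the number accepted. That record format, the function names, and the sample data are hypothetical; how the counts are collected depends on your serving stack.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Each record is (context_length, drafted_tokens, accepted_tokens) for one
# verification round. This format is an assumption made for illustration.
Record = Tuple[int, int, int]


def acceptance_by_bucket(
    records: Iterable[Record], bucket_size: int = 1024
) -> Dict[int, float]:
    """Return acceptance rate keyed by the lower bound of each context bucket."""
    drafted: Dict[int, int] = defaultdict(int)
    accepted: Dict[int, int] = defaultdict(int)
    for context_len, n_drafted, n_accepted in records:
        bucket = (context_len // bucket_size) * bucket_size
        drafted[bucket] += n_drafted
        accepted[bucket] += n_accepted
    return {b: accepted[b] / drafted[b] for b in sorted(drafted) if drafted[b] > 0}


def flag_divergent(rates: Dict[int, float], threshold: float = 0.5) -> List[int]:
    """List buckets whose acceptance rate falls below the threshold."""
    return [bucket for bucket, rate in rates.items() if rate < threshold]


# Made-up records: acceptance drops for contexts of 2048 tokens and longer.
records = [(200, 4, 3), (900, 4, 4), (1500, 4, 3), (3000, 4, 1), (3500, 4, 2)]
rates = acceptance_by_bucket(records)
print(rates)                  # {0: 0.875, 1024: 0.75, 2048: 0.25, 3072: 0.5}
print(flag_divergent(rates))  # [2048]
```

The default threshold of 0.5 follows the guideline above. Bucketing by context length makes it visible when a draft that looks healthy on average is failing only on long prompts.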

EXERCISE

Compare acceptance rates between three draft candidates for your target model: a 7B model from the same family, a 7B model from a different family, and a 13B model. Analyze what architectural differences cause the variance.
