Draft Models — Model Optimization for Local Inference (Chapter 8)

Draft model selection fundamentally determines speculative decoding performance. The relationship between draft and target models involves architecture compatibility, parameter size ratios, and capability overlap.

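Architecture matching matters mainly for tokenizer and vocabulary compatibility, because the target verifies the draft's proposals token by token. A separately trained draft and target each keep their own KV cache: their weights, layer counts, and hidden sizes differ, so the target cannot reuse the keys and values the draft computed. Cache reuse becomes possible only in self-speculative designs, where the draft is a subset of the target's own layers (for example, early-exit drafting). The saving on accepted tokens then depends on how many layers are shared; treat a figure such as 50% as a best case to be measured rather than assumed.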
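
Running a draft alongside the target means budgeting KV cache memory for both models. The following is a minimal sketch of the standard per-token KV cache size calculation. The layer counts, KV head counts, and head dimension are assumptions taken from the published Llama-2 7B and 70B configurations, and `kv_bytes_per_token` is a hypothetical helper, not a library function.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache stored per token (factor 2 covers keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed shapes, fp16 storage (2 bytes per element).
draft_bytes = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
target_bytes = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

print(f"draft:  {draft_bytes / 2**20:.4f} MiB per token")
print(f"target: {target_bytes / 2**20:.4f} MiB per token")
```

Multiplying each figure by the maximum context length and batch size gives a rough memory budget for the pair.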

Parameter size ratios typically range from 1:10 to 1:5 (draft:target). A 70B target commonly uses a 7B draft. Larger drafts (e.g., 13B) sometimes outperform 7B drafts when their accuracy justifies the additional computation per speculation round.
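
The trade-off between draft accuracy and draft cost can be made concrete with the usual analytical model of speculative decoding. The sketch below assumes each draft token is accepted independently with probability `alpha`, and that one draft step costs `cost_ratio` times one target step. The function names and all numeric inputs are illustrative assumptions, not measurements.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per round when k tokens are drafted and each
    is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding with the target."""
    round_cost = k * cost_ratio + 1.0  # k draft steps plus one target pass
    return expected_tokens_per_round(alpha, k) / round_cost


# Hypothetical acceptance rates and cost ratios, for illustration only.
small = estimated_speedup(alpha=0.70, k=4, cost_ratio=0.10)  # e.g. a 7B draft
large = estimated_speedup(alpha=0.80, k=4, cost_ratio=0.19)  # e.g. a 13B draft
print(f"smaller draft: {small:.2f}x, larger draft: {large:.2f}x")
```

Whether the larger draft comes out ahead depends entirely on the measured acceptance rate and cost ratio, so substitute values from your own hardware before drawing conclusions.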

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Draft model configuration for coding tasks
draft = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Python-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Target model for production inference
target = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-70b-Python-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

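# Verify vocabulary compatibility. Hidden size and head count normally differ
# between a 7B and a 70B model, so asserting that they match would fail.
draft_tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Python-hf")
target_tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-70b-Python-hf")
assert draft_tok.get_vocab() == target_tok.get_vocab(), "draft and target vocabularies differ"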

Training a custom draft model can improve the acceptance rate by 5-15% over an off-the-shelf pre-trained draft. The training dataset should match the target's distribution: code data for code targets, prose data for general-language targets.

# Draft model training configuration
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./draft-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
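    # Critical: match the target's training distribution when building the dataset.
    # The text field and maximum sequence length are not TrainingArguments
    # parameters; apply them when tokenizing (or via TRL's SFTConfig with SFTTrainer).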
)

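vLLM has shipped support for draft-model speculative decoding; other serving stacks may need a custom implementation. In this setup the draft and target maintain separate KV caches. The draft proposes a few tokens, and the target scores all of them in a single forward pass, keeping the cache entries for accepted tokens and discarding the rest.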

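# vLLM speculative decoding with a separate draft model.
# Note: these keyword arguments match older vLLM releases; newer releases
# take a speculative_config dict instead, so check the docs for your version.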
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    speculative_model="meta-llama/Llama-2-7b-hf",  # Draft model
    num_speculative_tokens=4,  # Draft tokens per round
    tensor_parallel_size=2,    # Multi-GPU
)

results = llm.generate("Write Python code for quicksort", SamplingParams(temperature=0))
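
The draft-then-verify loop that a configuration like the one above drives can be sketched in plain Python for the greedy (temperature 0) case. This is a toy illustration under stated assumptions, not vLLM's implementation: `speculate_round`, `toy_draft`, and `toy_target` are hypothetical names, and the "models" are simple functions over integer token IDs. A real system verifies all proposed positions in one batched target forward pass instead of a Python loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculate_round(prefix: List[int], draft_next: NextToken,
                    target_next: NextToken, k: int) -> Tuple[List[int], int]:
    """Run one greedy speculation round.

    Returns (tokens emitted this round, number of draft tokens accepted).
    """
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal against its own greedy choice.
    emitted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: emit the target's token and stop the round.
            emitted.append(expected)
            return emitted, len(emitted) - 1
        emitted.append(tok)
        ctx.append(tok)

    # 3. Every proposal was accepted; the target contributes one extra token.
    emitted.append(target_next(ctx))
    return emitted, k


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens, n_accepted = speculate_round([1], toy_draft, toy_target, k=4)
print(tokens, n_accepted)
```

Because every emitted token is either confirmed or supplied by the target, the output matches what greedy decoding with the target alone would produce; the draft only affects how many tokens each round yields.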

Failure modes to anticipate:

Draft divergence: A draft whose predictions diverge significantly from the target's produces low acceptance rates. Monitor the acceptance rate (accept_rate) during inference; as a rule of thumb, a value below 0.5 indicates problematic divergence.

Context sensitivity: Some drafts perform well on short contexts but degrade on long contexts. Test acceptance rates across your expected context length range.

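Tokenizer mismatch: Verification compares token IDs, so a draft with a different tokenizer or vocabulary produces proposals the target cannot score correctly, which shows up as subtle acceptance failures. Always verify vocabulary compatibility before deployment.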
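
The first two failure modes can be caught with lightweight bookkeeping. The sketch below tracks acceptance rate per context-length bucket and flags buckets that fall under a threshold. `AcceptanceTracker` is a hypothetical helper, the recorded counts are made-up sample values, and how you obtain per-round proposed and accepted counts depends on your serving stack.

```python
from collections import defaultdict
from typing import Dict, List


class AcceptanceTracker:
    """Accumulates draft acceptance statistics per context-length bucket."""

    def __init__(self, bucket_size: int = 1024) -> None:
        self.bucket_size = bucket_size
        self._proposed: Dict[int, int] = defaultdict(int)
        self._accepted: Dict[int, int] = defaultdict(int)

    def record(self, context_len: int, proposed: int, accepted: int) -> None:
        """Record one speculation round at the given context length."""
        bucket = context_len // self.bucket_size
        self._proposed[bucket] += proposed
        self._accepted[bucket] += accepted

    def rates(self) -> Dict[int, float]:
        """Acceptance rate keyed by the bucket's starting context length."""
        return {
            bucket * self.bucket_size: self._accepted[bucket] / self._proposed[bucket]
            for bucket in sorted(self._proposed)
            if self._proposed[bucket] > 0
        }

    def flagged(self, threshold: float = 0.5) -> List[int]:
        """Bucket start lengths whose acceptance rate is below the threshold."""
        return [start for start, rate in self.rates().items() if rate < threshold]


# Made-up sample rounds: (context length, tokens proposed, tokens accepted).
tracker = AcceptanceTracker(bucket_size=1024)
for ctx_len, proposed, accepted in [(100, 4, 3), (500, 4, 4), (3000, 4, 1)]:
    tracker.record(ctx_len, proposed, accepted)

print(tracker.rates())
print(tracker.flagged(threshold=0.5))
```

Bucketing by context length keeps a healthy short-context average from hiding degradation at long contexts.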