03. Inference-Time Compute Scaling
The fundamental innovation in reasoning models is compute allocation at inference time. Rather than spending roughly the same amount of compute on every request, reasoning models generate more tokens, and therefore spend more compute, on harder problems. This chapter covers the mechanics and the implications for operators.
The Scaling Hypothesis
Research from 2024 suggested that pre-training compute scaling is hitting diminishing returns. However, test-time compute scaling remains effective: a model that thinks longer before answering often produces better answers. The effect is not unlimited. Quality plateaus, and beyond that point additional thinking tokens provide minimal benefit.
The practical implication: you can trade latency for accuracy without retraining. If accuracy improves only marginally beyond 1000 reasoning tokens, the extra latency past that point may be acceptable for some applications but not for others.
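One way to make this tradeoff concrete is to measure accuracy at several reasoning budgets and stop where the curve flattens. The sketch below assumes you already have such measurements. The function name, the `min_gain` threshold, and the numbers are illustrative and do not come from any benchmark.

```python
def pick_reasoning_budget(accuracy_by_budget, min_gain=0.01):
    """Return the smallest budget after which the next step up adds
    less than `min_gain` accuracy (a simple plateau heuristic)."""
    budgets = sorted(accuracy_by_budget)
    for current, larger in zip(budgets, budgets[1:]):
        gain = accuracy_by_budget[larger] - accuracy_by_budget[current]
        if gain < min_gain:
            return current
    # No plateau found within the measured range
    return budgets[-1]


# Made-up numbers for illustration only: reasoning tokens -> accuracy
measurements = {250: 0.61, 500: 0.70, 1000: 0.76, 2000: 0.765, 4000: 0.767}
chosen = pick_reasoning_budget(measurements)
print(chosen)
```

With these illustrative numbers the heuristic settles on 1000 tokens, because doubling to 2000 adds only half a point of accuracy. A latency-sensitive application might use a larger `min_gain`, and an accuracy-critical one a smaller value.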
How Reasoning Allocation Works
When R1 processes a problem, it generates tokens into a "reasoning buffer" that isn't visible in the final output. These tokens represent work: decomposition, verification, backtracking, alternative exploration. The model decides internally when the reasoning is complete and switches to output tokens.
    # Simplified representation of reasoning token generation.
    # `model` and `extract_final_answer` are hypothetical stand-ins,
    # not a real library API.
    def generate_with_reasoning(model, prompt, max_reasoning_tokens=4096):
        # `prompt` is assumed to be a list of tokens
        reasoning_buffer = []
        for _ in range(max_reasoning_tokens):
            # Condition on the prompt first, then on the reasoning so far
            next_token = model.forward(prompt + reasoning_buffer)
            reasoning_buffer.append(next_token)
            # Model has internal signal for "reasoning complete"
            if model.is_reasoning_complete(reasoning_buffer):
                break
        # Extract final answer from reasoning buffer
        return extract_final_answer(reasoning_buffer)
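On the serving side, hiding the reasoning usually means separating it from the answer before returning a response. The sketch below assumes the reasoning arrives wrapped in a `<think>`/`</think>` marker pair. That delimiter is an assumption for illustration, since this chapter does not specify how the buffer is marked, and `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(raw_text, open_tag="<think>", close_tag="</think>"):
    """Split raw model text into (reasoning, answer).

    The tag names are assumed for illustration. If no well-formed
    tag pair is found, everything is treated as the answer.
    """
    start = raw_text.find(open_tag)
    end = raw_text.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", raw_text.strip()
    reasoning = raw_text[start + len(open_tag):end].strip()
    # Keep whatever sits outside the tag pair as the visible answer
    answer = (raw_text[:start] + raw_text[end + len(close_tag):]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2+2 is 4</think>The answer is 4.")
print(answer)
```

Keeping the reasoning string rather than discarding it lets you log its length, which is the quantity that drives the latency effects discussed later in this chapter.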
Budget Forcing
An advanced technique involves forcing the model to use a specific number of reasoning tokens regardless of its internal assessment. This "budget forcing" can improve consistency, because the model cannot terminate early on hard problems where more reasoning would help, and it cannot run past a fixed upper limit on easy ones.
Implementation typically involves:
- Sampling with a minimum token count, suppressing the EOS token until that count is reached
- Sampling with a maximum token count, truncating if exceeded
- Comparing outcomes at different budget levels to find optimal tradeoffs
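The steps above can be sketched as a small sampling loop. This is a minimal illustration and not a reference implementation: `next_token_fn` stands in for one decode step of a real model, `END_OF_REASONING` is a hypothetical marker token, and replacing a suppressed stop with the filler word "Wait" is just one possible way to make the model continue.

```python
END_OF_REASONING = "<end>"  # hypothetical marker, not a real tokenizer symbol


def sample_with_budget(next_token_fn, min_tokens, max_tokens):
    """Generate reasoning tokens while enforcing a min and max count.

    next_token_fn(tokens) -> str is an assumed stand-in for one decode step.
    """
    if min_tokens > max_tokens:
        raise ValueError("min_tokens must not exceed max_tokens")
    tokens = []
    while len(tokens) < max_tokens:  # hard cap: truncate if exceeded
        token = next_token_fn(tokens)
        if token == END_OF_REASONING:
            if len(tokens) >= min_tokens:
                break  # model is done and the minimum is satisfied
            token = "Wait"  # suppress the early stop (filler is arbitrary)
        tokens.append(token)
    return tokens


def toy_model(tokens):
    # Toy stand-in: tries to stop whenever the length is 2 mod 3
    return END_OF_REASONING if len(tokens) % 3 == 2 else "step"


# Compare how many tokens are used at different minimum budgets
for budget in (2, 5, 8):
    used = sample_with_budget(toy_model, min_tokens=budget, max_tokens=8)
    print(budget, len(used))
```

In a real evaluation you would replace the token counts printed here with task accuracy at each budget, then pick the budget where the accuracy gain no longer justifies the added latency.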
Implications for Serving Infrastructure
Test-time compute scaling changes the latency profile of your service. Traditional LLM serving assumes token generation rate is roughly constant. With reasoning models, you have variable-length thinking phases followed by output phases. Your autoscaling and latency SLOs must account for this variability.
Profile your current service's latency distribution. If you added reasoning with 500 additional tokens for hard cases, what would your p99 latency become? Would this be acceptable for your users?
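A rough first estimate can come from your existing latency samples before you deploy anything. The sketch below makes several simplifying assumptions: decode time per token is constant, hard cases are spread randomly across requests regardless of their baseline latency, and queueing effects are ignored. The function names, the 20 ms per token figure, and the 10% hard-case rate are all hypothetical.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def add_reasoning_latency(latencies_ms, hard_fraction, extra_tokens,
                          ms_per_token, seed=0):
    """Add reasoning time to a random `hard_fraction` of requests."""
    rng = random.Random(seed)  # seeded so the estimate is repeatable
    extra_ms = extra_tokens * ms_per_token
    return [lat + extra_ms if rng.random() < hard_fraction else lat
            for lat in latencies_ms]


# Synthetic baseline for illustration: 100 requests, 200 ms to 695 ms
baseline = [200 + 5 * i for i in range(100)]
with_reasoning = add_reasoning_latency(
    baseline, hard_fraction=0.1, extra_tokens=500, ms_per_token=20)

print("baseline p99:", percentile(baseline, 99))
print("with reasoning p99:", percentile(with_reasoning, 99))
```

Once the hard-case fraction exceeds about 1%, the p99 is dominated by reasoning requests, so the tail shifts by close to the full reasoning time even though the median barely moves. Replace the synthetic baseline with your own measured samples to get a number worth acting on.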