HOW-TO · DEV
How to allocate token budgets across system prompts, context window, and responses
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Understanding of model context limits and basic familiarity with the target LLM API configuration
What this does
This guide describes a strategy for partitioning a model's total token context window into three functional buckets: system prompt (instructions), conversation history (context), and response generation (output). Allocating these budgets correctly prevents truncation, reduces unnecessary API costs, and keeps model behavior aligned with the intended task.
Steps
- Determine the total context window size for the target model. For example, a 4096-token context model has 4096 slots for all input and output combined.
- Define a fixed system prompt budget. Reserve 200–500 tokens for the system prompt depending on instruction complexity. Store this value in a named constant.
- Define a fixed response budget. Reserve 256–1024 tokens for the expected response length, depending on task type. This prevents the model from consuming all remaining tokens with a single reply.
- Calculate the available conversation history budget:
context_window - system_budget - response_budget. For a 4096 window with 400-system and 512-response budgets, 3184 tokens remain for history. - Encode conversation history into the prompt, truncating from the oldest messages first when the budget is exceeded.
- Instrument the pipeline to log actual token counts for each category during development to validate allocations.
- Adjust budgets per task type: longer analytical tasks warrant higher response budgets; short Q&A tasks can allocate more to history.
- Set soft limits in the API call (e.g.,
max_tokensparameter) to enforce the response budget hard stop.
Verification
python3 -c "
context = 4096
system_budget = 400
response_budget = 512
available = context - system_budget - response_budget
print(f'Available for history: {available} tokens')
"
# Expected output: Available for history: 3184 tokens
Common failures
- History overflow: Conversation history exceeds the budget, causing truncation of recent messages and loss of critical context. Solution: implement a ring buffer that drops the oldest messages first, preserving the most recent exchanges.
- Response truncation: Setting
max_tokenstoo low causes the response to be cut mid-sentence. Solution: estimate required response length from task type and setmax_tokensto at leastestimated_output * 1.5. - System prompt bloat: Adding verbose instructions to the system prompt consumes tokens that should belong to conversation history. Solution: keep system prompts concise; extract detail into separate documents referenced by the system prompt.
- No budget monitoring: Pipeline runs without token counting, leading to unpredictable truncation. Solution: add per-request token logging in development mode.