HOW-TO · DEV
How to use a sliding context window strategy to manage long conversations within token limits
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Application using an LLM with conversation history, ability to send API requests
What this does
A sliding context window strategy maintains conversation history within a fixed token budget by discarding the oldest messages once the limit is approached. This approach preserves the most recent and relevant context while preventing token overflow errors and unnecessary cost increases that accompany sending the full conversation on every request.
Steps
- Set a
max_context_tokensvalue based on the model's context window minus system prompt and response budgets. For example, 3000 tokens for a 4096 total window. - Maintain a list of conversation messages, each with a precomputed token count attribute.
- Before sending a request, sum the token counts of all messages in the list.
- If the total exceeds
max_context_tokens, remove the oldest message(s) from the list until the total fits within the budget. - Preserve critical messages—system instructions, user preferences, or pinned context—by marking them as immutable so they are never dropped during the sliding process.
- Prepend the trimmed history to the outgoing API request along with the system prompt.
- Append the model's response to the conversation history after each turn.
- After trimming, log which messages were removed so developers can audit context loss during development.
Verification
python3 -c "
messages = [{'role': 'user', 'tokens': 120}, {'role': 'assistant', 'tokens': 80}, {'role': 'user', 'tokens': 150}]
max_context = 250
total = sum(m['tokens'] for m in messages)
trimmed = []
for m in messages:
if sum(x['tokens'] for x in trimmed) + m['tokens'] <= max_context:
trimmed.append(m)
print(f'Retained: {len(trimmed)} messages, {sum(m[\"tokens\"] for m in trimmed)} tokens')
"
# Expected output: Retained: 2 messages, 200 tokens
Common failures
- Dropping critical early context: Long-running conversations lose initial requirements or constraints embedded in early messages. Solution: implement a priority tagging system where early messages can be flagged as immovable.
- Forgetting to update token counts after edits: Manually editing message content without recalculating the token count causes the sliding logic to use stale data. Solution: always recompute token counts from raw text before summing.
- No sentinel for context loss: The application continues silently dropping messages with no visible indicator, causing confusing model behavior. Solution: emit a warning log entry whenever a message is removed from the sliding window.
- Trimming mid-conversation turn: Removing a message between a multi-part user request and its associated assistant response breaks logical continuity. Solution: atomic trimming: only remove messages in complete request-response pairs.