What this does

A sliding context window strategy maintains conversation history within a fixed token budget by discarding the oldest messages once the limit is approached. This approach preserves the most recent and relevant context while preventing token overflow errors and unnecessary cost increases that accompany sending the full conversation on every request.

Steps

Set a max_context_tokens value based on the model's context window minus system prompt and response budgets. For example, 3000 tokens for a 4096 total window.
Maintain a list of conversation messages, each with a precomputed token count attribute.
Before sending a request, sum the token counts of all messages in the list.
If the total exceeds max_context_tokens, remove the oldest message(s) from the list until the total fits within the budget.
Preserve critical messages—system instructions, user preferences, or pinned context—by marking them as immutable so they are never dropped during the sliding process.
Prepend the trimmed history to the outgoing API request along with the system prompt.
Append the model's response to the conversation history after each turn.
After trimming, log which messages were removed so developers can audit context loss during development.

Verification

python3 -c "
messages = [{'role': 'user', 'tokens': 120}, {'role': 'assistant', 'tokens': 80}, {'role': 'user', 'tokens': 150}]
max_context = 250
total = sum(m['tokens'] for m in messages)
trimmed = []
for m in messages:
    if sum(x['tokens'] for x in trimmed) + m['tokens'] <= max_context:
        trimmed.append(m)
print(f'Retained: {len(trimmed)} messages, {sum(m[\"tokens\"] for m in trimmed)} tokens')
"
# Expected output: Retained: 2 messages, 200 tokens

Common failures

Dropping critical early context: Long-running conversations lose initial requirements or constraints embedded in early messages. Solution: implement a priority tagging system where early messages can be flagged as immovable.
Forgetting to update token counts after edits: Manually editing message content without recalculating the token count causes the sliding logic to use stale data. Solution: always recompute token counts from raw text before summing.
No sentinel for context loss: The application continues silently dropping messages with no visible indicator, causing confusing model behavior. Solution: emit a warning log entry whenever a message is removed from the sliding window.
Trimming mid-conversation turn: Removing a message between a multi-part user request and its associated assistant response breaks logical continuity. Solution: atomic trimming: only remove messages in complete request-response pairs.

How to use a sliding context window strategy to manage long conversations within token limits

What this does

Steps

Verification

Common failures

Related guides