Advanced RAG — Chunking, Retrieval, Re-ranking

16. Dynamic Context Windows

Chapter 16 of 24 · 20 min

KEY INSIGHT

Context window allocation should adapt to query complexity rather than using fixed limits.

### Fixed vs. Dynamic Allocation

A naive RAG system fills the context window using a fixed token budget (e.g., 4096 tokens) or a fixed chunk count (e.g., the top 5 chunks). This fails when simple queries need only 2 chunks and complex queries need 15: the fixed setting pads the first with irrelevant text and starves the second. Dynamic context windows scale the retrieval budget with query complexity.

### Complexity Scoring

Query complexity can be estimated from several signals:

| Signal | Metric | Complexity Indicator |
|---|---|---|
| Token count | `len(encoding.encode(query))` | Longer query → more constraints to address |
| Sub-query count | Number of facets detected | More facets → more context needed |
| Entity count | NER-extracted named entities | More entities → more cross-referencing needed |
| Question type | "compare", "analyze", "list" | Compare/list → multi-document synthesis |
| Retrieval score variance | Std dev of retrieval scores | High variance → one confident answer (simpler); low variance → no clear winner (harder) |

The scoring function below implements only two of these signals: token count and retrieval score variance.

```python
import numpy as np
import tiktoken

# Highest raw score the two signals below can produce (2.0 + 1.0).
# Raise this if you add more signals from the table.
MAX_RAW_SCORE = 3.0

def estimateQueryComplexity(query: str, retrieval_results: list[dict]) -> dict:
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = len(encoding.encode(query))

    score = 0.0

    # Signal 1: Token count
    score += min(tokens / 100, 2.0)  # Cap at 2 points

    # Signal 2: Score variance of top results
    if retrieval_results:
        scores = [r.get("score", 0) for r in retrieval_results]
        # Low variance = no clear winner = harder (worth up to 1 point)
        score += max(1.0 - float(np.std(scores)) * 3, 0.0)
    else:
        score += 1.0  # No results = uncertain, add complexity

    # Normalize to 0-1 range
    normalized = min(score / MAX_RAW_SCORE, 1.0)

    return {
        "raw_score": score,
        "normalized": normalized,
        "tier": "simple" if normalized < 0.3 else "medium" if normalized < 0.6 else "complex",
    }
```

### Window Allocation Strategy

```python
# Continues the module above (uses tiktoken and estimateQueryComplexity).
MAX_TOKENS = 4096
TOKEN_RESERVE = 512  # Reserve for answer generation
SIMPLE_CHUNKS = 2
MEDIUM_CHUNKS = 5
COMPLEX_CHUNKS = 10

def countTokens(text: str) -> int:
    # tiktoken caches encodings, so repeated lookups are cheap
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

def allocateContext(
    query: str,
    retrieval_results: list[dict],
    model: str = "gpt-4o-mini",  # Currently unused; counts assume cl100k_base
) -> list[dict]:
    complexity = estimateQueryComplexity(query, retrieval_results)
    tier = complexity["tier"]

    if tier == "simple":
        top_k = SIMPLE_CHUNKS
    elif tier == "medium":
        top_k = MEDIUM_CHUNKS
    else:
        top_k = COMPLEX_CHUNKS

    # Assumes retrieval_results is already sorted best-first
    allocated = retrieval_results[:top_k]
    total_tokens = sum(countTokens(c["text"]) for c in allocated)

    # If even top-k exceeds budget, pull back
    if total_tokens > MAX_TOKENS - TOKEN_RESERVE:
        allocated = trimToTokenBudget(allocated, MAX_TOKENS - TOKEN_RESERVE)

    return allocated

def trimToTokenBudget(chunks: list[dict], budget: int) -> list[dict]:
    trimmed = []
    for chunk in chunks:
        tokens = countTokens(chunk["text"])
        if budget - tokens >= 0:
            trimmed.append(chunk)
            budget -= tokens
        else:
            break  # Stop at the first chunk that does not fit, preserving rank order
    return trimmed
```

### Failure Modes

Complexity scoring can misclassify queries that are short but semantically dense (e.g., "explain the concept" where "concept" has heavy domain ambiguity). Validate complexity tiers against actual answer quality metrics, not just internal routing labels. Over-allocation wastes tokens on irrelevant content; under-allocation drops key facts.
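One way to soften the short-but-dense failure mode is to add the question-type signal from the complexity table, since a three-word "compare X and Y" query carries more weight than its token count suggests. The sketch below is a minimal version of that signal. The function name `questionTypeSignal` and the keyword list are illustrative assumptions taken from the table's examples, not a tuned classifier.

```python
import re

# Hypothetical keyword list, copied from the table's examples; extend it
# from your own query logs.
SYNTHESIS_KEYWORDS = {"compare", "analyze", "list"}

def questionTypeSignal(query: str) -> float:
    """Return 1.0 if the query contains a synthesis keyword, else 0.0."""
    # Match whole lowercase words so "playlist" does not trigger "list"
    words = set(re.findall(r"[a-z]+", query.lower()))
    return 1.0 if words & SYNTHESIS_KEYWORDS else 0.0
```

If you add this value to the raw score, make sure the normalization divisor accounts for the extra point so the normalized score still spans 0 to 1. Exact keyword matching will miss inflected forms such as "comparing", so treat it as a starting point.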

EXERCISE

Implement complexity scoring and dynamic allocation. Run a set of 20 queries through both fixed and dynamic windows, then measure the difference in token usage and answer correctness between the two. (15 min)
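For the measurement step, a small summary helper may be enough. The sketch below assumes you have already run both strategies over the same queries and graded each answer yourself. `RunResult` and `compareStrategies` are hypothetical names introduced here, and how you judge `correct` (exact match, rubric, human review) is left to you.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    tokens_used: int  # Context tokens sent to the model for this query
    correct: bool     # Your own grading of the answer

def compareStrategies(fixed: list[RunResult], dynamic: list[RunResult]) -> dict:
    """Summarize token usage and correctness for two runs over the same queries."""
    if not fixed or len(fixed) != len(dynamic):
        raise ValueError("both runs must cover the same non-empty query set")

    def avgTokens(runs: list[RunResult]) -> float:
        return sum(r.tokens_used for r in runs) / len(runs)

    def accuracy(runs: list[RunResult]) -> float:
        return sum(1 for r in runs if r.correct) / len(runs)

    return {
        "fixed_avg_tokens": avgTokens(fixed),
        "dynamic_avg_tokens": avgTokens(dynamic),
        "fixed_accuracy": accuracy(fixed),
        "dynamic_accuracy": accuracy(dynamic),
        # Positive delta means the dynamic window answered more queries correctly
        "correctness_delta": accuracy(dynamic) - accuracy(fixed),
    }
```

With only 20 queries, a delta of one or two answers is within noise, so read the result as a direction to investigate, not as proof.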