RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Advanced RAG — Chunking, Retrieval, Re-ranking

16. Dynamic Context Windows

Chapter 16 of 24 · 20 min
KEY INSIGHT

Context window allocation should adapt to query complexity rather than using fixed limits.

### Fixed vs. Dynamic Allocation

A naive RAG system uses a fixed token budget (e.g., 4096 tokens) or a fixed chunk count (e.g., the top 5 chunks) for the context window. This fails when simple queries need 2 chunks and complex queries need 15: the fixed setting over-fills the first and starves the second. Dynamic context windows scale the retrieval budget with query complexity.

### Complexity Scoring

Query complexity can be estimated from several signals:

| Signal | Metric | Complexity Indicator |
|---|---|---|
| Token count | `len(query.split())` (word count) or a tokenizer count | Longer query → more constraints to satisfy |
| Sub-query count | Number of facets detected | More facets → more context needed |
| Entity count | NER-extracted named entities | More entities → more cross-referencing needed |
| Question type | Keywords such as "compare", "analyze", "list" | Compare/list → multi-document synthesis |
| Retrieval score variance | Std dev of retrieval scores | High variance → one clear match, less context needed; low variance → no clear winner, more context needed |

The function below implements only the token-count and score-variance signals. The raw score is normalized by the maximum those two signals can reach (3.0) so that all three tiers are attainable; raise the divisor if you add more signals.

```python
import numpy as np
import tiktoken

# Highest raw score the two implemented signals can produce (2.0 + 1.0).
MAX_RAW_SCORE = 3.0

def estimateQueryComplexity(query: str, retrieval_results: list[dict]) -> dict:
    """Score query complexity from query length and the spread of retrieval scores."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = len(encoding.encode(query))
    score = 0.0

    # Signal 1: token count, capped at 2 points (reached at 200 tokens)
    score += min(tokens / 100, 2.0)

    # Signal 2: spread of the retrieval scores (assumed to lie roughly in 0-1).
    # Low spread means no chunk stands out, so the query is treated as harder.
    if retrieval_results:
        scores = [r.get("score", 0) for r in retrieval_results]
        score += max(1.0 - float(np.std(scores)) * 3, 0.0)
    else:
        score += 1.0  # No results = uncertain, add complexity

    # Normalize to the 0-1 range
    normalized = min(score / MAX_RAW_SCORE, 1.0)
    return {
        "raw_score": score,
        "normalized": normalized,
        "tier": "simple" if normalized < 0.3 else "medium" if normalized < 0.6 else "complex",
    }
```

### Window Allocation Strategy

```python
import tiktoken

# Uses estimateQueryComplexity from the previous block.
MAX_TOKENS = 4096
TOKEN_RESERVE = 512  # Reserve for answer generation
SIMPLE_CHUNKS = 2
MEDIUM_CHUNKS = 5
COMPLEX_CHUNKS = 10

ENCODING = tiktoken.get_encoding("cl100k_base")

def countTokens(text: str) -> int:
    """Count tokens in a piece of text with the shared encoding."""
    return len(ENCODING.encode(text))

def allocateContext(
    query: str,
    retrieval_results: list[dict],
    model: str = "gpt-4o-mini",
) -> list[dict]:
    """Pick how many chunks to pass on, based on the query's complexity tier."""
    # `model` is not used yet: every count uses cl100k_base.
    # retrieval_results is assumed to be sorted best-first.
    complexity = estimateQueryComplexity(query, retrieval_results)
    tier = complexity["tier"]

    if tier == "simple":
        top_k = SIMPLE_CHUNKS
    elif tier == "medium":
        top_k = MEDIUM_CHUNKS
    else:
        top_k = COMPLEX_CHUNKS

    allocated = retrieval_results[:top_k]
    total_tokens = sum(countTokens(c["text"]) for c in allocated)

    # If even top-k exceeds the budget, pull back
    if total_tokens > MAX_TOKENS - TOKEN_RESERVE:
        allocated = trimToTokenBudget(allocated, MAX_TOKENS - TOKEN_RESERVE)

    return allocated

def trimToTokenBudget(chunks: list[dict], budget: int) -> list[dict]:
    """Keep chunks in rank order until the next one would exceed the budget."""
    trimmed = []
    for chunk in chunks:
        tokens = countTokens(chunk["text"])
        if tokens > budget:
            break  # Stop at the first chunk that does not fit
        trimmed.append(chunk)
        budget -= tokens
    return trimmed
```

### Failure Modes

Complexity scoring can misclassify queries that are short but semantically dense (e.g., "explain the concept", where "concept" is highly ambiguous in the domain). Validate complexity tiers against actual answer-quality metrics, not just the internal routing labels. Over-allocation wastes tokens on irrelevant content; under-allocation drops key facts.
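
Token count alone under-scores short queries, which is the failure mode described above. One of the signals from the table that the chapter's code does not implement, question type, can be sketched as a simple keyword check. This is a minimal sketch, not the chapter's method: the function name `questionTypeSignal` and the keyword list are assumptions made for illustration, and a real system would likely need a broader list or a classifier.

```python
import re

# Hypothetical keyword list (an assumption); tune it against your own query logs.
SYNTHESIS_KEYWORDS = frozenset({"compare", "contrast", "analyze", "list", "versus", "vs"})

def questionTypeSignal(query: str) -> float:
    """Return 1.0 if the query contains a multi-document synthesis keyword, else 0.0."""
    # Lowercase and split into alphabetic words so "Compare" and "vs." both match.
    words = set(re.findall(r"[a-z]+", query.lower()))
    return 1.0 if words & SYNTHESIS_KEYWORDS else 0.0

print(questionTypeSignal("Compare the RTX 4090 and the A100"))  # 1.0
print(questionTypeSignal("What is VRAM?"))                      # 0.0
```

If this signal is added to the raw complexity score, the maximum attainable raw score rises by one point, so the normalization divisor should be adjusted to match.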

EXERCISE

Implement complexity scoring and dynamic allocation. Run a set of 20 queries through both fixed and dynamic windows, measuring token usage and answer correctness delta. (15 min)
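
One possible way to tabulate the exercise results is sketched below. It assumes you have already run each query through both pipelines and recorded, per query, the context tokens used and whether the answer was judged correct. The `RunResult` type and `compareWindows` function are hypothetical names introduced here, and how correctness is judged is left to you.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """Outcome of one query under one allocation strategy (hypothetical record type)."""
    tokens_used: int
    correct: bool

def compareWindows(fixed: list[RunResult], dynamic: list[RunResult]) -> dict:
    """Summarize token usage and accuracy for paired fixed vs. dynamic runs."""
    if not fixed or len(fixed) != len(dynamic):
        raise ValueError("Need the same non-empty set of queries for both strategies")
    n = len(fixed)
    accuracy_fixed = sum(r.correct for r in fixed) / n
    accuracy_dynamic = sum(r.correct for r in dynamic) / n
    return {
        "mean_tokens_fixed": sum(r.tokens_used for r in fixed) / n,
        "mean_tokens_dynamic": sum(r.tokens_used for r in dynamic) / n,
        "accuracy_fixed": accuracy_fixed,
        "accuracy_dynamic": accuracy_dynamic,
        # Positive delta means the dynamic window answered more queries correctly.
        "accuracy_delta": accuracy_dynamic - accuracy_fixed,
    }
```

With only 20 queries, a single changed answer moves accuracy by 5 percentage points, so treat small deltas as noise rather than evidence.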
