RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
RAG Systems: Part 1

15. Context Assembly

Chapter 15 of 22 · 20 min
KEY INSIGHT

Context assembly quality matters as much as retrieval quality. Well-organized context reduces hallucinations caused by confusing source ordering.

Context assembly determines what text the generation model actually sees. Poor assembly produces incoherent generations even with perfect retrieval. This chapter covers strategies for building effective context windows.

Chunk Presentation Ordering

Retrieval typically returns multiple chunks. The order and formatting of these chunks significantly impacts generation quality.

Bad context assembly produces confusing output:

[Chunk 1 about authentication]
[Chunk 5 about error codes]
[Chunk 3 about configuration]

This jumbled presentation can cause the model to stitch unrelated topics together, producing output such as "to configure authentication errors, check your neural network training setup."

Good context assembly groups related content:

def assemble_context(query: str, retrieved_chunks: list[dict]) -> str:
    # Sort chunks by relevance score descending
    sorted_chunks = sorted(
        retrieved_chunks, 
        key=lambda x: x["score"], 
        reverse=True
    )
    
    # Group by section/source for coherent presentation
    grouped = {}
    for chunk in sorted_chunks:
        source = chunk.get("metadata", {}).get("source", "unknown")
        if source not in grouped:
            grouped[source] = []
        grouped[source].append(chunk)
    
    # Build context with clear source annotations
    context_parts = []
    for source, chunks in sorted(
        grouped.items(), 
        key=lambda x: -sum(c["score"] for c in x[1])
    ):
        context_parts.append(f"Source: {source}")
        for ch in chunks:
            context_parts.append(f"- {ch['text']}")
    
    return "\n\n".join(context_parts)
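
To see the grouping in action, the sketch below runs a condensed, self-contained variant of the function above on three sample chunks. The function name assemble_grouped, the file names, the scores, and the chunk texts are all made up for illustration; only the grouping and ordering logic follows the code above.

```python
from collections import defaultdict


def assemble_grouped(chunks: list[dict]) -> str:
    """Condensed variant of assemble_context: group by source, strongest group first."""
    grouped = defaultdict(list)
    # Visit chunks from most to least relevant so each group is internally sorted
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        source = chunk.get("metadata", {}).get("source", "unknown")
        grouped[source].append(chunk)

    parts = []
    # Order groups by their summed relevance, highest first
    for source, members in sorted(
        grouped.items(), key=lambda kv: -sum(c["score"] for c in kv[1])
    ):
        parts.append(f"Source: {source}")
        parts.extend(f"- {c['text']}" for c in members)
    return "\n\n".join(parts)


# Hypothetical sample data: sources, scores, and texts are illustrative only
sample_chunks = [
    {"text": "Set AUTH_MODE=oauth2 to enable OAuth2.", "score": 0.91,
     "metadata": {"source": "auth.md"}},
    {"text": "Error 401 means the token is missing or expired.", "score": 0.62,
     "metadata": {"source": "errors.md"}},
    {"text": "Tokens are validated against the issuer URL.", "score": 0.78,
     "metadata": {"source": "auth.md"}},
]

context = assemble_grouped(sample_chunks)
print(context)
```

Both auth.md chunks end up adjacent under one source label, ahead of the lower-scoring errors.md chunk, even though they were not adjacent in the retrieval results.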

Context Length Management

Models have maximum context windows (commonly 4K-128K tokens, depending on the model). Assembling 20 chunks of 500 tokens each consumes 10,000 tokens before generation even starts.

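# your_rag_library and ContextAssembler are placeholders for whatever
# assembly layer you use, not a published package
from your_rag_library import ContextAssembler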

assembler = ContextAssembler(
    max_tokens=6000,  # Reserve tokens for generation
    overlap_tokens=100  # Avoid cutting mid-sentence
)

context = assembler.assemble(
    query="How do I configure OAuth2 single sign-on?",
    chunks=all_retrieved_chunks,
    strategy="auto"  # Automatically selects best strategy
)

print(f"Context uses {context.token_count} tokens")
# Context uses 5847 tokens
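
Since the assembler above is a placeholder, here is a minimal sketch of one thing such a component might do internally: greedily keep the highest-scoring chunks that fit a token budget. The names fit_to_budget and rough_token_count are hypothetical, and the 4-characters-per-token estimate is a rough rule of thumb for English text; swap in your model's real tokenizer for accurate counts.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption, not a real tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(
    chunks: list[dict],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> tuple[list[dict], int]:
    """Greedily keep the highest-scoring chunks whose total cost stays within max_tokens."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue  # this chunk does not fit, but a smaller one later still might
        selected.append(chunk)
        used += cost
    return selected, used


# Illustrative chunks with estimated costs of about 100, 200, and 50 tokens
chunks = [
    {"text": "a" * 400, "score": 0.9},
    {"text": "b" * 800, "score": 0.8},
    {"text": "c" * 200, "score": 0.7},
]

selected, used = fit_to_budget(chunks, max_tokens=160)
print([c["score"] for c in selected], used)  # [0.9, 0.7] 150
```

Skipping an oversized chunk instead of stopping lets smaller, lower-ranked chunks fill the remaining budget. Stopping at the first chunk that does not fit is a stricter alternative that never promotes a lower-ranked chunk past a higher-ranked one.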

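Responses in RAG applications are often 300-1000 tokens long, so reserving 1000-2000 tokens for generation output leaves some headroom. Whatever remains of the context window after that reservation, minus the instructions and the user's question, is the budget for retrieved chunks.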
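
The reservation arithmetic can be sketched as a small helper. The function name context_budget and the prompt_overhead parameter (tokens spent on instructions and the user's question) are assumptions added for illustration, and the 8192-token window is just an example value.

```python
def context_budget(
    context_window: int, reserved_for_output: int, prompt_overhead: int = 0
) -> int:
    """Tokens left for retrieved chunks after reserving output and prompt space."""
    budget = context_window - reserved_for_output - prompt_overhead
    if budget <= 0:
        raise ValueError("Reservations exceed the model's context window")
    return budget


# Example: 8192-token window, 1500 tokens reserved for output, 300 for the prompt
budget = context_budget(8192, reserved_for_output=1500, prompt_overhead=300)
chunks_that_fit = budget // 500  # whole 500-token chunks that fit in the budget

print(budget, chunks_that_fit)  # 6392 12
```

Under these example numbers, only 12 of the 20 retrieved 500-token chunks fit, so the rest must be dropped, deduplicated, or shortened.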

Deduplication and Redundancy Removal

The same information often appears across multiple chunks. Sending duplicate content wastes tokens and confuses the model.

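from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: any embedding model with an encode() method works here;
# the model name below is only an example.
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

def remove_duplicate_chunks(chunks: list[dict], similarity_threshold: float = 0.85) -> list[dict]:
    """Remove chunks that are too similar to an earlier, already-kept chunk."""
    if not chunks:
        return []
    embeddings = embed_model.encode([c["text"] for c in chunks])
    kept_indices = []  # positions in `chunks` of the chunks we keep

    for i in range(len(chunks)):
        is_duplicate = False
        # Compare against kept chunks by index instead of calling chunks.index()
        for j in kept_indices:
            similarity = cosine_similarity([embeddings[i]], [embeddings[j]])[0][0]
            if similarity > similarity_threshold:
                is_duplicate = True
                break

        if not is_duplicate:
            kept_indices.append(i)

    return [chunks[i] for i in kept_indices]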

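A similarity_threshold of 0.85 is a reasonable starting point for removing near-duplicates while preserving semantically distinct content; tune it for your embedding model and corpus. Because the function keeps the first chunk it sees from each set of near-duplicates, pass in chunks already sorted by relevance so the best-scoring copy survives.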
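
To get a feel for how the threshold behaves without loading an embedding model, the sketch below applies the same keep-first rule to hand-made 2D vectors with NumPy. The function name dedupe_by_embedding and the toy vectors are illustrative assumptions; real embeddings have hundreds of dimensions, and the sketch assumes no all-zero rows.

```python
import numpy as np


def dedupe_by_embedding(embeddings: np.ndarray, threshold: float = 0.85) -> list[int]:
    """Return indices of rows to keep; a row is dropped if it is too similar to a kept row."""
    # Normalize rows so a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # pairwise cosine similarity matrix

    kept: list[int] = []
    for i in range(len(normed)):
        if all(sims[i, j] <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors: rows 0 and 1 point in almost the same direction, row 2 is orthogonal
toy = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])

kept = dedupe_by_embedding(toy)                       # row 1 is dropped as a near-duplicate
kept_strict = dedupe_by_embedding(toy, threshold=0.999)  # threshold too high to catch it

print(kept, kept_strict)  # [0, 2] [0, 1, 2]
```

Rows 0 and 1 have a cosine similarity of roughly 0.995, so the default threshold treats them as duplicates while a threshold of 0.999 lets both through.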

EXERCISE

Implement a context assembler that receives top 20 chunks, removes duplicates above 0.85 similarity, groups by source, orders groups by max relevance score, and fits within a 4000-token limit.
