Context Assembly — RAG Systems: Part 1 (Chapter 15)

Context assembly determines what text the generation model actually sees. Poor assembly produces incoherent generations even with perfect retrieval. This chapter covers strategies for building effective context windows.

Chunk Presentation Ordering

Retrieval typically returns multiple chunks. The order and formatting of these chunks significantly impacts generation quality.

Poorly assembled context presents chunks in an arbitrary, interleaved order:

[Chunk 1 about authentication]
[Chunk 5 about error codes]
[Chunk 3 about configuration]

This jumbled presentation can lead the model to blend unrelated topics, producing answers such as "to configure authentication, look up the matching error code."

Good context assembly groups related content:

def assemble_context(query: str, retrieved_chunks: list[dict]) -> str:
    # Sort chunks by relevance score descending
    sorted_chunks = sorted(
        retrieved_chunks, 
        key=lambda x: x["score"], 
        reverse=True
    )
    
    # Group by section/source for coherent presentation
    grouped = {}
    for chunk in sorted_chunks:
        source = chunk.get("metadata", {}).get("source", "unknown")
        if source not in grouped:
            grouped[source] = []
        grouped[source].append(chunk)
    
    # Build context with clear source annotations
    context_parts = []
    for source, chunks in sorted(
        grouped.items(), 
        key=lambda x: -sum(c["score"] for c in x[1])
    ):
        context_parts.append(f"Source: {source}")
        for ch in chunks:
            context_parts.append(f"- {ch['text']}")
    
    return "\n\n".join(context_parts)
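The jumbled example earlier (chunks 1, 5, 3) suggests a further refinement: restoring document order inside each source group. The sketch below is one way to do that, assuming each chunk's metadata carries a hypothetical integer `position` field (its index within the source document). That field, the sample data, and the function name `assemble_in_document_order` are illustrative assumptions, not part of the function above. It also ranks sources by their single best score rather than the sum, which is a design choice rather than a requirement.

```python
from collections import defaultdict


def assemble_in_document_order(retrieved_chunks: list[dict]) -> str:
    """Group chunks by source, rank sources by best score, and keep
    each source's chunks in their original document order."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for chunk in retrieved_chunks:
        source = chunk.get("metadata", {}).get("source", "unknown")
        grouped[source].append(chunk)

    # Rank sources by their single best chunk so that a source with many
    # weak chunks does not outrank one with a single strong match.
    ranked_sources = sorted(
        grouped.items(),
        key=lambda item: max(c["score"] for c in item[1]),
        reverse=True,
    )

    context_parts = []
    for source, chunks in ranked_sources:
        context_parts.append(f"Source: {source}")
        # "position" is a hypothetical metadata field; chunks without it sort as 0.
        ordered = sorted(chunks, key=lambda c: c.get("metadata", {}).get("position", 0))
        for ch in ordered:
            context_parts.append(f"- {ch['text']}")
    return "\n\n".join(context_parts)


# Made-up sample data for illustration only.
sample_chunks = [
    {"text": "Error 401 means the token expired.", "score": 0.71,
     "metadata": {"source": "errors.md", "position": 5}},
    {"text": "Set client_id in config.yaml.", "score": 0.88,
     "metadata": {"source": "auth.md", "position": 3}},
    {"text": "OAuth2 uses the authorization code flow.", "score": 0.80,
     "metadata": {"source": "auth.md", "position": 1}},
]
print(assemble_in_document_order(sample_chunks))
```

With this sample, both auth.md chunks appear first, in document order, followed by the errors.md chunk.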

Context Length Management

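Every model has a maximum context window, commonly somewhere between 4K and 128K tokens, with some models offering more. Retrieved chunks add up quickly: 20 chunks of 500 tokens each consume 10,000 tokens before the prompt instructions or the generated answer are counted.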

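# your_rag_library is a placeholder; substitute the context-assembly
# utility provided by your own RAG framework.
from your_rag_library import ContextAssembler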

assembler = ContextAssembler(
    max_tokens=6000,  # Reserve tokens for generation
    overlap_tokens=100  # Avoid cutting mid-sentence
)

context = assembler.assemble(
    query="How do I configure OAuth2 single sign-on?",
    chunks=all_retrieved_chunks,
    strategy="auto"  # Automatically selects best strategy
)

print(f"Context uses {context.token_count} tokens")
# Example output: Context uses 5847 tokens
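Since `your_rag_library` appears to be a placeholder name rather than a published package, the following is a minimal sketch of what a token-budgeted assembler might do internally: visit chunks in descending score order and skip any chunk that would push the total past the budget. The function name `pack_chunks` and the whitespace-based `count_tokens` are assumptions for illustration. A real system would count tokens with the target model's own tokenizer.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_chunks(
    chunks: list[dict],
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> tuple[list[dict], int]:
    """Greedily keep the highest-scoring chunks that fit within max_tokens."""
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = counter(chunk["text"])
        if used + cost > max_tokens:
            continue  # Skip chunks that do not fit; a smaller one may still fit.
        selected.append(chunk)
        used += cost
    return selected, used


# Made-up chunks whose "token" counts are easy to verify by eye.
demo_chunks = [
    {"text": "one two three four five", "score": 0.9},  # 5 tokens
    {"text": "six seven eight nine", "score": 0.8},     # 4 tokens
    {"text": "ten eleven", "score": 0.7},               # 2 tokens
]
selected, used = pack_chunks(demo_chunks, max_tokens=8)
print([c["score"] for c in selected], used)
```

Here the second chunk is skipped because it would bring the total to 9 tokens, while the smaller third chunk still fits. Skipping rather than stopping is a design choice: it fills the budget more completely at the cost of occasionally preferring a lower-scored chunk.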

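Typical responses run roughly 300-1000 tokens, so reserving 1000-2000 tokens for output leaves headroom for longer answers. The budget for retrieved context is whatever remains of the window after subtracting that reservation and the tokens used by the system prompt and the query.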
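The arithmetic behind that reservation can be made explicit. The sketch below assumes a hypothetical 8,192-token window and 300 tokens of prompt overhead (system prompt, instructions, and the user query). Both figures, and the function name `context_budget`, are illustrative and not properties of any particular model.

```python
def context_budget(window_tokens: int, prompt_tokens: int, reserved_output_tokens: int) -> int:
    """Tokens left for retrieved chunks after fixed costs are subtracted."""
    budget = window_tokens - prompt_tokens - reserved_output_tokens
    if budget <= 0:
        raise ValueError("Window is too small for the prompt and reserved output")
    return budget


# Hypothetical figures: 8,192-token window, 300-token prompt, 1,500 reserved for output.
budget = context_budget(window_tokens=8192, prompt_tokens=300, reserved_output_tokens=1500)
chunk_tokens = 500
max_chunks = budget // chunk_tokens
print(budget, max_chunks)
```

Under these assumptions 6,392 tokens remain, enough for 12 chunks of 500 tokens, so the 20-chunk retrieval described earlier would need to be trimmed.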

Deduplication and Redundancy Removal

The same information often appears across multiple chunks. Sending duplicate content wastes tokens and crowds out distinct information that could otherwise fit in the window.

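from sklearn.metrics.pairwise import cosine_similarity

# embed_model is assumed to be defined elsewhere: any object whose
# encode(list[str]) returns one vector per text, for example a
# sentence-transformers SentenceTransformer instance.

def remove_duplicate_chunks(chunks: list[dict], similarity_threshold: float = 0.85) -> list[dict]:
    """Remove chunks that are too similar to an earlier, already-kept chunk.

    The first occurrence wins, so pass chunks sorted by relevance if the
    highest-scoring copy should survive.
    """
    if not chunks:
        return []
    embeddings = embed_model.encode([c["text"] for c in chunks])
    kept_indices = []  # Positions in `chunks` of the chunks kept so far

    for i in range(len(chunks)):
        is_duplicate = False
        # Compare against kept chunks by position so each embedding lookup is O(1)
        for j in kept_indices:
            similarity = cosine_similarity([embeddings[i]], [embeddings[j]])[0][0]
            if similarity > similarity_threshold:
                is_duplicate = True
                break

        if not is_duplicate:
            kept_indices.append(i)

    return [chunks[i] for i in kept_indices]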

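A similarity_threshold of 0.85 is a reasonable starting point for removing near-duplicates while preserving semantically distinct content. Similarity scores are not comparable across embedding models, so tune the value on a sample of your own chunks.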
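To see how the threshold changes what survives, the sketch below runs the same greedy first-seen-wins filter over precomputed toy vectors instead of calling an embedding model. The vectors and the helper names `cosine` and `dedupe_by_vectors` are made up for illustration. Real embeddings have hundreds of dimensions and different score distributions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dedupe_by_vectors(vectors: list[list[float]], threshold: float) -> list[int]:
    """Return indices of vectors kept by a greedy first-seen-wins filter."""
    kept: list[int] = []
    for i, vec in enumerate(vectors):
        # Keep the vector only if it is not too similar to any kept vector.
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


toy_vectors = [
    [1.0, 0.0],  # chunk 0
    [0.9, 0.1],  # chunk 1: nearly the same direction as chunk 0
    [0.6, 0.8],  # chunk 2: related to chunk 0 but distinct
    [0.0, 1.0],  # chunk 3: orthogonal to chunk 0
]
strict = dedupe_by_vectors(toy_vectors, threshold=0.85)
aggressive = dedupe_by_vectors(toy_vectors, threshold=0.5)
print(strict, aggressive)
```

At 0.85 only the near-copy (chunk 1) is dropped. Lowering the threshold to 0.5 also removes chunk 2, which shows how an overly low value starts discarding related but distinct content.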