15. Context Assembly
Context assembly determines what text the generation model actually sees. Poor assembly produces incoherent generations even with perfect retrieval. This chapter covers strategies for building effective context windows.
Chunk Presentation Ordering
Retrieval typically returns multiple chunks. The order and formatting of these chunks significantly affect generation quality.
Bad context assembly presents chunks in an arbitrary order, which leads to confusing output:
[Chunk 1 about authentication]
[Chunk 5 about error codes]
[Chunk 3 about configuration]
This jumbled presentation can cause the model to merge unrelated topics into one answer, such as "to configure authentication errors, check your error code configuration."
Good context assembly groups related content:
def assemble_context(query: str, retrieved_chunks: list[dict]) -> str:
    # Sort chunks by relevance score descending
    sorted_chunks = sorted(
        retrieved_chunks,
        key=lambda x: x["score"],
        reverse=True
    )
    # Group by section/source for coherent presentation
    grouped = {}
    for chunk in sorted_chunks:
        source = chunk.get("metadata", {}).get("source", "unknown")
        if source not in grouped:
            grouped[source] = []
        grouped[source].append(chunk)
    # Build context with clear source annotations
    context_parts = []
    for source, chunks in sorted(
        grouped.items(),
        key=lambda x: -sum(c["score"] for c in x[1])
    ):
        context_parts.append(f"Source: {source}")
        for ch in chunks:
            context_parts.append(f"- {ch['text']}")
    return "\n\n".join(context_parts)
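Note that assemble_context ranks sources by their summed score, which favors sources that contribute many chunks. As a point of comparison, here is a minimal sketch that ranks sources by their single best chunk instead. The helper name order_sources_by_best_chunk and the sample data are hypothetical and only serve to illustrate the difference.

```python
from collections import defaultdict


def order_sources_by_best_chunk(chunks: list[dict]) -> list[tuple[str, list[dict]]]:
    """Group chunks by source and rank sources by their highest-scoring chunk."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for chunk in chunks:
        source = chunk.get("metadata", {}).get("source", "unknown")
        grouped[source].append(chunk)

    # Within each source, put the most relevant chunk first.
    for source_chunks in grouped.values():
        source_chunks.sort(key=lambda c: c["score"], reverse=True)

    # Rank sources by their single best chunk rather than by summed score.
    return sorted(grouped.items(), key=lambda item: item[1][0]["score"], reverse=True)


# Hypothetical sample data for illustration only.
sample_chunks = [
    {"text": "Set the client ID.", "score": 0.55, "metadata": {"source": "config.md"}},
    {"text": "Enable SSO.", "score": 0.50, "metadata": {"source": "config.md"}},
    {"text": "OAuth2 flow overview.", "score": 0.90, "metadata": {"source": "auth.md"}},
]

ordered = order_sources_by_best_chunk(sample_chunks)
print([source for source, _ in ordered])
# ['auth.md', 'config.md']
```

With summed scores, config.md (0.55 + 0.50 = 1.05) would be listed before auth.md (0.90), even though the single most relevant chunk comes from auth.md. Which ordering works better likely depends on the corpus.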
Context Length Management
Models have maximum context windows, ranging from about 4K tokens to 128K or more depending on the model. Assembling 20 chunks of 500 tokens each consumes 10,000 tokens before generation even starts.
# NOTE: your_rag_library and ContextAssembler are placeholders, not a real package;
# substitute the equivalent from the framework you use.
from your_rag_library import ContextAssembler
assembler = ContextAssembler(
    max_tokens=6000,  # Cap the context so tokens remain for generation
    overlap_tokens=100  # Avoid cutting mid-sentence
)
context = assembler.assemble(
    query="How do I configure OAuth2 single sign-on?",
    chunks=all_retrieved_chunks,
    strategy="auto"  # Automatically selects best strategy
)
print(f"Context uses {context.token_count} tokens")
# Example output: Context uses 5847 tokens
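Responses typically run 300-1000 tokens, so reserve 1000-2000 tokens of the context window for generation output. The system prompt and the user query also count against the window, so subtract them from the budget as well.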
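The budgeting arithmetic can be sketched as a small greedy packer. This is an illustration under stated assumptions, not the behavior of any particular library: fit_to_budget and approx_token_count are hypothetical names, and the word-count token estimate is a crude stand-in for the tokenizer of the model you actually use.

```python
from typing import Callable


def approx_token_count(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(
    chunks: list[dict],
    context_window: int,
    reserved_for_output: int,
    prompt_overhead: int = 0,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> tuple[list[dict], int]:
    """Greedily keep the highest-scoring chunks that fit in the remaining budget."""
    budget = context_window - reserved_for_output - prompt_overhead
    selected: list[dict] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if used + cost > budget:
            continue  # Skip this chunk; a smaller, lower-ranked one may still fit.
        selected.append(chunk)
        used += cost
    return selected, used


# Hypothetical sample data: texts of 5, 8, and 3 "tokens".
demo_chunks = [
    {"text": "a b c d e", "score": 0.9},
    {"text": "f g h i j k l m", "score": 0.8},
    {"text": "n o p", "score": 0.7},
]

# Budget = 20 - 10 - 2 = 8 tokens.
selected, used = fit_to_budget(
    demo_chunks, context_window=20, reserved_for_output=10, prompt_overhead=2
)
print([c["score"] for c in selected], used)
# [0.9, 0.7] 8
```

The 8-token chunk is skipped because it would exceed the budget after the first chunk is added, while the smaller third chunk still fits. Stopping at the first chunk that does not fit is a simpler alternative that preserves strict relevance order.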
Deduplication and Redundancy Removal
The same information often appears across multiple chunks. Sending duplicate content wastes tokens and confuses the model.
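from sklearn.metrics.pairwise import cosine_similarity

# embed_model is assumed to be defined elsewhere and to expose an
# encode(list_of_texts) method returning one vector per text
# (for example, a sentence-transformers model).

def remove_duplicate_chunks(chunks: list[dict], similarity_threshold: float = 0.85) -> list[dict]:
    """Remove chunks that are too similar to an already kept chunk."""
    if not chunks:
        return []
    embeddings = embed_model.encode([c["text"] for c in chunks])
    kept_indices = []  # Positions of the chunks kept so far
    for i in range(len(chunks)):
        is_duplicate = False
        # Compare by index so each chunk is matched with its own embedding
        for j in kept_indices:
            similarity = cosine_similarity([embeddings[i]], [embeddings[j]])[0][0]
            if similarity > similarity_threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept_indices.append(i)
    return [chunks[i] for i in kept_indices]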
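A similarity_threshold of 0.85 is a reasonable starting point for removing near-duplicates while preserving semantically distinct content. The right value depends on the embedding model, so tune it on your own data. Because the first chunk of each near-duplicate pair is the one kept, pass in chunks sorted by relevance score.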
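To see how the threshold changes what survives without loading an embedding model, here is a small sketch that uses hand-made 2-D vectors in place of real embeddings. The names cosine and dedupe_by_vectors and the toy vectors are hypothetical and purely illustrative. Real embeddings have hundreds of dimensions and their similarity distribution will differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def dedupe_by_vectors(vectors: list[list[float]], threshold: float) -> list[int]:
    """Return indices of vectors kept after greedy near-duplicate removal."""
    kept: list[int] = []
    for i, vec in enumerate(vectors):
        # Keep the vector only if it is not too similar to any kept vector.
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Hand-made vectors standing in for embeddings (illustrative only).
toy_vectors = [
    [1.0, 0.0],  # chunk 0
    [0.9, 0.1],  # chunk 1: almost the same direction as chunk 0
    [0.6, 0.8],  # chunk 2: related to chunk 0 but distinct
    [0.0, 1.0],  # chunk 3: unrelated to chunk 0
]

print(dedupe_by_vectors(toy_vectors, threshold=0.85))
# [0, 2, 3]
print(dedupe_by_vectors(toy_vectors, threshold=0.50))
# [0, 3]
```

At 0.85 only the near-copy (chunk 1) is dropped. Lowering the threshold to 0.50 also discards chunk 2, which is related but carries different content, showing how an overly aggressive threshold can remove useful context.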
Implement a context assembler that receives the top 20 chunks, removes duplicates above 0.85 similarity, groups the remainder by source, orders the groups by maximum relevance score, and fits the result within a 4000-token limit.