Advanced RAG — Chunking, Retrieval, Re-ranking

17. Context Compression

Chapter 17 of 24 · 20 min

KEY INSIGHT

Irrelevant tokens in retrieved context dilute generation quality; compressing each chunk to its relevant sentences improves the signal-to-noise ratio.

### The Noise-in-Context Problem

Retrieved chunks often contain the answer alongside surrounding text that can misdirect or confuse the LLM. Context compression uses an LLM to extract only the relevant sentences from each chunk, which shrinks the prompt passed to the generation step and improves relevance.

### LLM-Based Compression

```python
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def compressChunk(chunk_text: str, query: str, model: str = "gpt-4o-mini") -> str:
    """Return only the sentences of chunk_text that help answer query."""
    system_prompt = (
        "You are a precise technical assistant. Given a document passage "
        "and a user question, extract ONLY the sentences directly relevant "
        "to answering the question. Discard background, commentary, and "
        "tangential content. Preserve code blocks if relevant. "
        "Return the compressed passage with no additional text. "
        "If nothing in the passage is relevant, return an empty string."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Question: {query}\n\nPassage: {chunk_text}"},
        ],
        temperature=0.0,
        max_tokens=512,
    )
    # message.content can be None, so fall back to an empty string
    return (response.choices[0].message.content or "").strip()


def compressContext(
    chunks: list[dict],
    query: str,
    model: str = "gpt-4o-mini",
) -> list[dict]:
    """Compress every chunk and drop those with no relevant content."""
    compressed = []
    for chunk in chunks:
        compressed_text = compressChunk(chunk["text"], query, model)
        if compressed_text:  # Only include non-empty results
            compressed.append({
                **chunk,
                "text": compressed_text,
                "original_length": len(chunk["text"]),
                "compressed_length": len(compressed_text),
            })
    return compressed
```

### Token Savings Measurement

```python
def measureCompressionSavings(
    original_chunks: list[dict],
    compressed_chunks: list[dict],
) -> dict:
    """Estimate token savings with a rough 4-characters-per-token heuristic."""
    # Count every original chunk, including those dropped entirely by compression
    original_chars = sum(len(c["text"]) for c in original_chunks)
    compressed_chars = sum(c["compressed_length"] for c in compressed_chunks)

    # Rough estimate: 4 chars per token
    orig_est = original_chars // 4
    comp_est = compressed_chars // 4
    savings = (orig_est - comp_est) / orig_est if orig_est > 0 else 0
    return {
        "original_tokens_estimate": orig_est,
        "compressed_tokens_estimate": comp_est,
        "savings_percent": round(savings * 100, 1),
    }
```

### Condensing via Extract-and-Synthesize

A two-pass approach: extract the relevant facts first, then synthesize them into a dense paragraph.

```python
def condenseContext(chunks: list[dict], query: str) -> str:
    # Pass 1: Extract all relevant snippets, skipping chunks that yield nothing
    extracted = [compressChunk(c["text"], query) for c in chunks]
    combined = "\n\n".join(text for text in extracted if text)

    # Pass 2: Synthesize into coherent context
    synthesis_prompt = (
        "Synthesize the following extracted passages into a single "
        "coherent technical summary that answers the question. "
        "Preserve all factual claims. Output only the summary."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": synthesis_prompt},
            {"role": "user", "content": f"Question: {query}\n\n{combined}"},
        ],
        temperature=0.0,
        max_tokens=1024,
    )
    return (response.choices[0].message.content or "").strip()
```

### Failure Modes

Compression can go too far, removing context needed to resolve cross-sentence references or pronouns. Always verify that the compressed context still includes the entity names the downstream generation step needs. Latency also increases significantly with per-chunk LLM calls; batching the compression requests or streaming the API response mitigates this.
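
The two failure modes above can be guarded against in one place. The following is a minimal sketch, not a definitive implementation. It assumes `compress_fn` is any function with the same `(chunk_text, query)` signature as `compressChunk`, and that `entities` is a caller-supplied list of names that must survive compression (how to obtain that list is left open here). The helper names `missing_entities` and `compress_chunks_safely` are hypothetical. Per-chunk calls run in a thread pool to reduce wall-clock latency, and a chunk falls back to its original text when compression dropped a required entity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def missing_entities(original: str, compressed: str, entities: list[str]) -> list[str]:
    """Return entity names present in the original chunk but absent after compression."""
    original_lower = original.lower()
    compressed_lower = compressed.lower()
    return [
        name for name in entities
        if name.lower() in original_lower and name.lower() not in compressed_lower
    ]


def compress_chunks_safely(
    chunks: list[dict],
    query: str,
    compress_fn: Callable[[str, str], str],
    entities: list[str],
    max_workers: int = 8,
) -> list[dict]:
    """Compress chunks concurrently; keep the original text if an entity was dropped."""

    def compress_one(chunk: dict) -> Optional[dict]:
        compressed = compress_fn(chunk["text"], query)
        if not compressed:
            return None  # The compressor judged this chunk irrelevant
        dropped = missing_entities(chunk["text"], compressed, entities)
        # Fall back to the uncompressed text rather than risk dangling pronouns
        text = chunk["text"] if dropped else compressed
        return {**chunk, "text": text, "dropped_entities": dropped}

    # Threads suit this workload because each call mostly waits on the API.
    # executor.map preserves the input order of the chunks.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(compress_one, chunks))
    return [r for r in results if r is not None]
```

Passing `compressChunk` as `compress_fn` connects this to the code earlier in the chapter. The case-insensitive substring match is deliberately simple; it will miss aliases and abbreviations, so treat a clean result as a necessary check rather than proof that nothing important was lost.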

EXERCISE

Implement compressContext, then use an LLM-as-judge factual consistency check to evaluate whether answers generated from compressed context make the same factual claims as answers generated from uncompressed context. (15 min)
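
One possible starting point for the evaluation half of the exercise is sketched below. Everything in it is an assumption made for illustration: `judge` is a hypothetical callable that takes a system prompt and a user message and returns the model's raw text (for example, a thin wrapper around a chat completion call), and the prompt wording and JSON schema are invented here, not prescribed by the chapter.

```python
import json
from typing import Callable

# Hypothetical judge instructions; adjust the wording and schema to your model
JUDGE_PROMPT = (
    "You are checking factual consistency. List each factual claim made in "
    "ANSWER_A and state whether ANSWER_B also supports it. Respond with JSON "
    'only, in the form {"claims": [{"claim": "...", "supported": true}]}.'
)


def consistency_score(
    answer_uncompressed: str,
    answer_compressed: str,
    judge: Callable[[str, str], str],
) -> float:
    """Fraction of claims in the uncompressed-context answer that the
    compressed-context answer also supports, according to the judge."""
    user_message = (
        f"ANSWER_A:\n{answer_uncompressed}\n\nANSWER_B:\n{answer_compressed}"
    )
    raw = judge(JUDGE_PROMPT, user_message)
    # Assumes the judge returned bare JSON; real model output may need cleanup
    claims = json.loads(raw)["claims"]
    if not claims:
        return 1.0  # No claims to lose
    supported = sum(1 for claim in claims if claim["supported"])
    return supported / len(claims)
```

A score below 1.0 points to claims that were lost or changed once the context was compressed. Running the check over several queries gives a more reliable picture than a single comparison, since judge output can vary.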