KEY INSIGHT
Irrelevant tokens in retrieved context dilute generation quality; compressing to relevant sentences improves signal-to-noise ratio.
### The Noise-in-Context Problem
Retrieved chunks often contain the answer but also surrounding context that misdirects or confuses the LLM. Context compression uses an LLM to extract only the relevant sentences from each chunk, reducing token cost and improving relevance.
### LLM-Based Compression
```python
from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()


def compressChunk(chunk_text: str, query: str, model: str = "gpt-4o-mini") -> str:
    """Return only the sentences of chunk_text that are relevant to query."""
    system_prompt = (
        "You are a precise technical assistant. Given a document passage "
        "and a user question, extract ONLY the sentences directly relevant "
        "to answering the question. Discard background, commentary, and "
        "tangential content. Preserve code blocks if relevant. "
        "Return the compressed passage with no additional text."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Question: {query}\n\nPassage: {chunk_text}"},
        ],
        temperature=0.0,
        max_tokens=512,
    )
    # message.content is optional and may be None, so fall back to "".
    return (response.choices[0].message.content or "").strip()


def compressContext(
    chunks: list[dict],
    query: str,
    model: str = "gpt-4o-mini",
) -> list[dict]:
    """Compress each chunk against the query, dropping chunks that come back empty."""
    compressed = []
    for chunk in chunks:
        compressed_text = compressChunk(chunk["text"], query, model)
        if compressed_text:  # Only include non-empty results
            compressed.append({
                **chunk,
                "text": compressed_text,
                "original_length": len(chunk["text"]),
                "compressed_length": len(compressed_text),
            })
    return compressed
```
### Token Savings Measurement
```python
def measureCompressionSavings(
    original_chunks: list[dict],
    compressed_chunks: list[dict],
) -> dict:
    """Estimate token savings from character counts."""
    # Count characters over the full original list so that chunks dropped
    # entirely during compression still count toward the savings.
    original_chars = sum(len(c["text"]) for c in original_chunks)
    compressed_chars = sum(c["compressed_length"] for c in compressed_chunks)
    # Use rough estimate: 4 chars per token
    orig_est = original_chars // 4
    comp_est = compressed_chars // 4
    savings = (orig_est - comp_est) / orig_est if orig_est > 0 else 0
    return {
        "original_tokens_estimate": orig_est,
        "compressed_tokens_estimate": comp_est,
        "savings_percent": round(savings * 100, 1),
    }
```
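To make the arithmetic concrete, here is a small worked example of the same estimate. The character counts are made up for illustration and are not measurements; the 4-characters-per-token ratio is the same rough heuristic used above and will vary by tokenizer and language.

```python
# Hypothetical character counts for three chunks, before and after compression.
original_chars = [1200, 800, 400]
compressed_chars = [300, 200, 100]

CHARS_PER_TOKEN = 4  # rough heuristic, not an exact tokenizer count

orig_est = sum(original_chars) // CHARS_PER_TOKEN    # 2400 // 4 = 600
comp_est = sum(compressed_chars) // CHARS_PER_TOKEN  # 600 // 4 = 150
savings_percent = round((orig_est - comp_est) / orig_est * 100, 1)

print(orig_est, comp_est, savings_percent)  # 600 150 75.0
```

This figure covers only the tokens sent to the final generation step. The compression calls themselves still consume input tokens for every chunk, so the net saving across the whole pipeline is smaller.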
### Condensing via Extract-and-Synthesize
A two-pass approach: extract relevant facts first, then synthesize into a dense paragraph.
```python
def condenseContext(chunks: list[dict], query: str) -> str:
    """Two-pass condensation: extract per chunk, then synthesize one summary."""
    # Pass 1: Extract all relevant snippets, skipping chunks that yield nothing
    extracted = [compressChunk(c["text"], query) for c in chunks]
    combined = "\n\n".join(text for text in extracted if text)
    if not combined:
        return ""  # Nothing relevant was extracted, so skip the second call

    # Pass 2: Synthesize into coherent context
    synthesis_prompt = (
        "Synthesize the following extracted passages into a single "
        "coherent technical summary that answers the question. "
        "Preserve all factual claims. Output only the summary."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": synthesis_prompt},
            {"role": "user", "content": f"Question: {query}\n\n{combined}"},
        ],
        temperature=0.0,
        max_tokens=1024,
    )
    # message.content is optional and may be None, so fall back to "".
    return (response.choices[0].message.content or "").strip()
```
### Failure Modes
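Compression can over-compress, removing context that later sentences depend on, such as the antecedent of a pronoun or the target of a cross-sentence reference. Always verify that the compressed context still contains the entity names the generation step needs. Latency also increases significantly because each chunk requires its own LLM call; running the calls concurrently or batching several chunks into one request mitigates this. Streaming the API response shortens time to first token but does not reduce total latency.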
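
The entity check and the latency mitigation can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `missing_entities` and `compress_concurrently` are hypothetical helper names, the capitalized-word regex is a crude stand-in for real entity recognition, and `compress_fn` is assumed to be any callable taking `(chunk_text, query)` and returning a string, such as `compressChunk` above. A stub replaces the LLM call so the example runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def missing_entities(original: str, compressed: str) -> set[str]:
    """Return capitalized terms present in the original but lost in the compressed text.

    Assumption: a capitalized word of 2+ characters approximates an entity name.
    This heuristic also picks up ordinary sentence-initial words.
    """
    pattern = r"\b[A-Z][A-Za-z0-9]+\b"
    return set(re.findall(pattern, original)) - set(re.findall(pattern, compressed))


def compress_concurrently(
    chunks: list[dict],
    query: str,
    compress_fn: Callable[[str, str], str],
    max_workers: int = 8,
) -> list[str]:
    """Run compress_fn over all chunks in a thread pool, preserving input order."""
    # Threads suit this workload because each call mostly waits on network I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: compress_fn(c["text"], query), chunks))


def fake_compress(chunk_text: str, query: str) -> str:
    """Stub for an LLM call: keeps only the last sentence, dropping the antecedent of 'It'."""
    return chunk_text.split(". ")[-1]


chunks = [{"text": "Redis stores data in memory. It supports persistence via snapshots."}]
results = compress_concurrently(chunks, "How does it persist data?", fake_compress)
lost = missing_entities(chunks[0]["text"], results[0])

print(results[0])  # It supports persistence via snapshots.
print(lost)        # {'Redis'}
```

A non-empty result from `missing_entities` is a signal to fall back to the uncompressed chunk or to re-run compression with a less aggressive prompt. Keep `max_workers` within the rate limits of the API being called.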