13. Context Compression
Chapter 13 of 22 · 20 min
Large language models have context window limits, and even when the retrieved content fits, longer contexts increase latency and cost. Context compression reduces retrieved content to only the most relevant parts.
Maximal Marginal Relevance for Extraction
MMR selects chunks that are both relevant to the query and diverse from each other. This prevents redundant information while ensuring coverage.
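As a sketch of the selection rule in its commonly used form: at each step, pick the candidate that maximizes the score below. The notation is an assumption made here for illustration: R is the candidate set (the first fetch_k chunks), S is the set already selected, q is the query, sim is cosine similarity, and λ corresponds to lambda_mult in the code.

```latex
\mathrm{MMR} = \arg\max_{d_i \in R \setminus S} \Big[ \lambda \, \mathrm{sim}(d_i, q) - (1 - \lambda) \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \Big]
```

With λ = 1 the rule reduces to plain relevance ranking; as λ approaches 0, similarity to already-selected chunks dominates the score. While S is empty, the max term is taken as 0.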
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def mmr_compress(query_embedding: np.ndarray,
                 chunk_embeddings: list,
                 chunks: list,
                 fetch_k: int = 20,
                 extract_k: int = 5,
                 lambda_mult: float = 0.5) -> list:
    """Maximal marginal relevance compression."""
    # Get more candidates than we need; copy into new lists so the
    # caller's inputs are never mutated (this also accepts NumPy arrays)
    candidates = list(chunks[:fetch_k])
    candidate_embs = list(chunk_embeddings[:fetch_k])
    selected = []
    selected_embs = []
    # Never try to select more chunks than there are candidates
    for _ in range(min(extract_k, len(candidates))):
        scores = []
        for i, emb in enumerate(candidate_embs):
            # Relevance to query
            query_sim = cosine_similarity([query_embedding], [emb])[0][0]
            # Redundancy: highest similarity to anything already selected
            if selected_embs:
                max_sim = max(cosine_similarity([emb], selected_embs)[0])
            else:
                max_sim = 0.0
            # MMR score: balance relevance vs diversity
            mmr_score = lambda_mult * query_sim - (1 - lambda_mult) * max_sim
            scores.append((i, mmr_score))
        # Select highest MMR score
        best_idx, _score = max(scores, key=lambda x: x[1])
        selected.append(candidates[best_idx])
        selected_embs.append(candidate_embs[best_idx])
        # Remove from candidates
        del candidates[best_idx]
        del candidate_embs[best_idx]
    return selected
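To see what lambda_mult does, the function above can be mirrored in a small NumPy-only sketch and run on toy vectors. The helper name mmr_select and the 2-D "embeddings" are made up for illustration; real embeddings would come from an embedding model. Chunks 0 and 1 are near-duplicates, while chunk 2 is less relevant but points in a different direction.

```python
import numpy as np

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by MMR (illustrative helper)."""
    docs = np.asarray(doc_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    # Normalize so that dot products equal cosine similarities
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    relevance = docs @ q        # similarity of each document to the query
    pairwise = docs @ docs.T    # similarity between every pair of documents

    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        if selected:
            # Highest similarity of each remaining doc to any selected doc
            redundancy = pairwise[np.ix_(remaining, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lambda_mult * relevance[remaining] - (1 - lambda_mult) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): docs 0 and 1 are near-duplicates, doc 2 differs
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # → [0, 1] (pure relevance)
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # → [0, 2] (duplicate skipped)
```

With lambda_mult=1.0 the near-duplicate is kept because only relevance counts; at the default 0.5 the redundancy penalty pushes the second pick to the less similar chunk.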
LLM-Based Compression
For more aggressive compression, use an LLM to extract only relevant sentences or rephrase the content.
import openai

def llm_compress(query: str, context: str, max_tokens: int = 500) -> str:
    """Use LLM to extract only relevant content."""
    # The module-level client reads the API key from the OPENAI_API_KEY environment variable
    response = openai.chat.completions.create(
        model="gpt-4o",
        max_tokens=max_tokens,  # hard cap; the limit in the prompt is only a request
        messages=[
            {"role": "system", "content": f"""Extract only the information directly relevant
to answering the question. Remove examples, tangents, and background info.
Keep specific facts, numbers, and dates. Output the compressed context in under {max_tokens} tokens."""},
            {"role": "user", "content": f"Question: {query}\n\nContext:\n{context}"}
        ]
    )
    return response.choices[0].message.content
Sentence-Level Extraction
For precise control, split chunks into sentences and score each against the query.
import re

def sentence_level_compress(query: str, chunks: list, top_k: int = 10) -> str:
    """Extract top-relevant sentences from chunks."""
    # Split into sentences, dropping empty fragments
    sentences = []
    for chunk in chunks:
        sentences.extend(s for s in re.split(r'(?<=[.!?])\s+', chunk.strip()) if s)
    if not sentences:
        return ""
    # Score each sentence.
    # embed_model is assumed to be defined elsewhere: any sentence-embedding
    # model whose encode() accepts a string or a list of strings.
    # np and cosine_similarity are imported in the MMR example above.
    query_emb = embed_model.encode(query)
    sentence_embs = embed_model.encode(sentences)
    similarities = cosine_similarity([query_emb], sentence_embs)[0]
    # Get top sentences, maintaining original order
    top_indices = np.argsort(similarities)[-top_k:]
    top_sentences = [sentences[i] for i in sorted(top_indices)]
    return " ".join(top_sentences)
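The function above depends on an embedding model that this chapter does not define. To try the split, score, and reorder steps without one, here is a dependency-free sketch that swaps in a bag-of-words cosine score as a stand-in for embeddings. The names bow_cosine and extract_sentences are hypothetical, and word overlap is a much weaker relevance signal than a real embedding model.

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between lowercase bag-of-words vectors (stand-in scorer)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def extract_sentences(query: str, chunks: list, top_k: int = 2) -> str:
    """Keep the top_k sentences most similar to the query, in original order."""
    sentences = []
    for chunk in chunks:
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", chunk.strip()) if s)
    scores = [bow_cosine(query, s) for s in sentences]
    # Rank by score, keep the best top_k, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return " ".join(sentences[i] for i in sorted(ranked))

chunks = [
    "The Eiffel Tower opened in 1889. It is located in Paris. Many tourists visit every year.",
    "The tower is 330 metres tall. Paris has many museums.",
]
result = extract_sentences("How tall is the Eiffel Tower?", chunks, top_k=2)
print(result)
# → The Eiffel Tower opened in 1889. The tower is 330 metres tall.
```

The sentence about height scores highest, but the output still lists sentences in their original order, which tends to keep the compressed context readable.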
EXERCISE
Implement sentence-level compression on a 2000-token context. Compare the output to using the full context in terms of answer quality and token cost.
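For the token-cost half of the comparison, a rough starting point might look like the sketch below. The helpers approx_tokens and compression_report are hypothetical, and counting whitespace-separated words is only a crude proxy: for real cost figures, count tokens with the tokenizer of the model you actually call.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())

def compression_report(full_context: str, compressed: str) -> dict:
    """Compare approximate token counts before and after compression."""
    full = approx_tokens(full_context)
    kept = approx_tokens(compressed)
    return {
        "full_tokens": full,
        "compressed_tokens": kept,
        # Fraction of the original that survives compression
        "ratio": kept / full if full else 0.0,
    }

report = compression_report("a b c d e f g h", "a b")
print(report)
# → {'full_tokens': 8, 'compressed_tokens': 2, 'ratio': 0.25}
```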