13. Context Compression

Chapter 13 of 22 · 20 min

Large language models have limited context windows, and even when the retrieved content fits, longer contexts increase latency and cost. Context compression trims retrieved content down to the parts most relevant to the query.

Maximal Marginal Relevance for Extraction

MMR selects chunks that are both relevant to the query and diverse from each other. This prevents redundant information while ensuring coverage.
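
Written as a selection rule, each step picks the candidate with the best trade-off between relevance and redundancy. The notation here is assumed for illustration: q is the query, C the candidate set, S the chunks already selected, sim is cosine similarity, and \lambda corresponds to the lambda_mult parameter. When S is empty, the inner maximum is taken to be 0.

```latex
\mathrm{MMR} = \arg\max_{d_i \in C \setminus S}
  \Big[ \lambda \, \mathrm{sim}(d_i, q)
        - (1 - \lambda) \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \Big]
```

With \lambda = 1 the rule reduces to plain relevance ranking. Lower values penalize near-duplicates of already selected chunks more heavily.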

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def mmr_compress(query_embedding: np.ndarray,
                 chunk_embeddings: list,
                 chunks: list,
                 fetch_k: int = 20,
                 extract_k: int = 5,
                 lambda_mult: float = 0.5) -> list:
    """Maximal marginal relevance compression.

    Assumes `chunks` and `chunk_embeddings` are aligned and already ranked by
    the retriever, so the first `fetch_k` entries are the best candidates.
    """

    # Get more candidates than we need. Copy into plain lists so `del` below
    # also works when the embeddings arrive as a NumPy array.
    candidates = list(chunks[:fetch_k])
    candidate_embs = list(chunk_embeddings[:fetch_k])

    selected = []
    selected_embs = []

    # Never try to select more chunks than there are candidates
    for _ in range(min(extract_k, len(candidates))):
        scores = []

        for i, emb in enumerate(candidate_embs):
            # Relevance to query
            query_sim = cosine_similarity([query_embedding], [emb])[0][0]

            # Redundancy: highest similarity to any already-selected chunk
            if selected_embs:
                max_sim = max(cosine_similarity([emb], selected_embs)[0])
            else:
                max_sim = 0.0

            # MMR score: reward relevance, penalize redundancy
            mmr_score = lambda_mult * query_sim - (1 - lambda_mult) * max_sim
            scores.append((i, mmr_score))

        # Select highest MMR score
        best_idx, _ = max(scores, key=lambda x: x[1])

        selected.append(candidates[best_idx])
        selected_embs.append(candidate_embs[best_idx])

        # Remove from candidates
        del candidates[best_idx]
        del candidate_embs[best_idx]

    return selected
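
To see the effect of the relevance/diversity trade-off, here is a small self-contained sketch on toy 2-D vectors. It is a compact NumPy-only variant of the same idea, not the function above; `cosine`, `mmr_select`, and the example vectors are made up for illustration.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_emb, embs, k, lambda_mult=0.5):
    """Return the indices of up to k items chosen greedily by MMR."""
    remaining = list(range(len(embs)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_emb, embs[i])
            # Highest similarity to anything already picked (0 if nothing yet)
            redundancy = max((cosine(embs[i], embs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = np.array([1.0, 0.0])
embs = [
    np.array([1.0, 0.30]),   # chunk 0: most relevant
    np.array([1.0, 0.32]),   # chunk 1: near-duplicate of chunk 0
    np.array([1.0, -0.50]),  # chunk 2: less relevant, but different content
]

mmr_picks = mmr_select(query, embs, k=2, lambda_mult=0.5)
relevance_picks = mmr_select(query, embs, k=2, lambda_mult=1.0)
print(mmr_picks, relevance_picks)
```

With lambda_mult=1.0 the near-duplicate is selected second because only relevance counts. With 0.5 the redundancy penalty pushes it below the less relevant but more distinct chunk.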

LLM-Based Compression

For more aggressive compression, use an LLM to extract only relevant sentences or rephrase the content.

import openai

def llm_compress(query: str, context: str, max_tokens: int = 500) -> str:
    """Use LLM to extract only relevant content."""

    # Requires openai>=1.0 with the OPENAI_API_KEY environment variable set.
    response = openai.chat.completions.create(
        model="gpt-4o",
        # Hard cap on output length; the limit in the prompt is only a request
        max_tokens=max_tokens,
        messages=[
            {"role": "system", "content": f"""Extract only the information directly relevant 
to answering the question. Remove examples, tangents, and background info. 
Keep specific facts, numbers, and dates. Output the compressed context in under {max_tokens} tokens."""},
            {"role": "user", "content": f"Question: {query}\n\nContext:\n{context}"}
        ]
    )

    # content can be None, so fall back to an empty string
    return response.choices[0].message.content or ""
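
Because an LLM call can fail or return something unhelpful, it may be worth wrapping the compressor in a guard. The following is a minimal sketch under stated assumptions: `compress_with_fallback` and `compress_fn` are hypothetical names, and `compress_fn` stands for any callable taking (query, context) and returning text, such as the function above.

```python
from typing import Callable, Optional

def compress_with_fallback(query: str,
                           context: str,
                           compress_fn: Callable[[str, str], Optional[str]]) -> str:
    """Run a compressor, but keep the original context if the result is unusable."""
    try:
        compressed = (compress_fn(query, context) or "").strip()
    except Exception:
        # Sketch-level handling: any failure falls back to the uncompressed text
        return context

    # An empty result loses everything; a longer one defeats the purpose
    if not compressed or len(compressed) >= len(context):
        return context
    return compressed

# Usage with a stub compressor standing in for a real LLM call
def first_sentence_only(query: str, context: str) -> str:
    return context.split(". ")[0] + "."

example = compress_with_fallback(
    "When was it founded?",
    "It was founded in 1998. It has a garden. Parking is free.",
    first_sentence_only,
)
print(example)
```

The guard only checks length, not whether the facts needed for the answer survived, so answer quality still has to be evaluated separately.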

Sentence-Level Extraction

For precise control, split chunks into sentences and score each against the query.

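import re

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: embed_model is defined elsewhere and exposes an encode() method
# that returns vectors, e.g. a sentence-transformers SentenceTransformer.

def sentence_level_compress(query: str, chunks: list, top_k: int = 10) -> str:
    """Extract top-relevant sentences from chunks."""

    # Split into sentences, dropping empty fragments
    sentences = []
    for chunk in chunks:
        parts = re.split(r'(?<=[.!?])\s+', chunk.strip())
        sentences.extend(p for p in parts if p)

    # Nothing to score
    if not sentences:
        return ""

    # Score each sentence
    query_emb = embed_model.encode(query)
    sentence_embs = embed_model.encode(sentences)

    similarities = cosine_similarity([query_emb], sentence_embs)[0]

    # Get top sentences, maintaining original order
    top_indices = np.argsort(similarities)[-top_k:]
    top_sentences = [sentences[i] for i in sorted(top_indices)]

    return " ".join(top_sentences)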
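
To try the split-score-select flow without downloading a model, the sketch below swaps the embedding model for word-count vectors. `toy_embed` and `top_sentences` are hypothetical helpers; a word-count vector is only a stand-in and will miss paraphrases that a real embedding model would catch.

```python
import re
import numpy as np

def toy_embed(texts, vocab):
    """Stand-in for an embedding model: word-count vectors over a fixed vocab."""
    rows = []
    for text in texts:
        words = re.findall(r"[a-z0-9]+", text.lower())
        rows.append([words.count(term) for term in vocab])
    return np.array(rows, dtype=float)

def top_sentences(query: str, text: str, top_k: int = 2) -> str:
    """Keep the top_k sentences most similar to the query, in original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    vocab = sorted(set(re.findall(r"[a-z0-9]+", (query + " " + text).lower())))

    q = toy_embed([query], vocab)[0]
    s = toy_embed(sentences, vocab)

    # Cosine similarity, guarding against zero-length vectors
    norms = np.linalg.norm(s, axis=1) * np.linalg.norm(q)
    sims = (s @ q) / np.where(norms == 0, 1.0, norms)

    keep = sorted(np.argsort(sims)[-top_k:])
    return " ".join(sentences[i] for i in keep)

text = ("The library opened in 1998. It has a large garden. "
        "Opening hours are 9 to 5 on weekdays. Parking is free.")
result = top_sentences(
    "When did the library open and what are the opening hours?", text)
print(result)
```

The two sentences that share words with the question are kept and the other two are dropped, while the original sentence order is preserved.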

EXERCISE

Implement sentence-level compression on a 2000-token context. Compare the output to using the full context in terms of answer quality and token cost.
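
For the token-cost side of the comparison, a small measuring helper is enough. This is a rough sketch: `estimate_tokens` and `compression_report` are hypothetical names, and the 4-characters-per-token figure is only a rule of thumb for English text, so use the model's actual tokenizer for real measurements.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate based on character count (an assumption)."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))

def compression_report(full_context: str, compressed: str) -> dict:
    """Summarize how many estimated tokens the compression saved."""
    full = estimate_tokens(full_context)
    small = estimate_tokens(compressed)
    return {
        "full_tokens": full,
        "compressed_tokens": small,
        "tokens_saved": full - small,
        "ratio": small / full if full else 0.0,
    }

# Placeholder strings: roughly a 2000-token context cut down to a quarter
report = compression_report("x" * 8000, "x" * 2000)
print(report)
```

Multiplying tokens_saved by the model's per-token input price gives the cost difference per request. Answer quality has to be judged separately, for example by comparing answers from both contexts against known-correct answers.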