08. Document Summarization

Chapter 8 of 18 · 25 min

KEY INSIGHT

Extractive summarization (selecting important sentences) works without LLMs and runs fast; abstractive summarization (generating new text) requires LLMs but produces more coherent output.

### Two Summarization Approaches

Extractive summarization selects existing sentences from the document. No language generation is required, so it is faster and more reliable, but the output may read as choppy.

Abstractive summarization generates new text that paraphrases the content. The result is more coherent, but it requires an LLM and may introduce hallucinations.

### Extractive Summarization with TF-IDF

Extract the most important sentences using TF-IDF scoring:

```python
import re

import fitz  # PyMuPDF
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summarize(text, num_sentences=5):
    # Split into sentences on terminal punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    sentences = [s for s in sentences if len(s) > 20]  # Drop very short fragments

    if len(sentences) <= num_sentences:
        return text

    # TF-IDF scoring: each sentence is treated as its own "document"
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Score each sentence by the sum of its TF-IDF values
    sentence_scores = np.array(tfidf_matrix.sum(axis=1)).flatten()

    # Pick the highest-scoring sentences, then restore document order
    top_indices = sentence_scores.argsort()[-num_sentences:]
    top_indices.sort()

    summary = ' '.join(sentences[i] for i in top_indices)
    return summary

# Usage
doc = fitz.open("document.pdf")
text = doc[0].get_text()  # First page only
doc.close()

summary = extractive_summarize(text, num_sentences=5)
print(summary)
```

### LexRank for Better Extraction

LexRank uses graph-based ranking similar to Google's PageRank.
It often produces more coherent summaries than simple TF-IDF scoring:

```bash
pip install sumy
```

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

def lexrank_summarize(text, num_sentences=5):
    # Note: sumy's Tokenizer relies on NLTK sentence-tokenizer data,
    # which may need a one-time download before first use
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    stemmer = Stemmer("english")

    summarizer = LexRankSummarizer(stemmer)
    summarizer.stop_words = get_stop_words("english")

    summary = summarizer(parser.document, sentences_count=num_sentences)
    return ' '.join(str(sentence) for sentence in summary)

summary = lexrank_summarize(text)
print(summary)
```

### Abstractive Summarization with Local LLMs

For coherent, human-readable summaries, use local LLMs:

```bash
pip install llama-cpp-python transformers
```

```python
from llama_cpp import Llama
import fitz  # PyMuPDF

llm = Llama(
    model_path="./models/llama-2-7b-chat.gguf",
    n_ctx=4096,
    n_threads=4
)

def summarize_llm(text, max_tokens=200):
    # Truncate to 4,000 characters to keep the prompt within the context window
    prompt = f"""Summarize the following document in 3-5 sentences:

{text[:4000]}

Summary:"""

    response = llm(prompt, max_tokens=max_tokens, temperature=0.3)
    return response['choices'][0]['text'].strip()

doc = fitz.open("document.pdf")
text = " ".join(page.get_text() for page in doc)
doc.close()

summary = summarize_llm(text)
print(summary)
```

A low temperature such as 0.3 keeps the output conservative and close to the source, which reduces (but does not eliminate) hallucination. Higher temperatures produce more creative but less reliable summaries.

### Hybrid Approach: Extract + Abstract

Combine extractive and abstractive summarization for the best results:

```python
def hybrid_summarize(text):
    # First extract key sentences
    extracted = extractive_summarize(text, num_sentences=10)

    # Then abstract with the LLM
    summary = summarize_llm(extracted)
    return summary
```

This approach reduces the input length for the LLM (faster, cheaper) while preserving the key information.
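
To see how much the extraction step shrinks the LLM input, it can help to compare lengths before and after. The sketch below is illustrative only: `frequency_extract` is a hypothetical, dependency-free stand-in for the extractive step (plain word-frequency scoring, not the TF-IDF function above), and `reduction_ratio` is a helper name introduced here.

```python
import re
from collections import Counter

def frequency_extract(text, num_sentences=3):
    """Hypothetical stand-in extractor: rank sentences by summed word frequency."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    if len(sentences) <= num_sentences:
        return text
    word_counts = Counter(re.findall(r'[a-z]+', text.lower()))
    scores = [
        sum(word_counts[w] for w in re.findall(r'[a-z]+', s.lower()))
        for s in sentences
    ]
    # Take the top-scoring sentence indices, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i])
    top = sorted(ranked[-num_sentences:])
    return ' '.join(sentences[i] for i in top)

def reduction_ratio(original, extracted):
    """Fraction of the original character count removed before the LLM call."""
    if not original:
        return 0.0
    return 1 - len(extracted) / len(original)

sample = (
    "PDF parsing extracts text from each page. "
    "The weather was pleasant. "
    "Extracted text from each page feeds the PDF summarizer. "
    "Lunch was late. "
    "The summarizer condenses page text into a short summary. "
    "Nobody minded."
)
extracted = frequency_extract(sample, num_sentences=3)
print(f"{reduction_ratio(sample, extracted):.0%} fewer characters sent to the LLM")
```

On a real document the ratio depends on `num_sentences` and on sentence length, so it is worth checking it against the prompt budget of the model in use.
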
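### Handling Long Documents

Documents longer than the LLM's context window require chunking:

```python
def chunk_summarize(text, chunk_size=2000, overlap=200):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid a trailing chunk that is pure overlap
        start = end - overlap  # Step back so consecutive chunks overlap

    # Summarize each chunk
    chunk_summaries = [summarize_llm(chunk) for chunk in chunks]

    # Final summary of summaries
    combined = " ".join(chunk_summaries)
    return summarize_llm(combined)

long_text = "..."  # Your full document
summary = chunk_summarize(long_text)
```

The overlap repeats the end of each chunk at the start of the next, so text cut at a chunk boundary still appears with some surrounding context. Note that `summarize_llm` truncates its input to 4,000 characters, so for very long documents the combined chunk summaries may themselves need another round of chunking.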
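
The chunk boundaries can be worked out without calling a model. Here is a minimal sketch, assuming character-based windows like those in `chunk_summarize`; `chunk_spans` is a hypothetical helper introduced here only to make the arithmetic visible.

```python
def chunk_spans(text_length, chunk_size=2000, overlap=200):
    """Return (start, end) character offsets for overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break  # Final chunk reached
        start = end - overlap  # Next chunk re-reads the last `overlap` characters
    return spans

# A 5,000-character document with the default settings
for start, end in chunk_spans(5000):
    print(start, end)
# 0 2000
# 1800 3800
# 3600 5000
```

Each chunk after the first starts 200 characters before the previous one ended, so three LLM calls cover the document, plus one more for the final summary of summaries.
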

EXERCISE

Take a 10+ page document (research paper, report, article). Generate summaries using: (1) TF-IDF extractive, (2) LexRank extractive, (3) local LLM abstractive, and (4) the hybrid approach. Evaluate each on coherence (does it read naturally?), coverage (does it capture the main points?), and length (is it appropriate for skimming?). Identify which approach works best for your use case.
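
For a rough side-by-side comparison, length and a crude coverage proxy can be computed automatically; coherence still needs a human read. The sketch below is one possible aid, not a standard metric: `keyword_coverage`, `compare_summaries`, and the tiny `STOP_WORDS` set are all assumptions introduced for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; an assumption, not a standard resource
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "that", "with"}

def content_words(text):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def keyword_coverage(document, summary, top_n=10):
    """Share of the document's top_n most frequent content words found in the summary."""
    top_words = [w for w, _ in Counter(content_words(document)).most_common(top_n)]
    if not top_words:
        return 0.0
    summary_words = set(content_words(summary))
    return sum(w in summary_words for w in top_words) / len(top_words)

def compare_summaries(document, summaries, top_n=10):
    """Return {name: (word_count, coverage)} for each candidate summary."""
    return {
        name: (len(summary.split()), keyword_coverage(document, summary, top_n))
        for name, summary in summaries.items()
    }

document = "Cats sleep a lot. Cats hunt mice. Mice fear cats. Dogs bark."
summaries = {
    "good": "Cats hunt mice.",
    "poor": "Dogs bark.",
}
for name, (words, coverage) in compare_summaries(document, summaries, top_n=2).items():
    print(f"{name}: {words} words, coverage {coverage:.2f}")
```

A high score only means the summary mentions the document's frequent words; it says nothing about whether the summary is accurate or readable.
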