
Summarization

Condensing long documents into shorter summaries — extractive (pulling key sentences) or abstractive (rewriting in fewer words). Long-context capable models excel here.

Capability notes

Long-document summarization splits into **extractive** (selecting existing sentences) and **abstractive** (generating new compressed text). Modern LLMs are natively abstractive; extractive is a subset capability. Abstractive produces 30-50% more compact summaries at the cost of faithfulness risk (hallucinated facts).

**Context window is the binding constraint.** Documents exceeding the model's context require chunking: splitting into overlapping segments, summarizing each, then synthesizing. This map-reduce approach introduces compositional error, because chunk summaries miss cross-chunk relationships. At 128K context, supported by [Llama 3.3 70B](/models/llama-3-3-70b), [Qwen 3 32B](/models/qwen-3-32b), [DeepSeek V4](/models/deepseek-v4), and [Command R+](/models/command-r-plus-08-2024), full-document summarization without chunking works for documents up to ~96K tokens (~300 pages). Beyond that, context utilization degrades from the "lost in the middle" effect.

**Faithfulness** is the core problem: a human cannot verify summary faithfulness at scale, because checking a summary means reading the full source document. Automated metrics (SummaC, AlignScore, FactCC) correlate with human judgment at 0.70-0.85, meaning 15-30% of factual errors evade automated checks. No current method guarantees 100% faithful summaries. Extractive summarization (selecting verbatim sentences via [BGE-M3](/models/bge-m3) sentence scoring) is the conservative choice when faithfulness is critical: faithfulness is near-perfect, but summaries are longer and less coherent.

**Model notes**: [Llama 3.3 70B](/models/llama-3-3-70b) at 128K context is the most reliable open-weight summarizer. [Command R+](/models/command-r-plus-08-2024) outperforms on multi-document summarization but trails on single-document factual consistency. [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) MoE leads on long-document faithfulness at the cost of serving complexity. 7B models lose mid-document details at higher rates than 70B models; the quality gap on long-document summarization is larger than on any other NLP task.
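
A minimal sketch of that chunking step, assuming a Hugging Face tokenizer. The 6K chunk size and 10% overlap are illustrative defaults, and the Qwen tokenizer stands in for whichever model you actually serve:

```python
# Overlapping token-window chunker for map-reduce summarization.
from transformers import AutoTokenizer

def chunk_document(text: str, tokenizer, chunk_tokens: int = 6000, overlap: int = 600):
    """Split text into overlapping token windows so cross-chunk context isn't lost entirely."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(tokenizer.decode(ids[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(ids):
            break
        start += chunk_tokens - overlap  # step forward, keeping ~10% overlap
    return chunks

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
chunks = chunk_document(open("report.txt").read(), tok)
```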

If you just want to try this

Lowest-friction path to a working setup.

Install [Ollama](/tools/ollama), pull [Llama 3.3 70B](/models/llama-3-3-70b) at Q4 (`ollama pull llama3.3:70b`), and paste your document with: "Summarize this document in 3 paragraphs. Include only facts from the document. Do not add any information not present in the source." The constraint clause reduces hallucination. This requires ~40 GB combined RAM+VRAM at Q4. If you don't have that, use [Llama 3.3 70B](/models/llama-3-3-70b) via a cloud API (Groq, Together, Fireworks); latency is ~5-15 seconds for a 50-page document.

For documents longer than ~300 pages (exceeding 128K context): split into 10-15 page chunks with 1-page overlap, summarize each chunk, then feed all chunk summaries into a final synthesis pass. This adds ~30% compute but produces higher-quality summaries than truncated-document approaches.

For extractive-only (conservative, verbatim): use [BGE-M3](/models/bge-m3) via SentenceTransformers. Encode all sentences, compute the document centroid embedding, and select the 20-30 sentences closest to the centroid, as sketched below. This produces a topic-representative extract in under 5 seconds on CPU, no GPU needed. Quality is lower (sentences don't flow) but faithfulness is 100%.

Don't use a 7B model for summarization. Small models lose track of mid-document details at dramatically higher rates than 70B-class models.
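
A minimal sketch of that extractive path (`pip install sentence-transformers` assumed; the regex sentence splitter and the top-25 cutoff are naive placeholders):

```python
# Centroid-based extractive summarization with BGE-M3.
# Output sentences are verbatim from the source, so faithfulness is guaranteed.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def extractive_summary(text: str, top_k: int = 25) -> str:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]  # naive splitter
    model = SentenceTransformer("BAAI/bge-m3")
    emb = model.encode(sentences, normalize_embeddings=True)
    centroid = emb.mean(axis=0)
    scores = emb @ centroid                     # similarity of each sentence to the document centroid
    keep = sorted(np.argsort(scores)[-top_k:])  # top-k sentences, restored to document order
    return " ".join(sentences[i] for i in keep)

print(extractive_summary(open("report.txt").read()))
```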

For production deployment

Operator-grade recommendation.

Production summarization operates on three axes: document length, faithfulness requirements, and throughput.

**Map-reduce pipeline (documents of any length, throughput prioritized)**: Split into overlapping chunks (4K-8K tokens, 10% overlap). Summarize each chunk with a 7B-32B model for speed. Synthesize with a 70B model for quality. This gives 70B-class summary quality at ~70% of the cost of running 70B on the full document. Throughput: 10-30 documents/minute on an [RTX 4090](/hardware/rtx-4090).

**Long-context pipeline (documents <96K tokens, faithfulness prioritized)**: Feed the full document to a 70B+ model with 128K context. Avoids compositional error. Cost is 3-5× higher due to O(n²) attention. Use for legal, medical, and financial work where missing cross-document relationships has compliance consequences. Throughput: 2-5 documents/minute on an [RTX 4090](/hardware/rtx-4090).

**Hybrid retrieval-augmented (very long documents, domain-specific)**: Index the document with [BGE-M3](/models/bge-m3) in a vector store. For each section heading, retrieve the top 20 relevant sentences. Feed the retrieved sentences + heading to the LLM for a section-level summary, then synthesize the section summaries. Handles documents of any length with controlled faithfulness (retrieved sentences are verbatim). Used by [Continue.dev](/tools/continue) for codebase summarization.

**Faithfulness verification**: Run an NLI model (RoBERTa-large-MNLI) over each factual claim in the summary against the source. Flag claims failing NLI for human review. Adds 20-50% latency and catches 70-85% of hallucinated facts. [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) serves as a lightweight faithfulness check.

**Deployment**: Run summarization as a microservice behind FastAPI with a [vLLM](/tools/vllm) backend, chunked prefill enabled for long context. Monitor summary length, generation time, and the NLI faithfulness score trend. Queue by priority (interactive before batch).
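
A minimal sketch of that verification pass with the stock `roberta-large-mnli` checkpoint. The 0.5 entailment threshold is an illustrative starting point, and the 512-token truncation here will drop evidence on long sources (use chunk-level premises in practice):

```python
# NLI faithfulness check: flag summary sentences the source does not entail.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").eval()

def flag_unsupported(source: str, summary_sentences: list[str], threshold: float = 0.5):
    flagged = []
    for sent in summary_sentences:
        # premise = source document, hypothesis = summary claim
        batch = tok(source, sent, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            probs = nli(**batch).logits.softmax(dim=-1)[0]
        p_entail = probs[2].item()  # label order: 0 contradiction, 1 neutral, 2 entailment
        if p_entail < threshold:
            flagged.append((sent, p_entail))  # route to human review
    return flagged
```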

What breaks

Failure modes operators see in the wild.

- **Mid-document detail loss ("lost in the middle").** LLMs attend more to beginnings and endings. Facts in the middle 40-60% of a document are 30-50% less likely to appear in summaries. Mitigation: multi-pass summarization with different document orderings; ensemble summaries.
- **Hallucinated facts.** The model adds plausible-sounding details (a person's title, a date, a causal relationship) absent from the source. Most dangerous because they look correct and readers cannot detect them without reading the source. Mitigation: an NLI verification pass. Prompting with "if unsure about a detail, omit rather than guess" reduces hallucination 20-40%.
- **Positional bias.** Summaries overrepresent the first and last 10% of documents, and this bias persists across model sizes. Mitigation: multi-pass with different orderings, plus an explicit instruction: "pay equal attention to all sections, including the middle."
- **Impossible faithfulness verification at scale.** At 100,000 summaries daily, human verification is impossible, and automated NLI misses 15-30% of errors. Mitigation: stratified random sampling for human QA (1-5%) and tracking the error rate over time. Treat the irreducible error as business risk.
- **Cross-document contradiction.** Multi-document summaries can self-contradict when sources disagree. Mitigation: a contradiction-detection pass comparing every pair of summary sentences, which is O(n²) in summary length; expensive but necessary for legal/scientific use.
- **Context overflow without warning.** A 130K-token document fed to a 128K model gets silently truncated: the last 2K tokens (often the conclusions) are discarded. Mitigation: pre-flight token counting, chunking if the document exceeds 90% of the context window, and logged warnings on truncation.

Hardware guidance

**Hobbyist ($500-$1,500)**: [RTX 3060 12GB](/hardware/rtx-3060-12gb) or [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb). Map-reduce only, with 7-8B models per chunk at Q4-Q5. CPU+RAM is viable with [llama.cpp](/tools/llama-cpp): 32 GB RAM runs 7B Q4 at 10-20 tok/s. A [MacBook Pro 16 M4 Max 64GB](/hardware/macbook-pro-16-m4-max) handles 70B Q4 at 25-35 tok/s via [MLX LM](/tools/mlx-lm), the most compact setup for quality summarization.

**SMB ($2,000-$4,000)**: [RTX 4090 24GB](/hardware/rtx-4090) or [RTX 5090 32GB](/hardware/rtx-5090). The 4090 runs 70B Q4 with partial offload at 15-25 tok/s. The 5090's 32 GB fits 70B Q4 entirely in VRAM with 16K-32K context. Full-document abstractive at 2-5 docs/min; map-reduce at 20-50 docs/min.

**Enterprise ($8,000-$25,000)**: 2× [RTX 5090](/hardware/rtx-5090) (64 GB total) or an [RTX A6000](/hardware/rtx-a6000) 48 GB. Runs 70B Q8 with full 128K context, or [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) Q4. [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB for sustained 24/7 production with 10-50 concurrent users.

**Frontier ($50,000+)**: 4-8× [H100 PCIe](/hardware/nvidia-h100-pcie) or [MI300X](/hardware/amd-mi300x). Runs [DeepSeek V4](/models/deepseek-v4) or [Qwen 3 235B](/models/qwen-3-235b-a22b) at full precision, 128K context, 50+ concurrent users. For internal deployments serving 10,000+ employees.

**Memory bandwidth matters more than compute** for long-context summarization: attention over 128K tokens is bandwidth-bound. High-bandwidth cards ([RTX 5090](/hardware/rtx-5090) at 1.79 TB/s, [H100 SXM](/hardware/nvidia-h100-sxm) at 3.35 TB/s) dramatically outperform similar-memory, lower-bandwidth cards.

Runtime guidance

**If summarizing for personal use** → [Ollama](/tools/ollama) with [Llama 3.3 70B](/models/llama-3-3-70b) Q4. Handles full documents up to 128K context on 40+ GB combined memory. On Apple Silicon: [MLX LM](/tools/mlx-lm) or [LM Studio](/tools/lm-studio).

**If building a production summarization API** → [vLLM](/tools/vllm) with chunked prefill (`--enable-chunked-prefill`). Critical for long context: it breaks 100K-token prefills into small chunks that interleave with other requests. Without it, one request blocks all others for 5-30 seconds. Set `--max-model-len 131072`.

**If needing map-reduce at scale** → Queue (Redis/SQS) → worker pool running vLLM on the chunk model (7B-32B, fast) → aggregation worker running vLLM on the synthesis model (70B+, slow). Scale the workers independently. Pre-filter chunks with [BGE-M3](/models/bge-m3) sentence-level extraction to improve information density.

**If faithfulness > speed** → Two-stage: [Hugging Face Transformers](/tools/transformers) for summarization, plus NLI verification (RoBERTa-large-MNLI) of each summary sentence against the source. Doubles inference time but provides per-sentence faithfulness scores. Generate → verify → flag low-confidence → optionally regenerate.

**If building RAG summarization** → [BGE-M3](/models/bge-m3) + [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) for an extractive baseline, then an LLM for abstractive compression. The extractive pass serves as a faithfulness floor: fall back to verbatim sentences if the abstractive summary fails NLI. Store both; serve abstractive by default.

**If CPU-only** → [llama.cpp](/tools/llama-cpp) with Q4 7B-13B models. Map-reduce is viable at 1-5 documents/hour for full books. Not interactive, but functional for overnight batch processing.
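
A minimal sketch of that vLLM setup using the offline `LLM` API; the same options map to `vllm serve` flags for the FastAPI path. The `tensor_parallel_size` is an assumption to adjust for your GPU count, and a 70B model needs the VRAM tiers from the hardware section:

```python
# Long-context vLLM engine with chunked prefill, per the guidance above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    max_model_len=131072,           # full 128K context
    enable_chunked_prefill=True,    # long prefills no longer block other requests
    tensor_parallel_size=2,         # assumption: split across 2 GPUs
    gpu_memory_utilization=0.90,
)

document = open("contract.txt").read()
params = SamplingParams(temperature=0.2, max_tokens=1024)
out = llm.generate([f"Summarize this document in 3 paragraphs. "
                    f"Include only facts from the document.\n\n{document}"], params)
print(out[0].outputs[0].text)
```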

Setup walkthrough

  1. Install [Ollama](/tools/ollama) → `ollama pull llama3.2:3b` (~2 GB, fast and light).
  2. Run `ollama run llama3.2:3b` and paste a long article with the prompt: "Summarize the following article in 3 bullet points."
  3. The model produces a concise summary; first response in 2-5 seconds.
  4. For long documents (>8K tokens), use a long-context model: `ollama pull qwen2.5:14b` (~9 GB, 128K context window).
  5. For batch summarization, pipe documents via the CLI:
`cat long_report.txt | ollama run llama3.2:3b "Summarize in 3 paragraphs:"`
  6. For production: use LangChain or LlamaIndex with a map-reduce or refine summarization chain for documents exceeding the context window, or hand-roll the chain as sketched below.
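
If you'd rather not add a framework, a hand-rolled map-reduce chain is short with the `ollama` Python client (`pip install ollama`; the character-based chunk sizes are illustrative, and the `qwen2.5:14b` model from step 4 is assumed pulled):

```python
# Hand-rolled map-reduce summarization over the Ollama API.
import ollama

def summarize(text: str, model: str = "qwen2.5:14b") -> str:
    resp = ollama.chat(model=model, messages=[{
        "role": "user",
        "content": f"Summarize in one paragraph. Use only facts from the text.\n\n{text}",
    }])
    return resp["message"]["content"]

def map_reduce(document: str, chunk_chars: int = 24000, overlap_chars: int = 2000) -> str:
    step = chunk_chars - overlap_chars
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), step)]
    partials = [summarize(c) for c in chunks]   # map: per-chunk summaries
    return summarize("\n\n".join(partials))     # reduce: synthesize into one summary

print(map_reduce(open("long_report.txt").read()))
```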

The cheap setup

Summarization is CPU-friendly for small models. Llama 3.2 3B runs at 20-40 tok/s on a modern laptop CPU (Ryzen 5/Intel i5) — a 5,000-word article summarizes in 30-60 seconds. Any $300 laptop handles light summarization. For heavier use, a used GTX 1060 6 GB ($60) runs Qwen 2.5 7B at 40-60 tok/s — a long report summarizes in 10-15 seconds. If you summarize 100+ documents/day, an RTX 3060 12 GB ($200-250) is worth it for throughput.

The serious setup

Used [RTX 3090](/hardware/rtx-3090) 24 GB (~$700-900). Runs Qwen 2.5 32B at 40-60 tok/s and handles 100K-token documents in a single pass (128K context window). Llama 3.3 70B Q4_K_M at 15-25 tok/s for highest-quality summarization. Can process 500+ documents/hour in batch. Pair with a Ryzen 7 7700X + 64 GB DDR5 + 2 TB NVMe. Total: ~$1,800-2,200. For enterprise document summarization at scale, use vLLM with continuous batching; throughput jumps 3-5× vs. single-request inference.

Common beginner mistake

**The mistake**: Pasting a 50-page document into a 3B model with a 4K context window and wondering why the summary is incoherent or cuts off mid-sentence.

**Why it fails**: The context window is a hard limit, and small local setups often run with short ones (a 4K window is ~3,000 words). The model literally cannot read past the first few pages; the rest is silently truncated.

**The fix**: Check your model's context window. For documents >4K tokens, use a model with 32K+ context (Llama 3.1 8B = 128K, Qwen 2.5 7B = 128K, Mistral Nemo 12B = 128K). For documents exceeding even 128K tokens, use a map-reduce chain (split into chunks, summarize each, then summarize the summaries). Never assume the model can read your entire document; verify the token count first, as in the sketch below.
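
A minimal pre-flight check along those lines, assuming `transformers` (swap in the tokenizer of whichever model you actually serve):

```python
# Pre-flight token count: catch silent truncation before it happens.
from transformers import AutoTokenizer

def fits_in_context(text: str, tokenizer_name: str, context_window: int) -> bool:
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = len(tok.encode(text))
    budget = int(context_window * 0.9)  # ~10% headroom for prompt template + output
    print(f"{n_tokens} tokens vs. budget of {budget}")
    return n_tokens <= budget

if not fits_in_context(open("report.txt").read(), "Qwen/Qwen2.5-7B-Instruct", 131072):
    print("Too long for one pass: switch to a map-reduce chain.")
```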

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead (see the sketch after this list)
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
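
A back-of-envelope model for that first mistake (the layer and head counts below are the published Llama 3.3 70B values; FP16 cache and batch size 1 assumed):

```python
# KV cache bytes/token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_tokens / 1e9

# Llama 3.3 70B: 80 layers, 8 KV heads (GQA), head dim 128.
print(f"{kv_cache_gb(80, 8, 128, 128_000):.1f} GB at 128K context")  # ~41.9 GB
print(f"{kv_cache_gb(80, 8, 128, 16_000):.1f} GB at 16K context")    # ~5.2 GB
```

Weights that "fit" on the spec sheet can still overflow VRAM once the cache for your real context length is added.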
