Condensing long documents into shorter summaries, either extractive (pulling key sentences) or abstractive (rewriting in fewer words). Models with long context windows excel here.
Lowest-friction path to a working setup.
Operator-grade recommendation.
Failure modes operators see in the wild.
1. Pull a small model: `ollama pull llama3.2:3b` (~2 GB, fast and light).
2. Start a session with `ollama run llama3.2:3b` and paste a long article with the prompt: "Summarize the following article in 3 bullet points."
3. For longer documents, pull a long-context model: `ollama pull qwen2.5:14b` (~9 GB, 128K context window).
4. Or pipe a file straight through: `cat long_report.txt | ollama run llama3.2:3b "Summarize in 3 paragraphs:"`
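One caveat: Ollama serves models with a short default context window, so the 128K figure only applies once you raise it. A minimal sketch using the interactive REPL; 32768 is an assumed value, and VRAM use grows with the window:

```bash
ollama run qwen2.5:14b
# inside the REPL, raise the context window for this session:
>>> /set parameter num_ctx 32768
```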
Summarization is CPU-friendly for small models. Llama 3.2 3B runs at 20-40 tok/s on a modern laptop CPU (Ryzen 5/Intel i5) — a 5,000-word article summarizes in 30-60 seconds. Any $300 laptop handles light summarization. For heavier use, a used GTX 1060 6 GB ($60) runs Qwen 2.5 7B at 40-60 tok/s — a long report summarizes in 10-15 seconds. If you summarize 100+ documents/day, an RTX 3060 12 GB ($200-250) is worth it for throughput.
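To measure your own numbers rather than trusting these, Ollama's `--verbose` flag prints timing stats after each response; the file name here is a placeholder:

```bash
ollama run llama3.2:3b --verbose "Summarize in 3 bullet points: $(cat article.txt)"
# The two stats that matter in the output:
#   prompt eval rate -> prefill speed (how fast it reads the document)
#   eval rate        -> decode speed (how fast it writes the summary)
```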
Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Runs Qwen 2.5 32B at 40-60 tok/s and, with its 128K context window, handles 100K-token documents in a single pass. Llama 3.3 70B Q4_K_M at 15-25 tok/s for highest-quality summarization. Can process 500+ documents/hour in batch. Pair with Ryzen 7 7700X + 64 GB DDR5 + 2 TB NVMe. Total: ~$1,800-2,200. For enterprise document summarization at scale, use vLLM with continuous batching; throughput jumps 3-5× vs. single-request inference.
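A minimal sketch of that vLLM setup on a single 3090; the model ID (an AWQ-quantized Qwen build) and every flag value are assumptions to adapt, not a tested config:

```bash
pip install vllm

# Continuous batching is vLLM's default scheduler: concurrent requests
# share each forward pass instead of queuing one at a time.
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
# Serves an OpenAI-compatible API at http://localhost:8000/v1;
# fire summarization requests concurrently and let the scheduler batch them.
```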
The mistake: Pasting a 50-page document into a 3B model with a 4K context window and wondering why the summary is incoherent or cuts off mid-sentence.

Why it fails: Anything past the context window is silently truncated. Llama 3.2 3B as Ollama serves it by default reads about 4K tokens (~3,000 words), so the model literally never sees anything past the first few pages.

The fix: Check the context window your model actually runs with, not just what it advertises. For documents over 4K tokens, use a model with a 32K+ window (Llama 3.1 8B, Qwen 2.5 7B, and Mistral Nemo 12B all support 128K) and raise the runtime's context setting to match (see the num_ctx note above). For documents exceeding even 128K tokens, use a map-reduce chain: split into chunks, summarize each chunk, then summarize the summaries; a sketch follows below. Never assume the model can read your entire document; verify the token count first.
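A minimal map-reduce sketch under the quick-start setup, assuming GNU split and the llama3.2:3b model; the chunk count, prompts, and token-per-word ratio are rough assumptions:

```bash
#!/usr/bin/env bash
# Map-reduce summarization sketch. Chunk count and prompts are
# illustrative, not tuned; file names are placeholders.
set -euo pipefail

doc="long_report.txt"
model="llama3.2:3b"

# Verify the size first: tokens ~= words * 1.3 for English prose.
echo "~$(( $(wc -w < "$doc") * 13 / 10 )) tokens"

# Map: split into 10 line-balanced chunks and summarize each one.
mkdir -p chunks && split -d -n l/10 "$doc" chunks/part_
: > summaries.txt
for f in chunks/part_*; do
  ollama run "$model" "Summarize in 5 bullet points:
$(cat "$f")" >> summaries.txt
done

# Reduce: summarize the concatenated partial summaries.
ollama run "$model" "Merge these partial summaries into 3 paragraphs:
$(cat summaries.txt)"
```

Each chunk must itself fit the model's window, so size chunks well below the context limit to leave room for the prompt and the output.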
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
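Those constraints are back-of-envelope checkable. Decode speed, for example, is bounded by bandwidth divided by model size; the figures below (RTX 3060 bandwidth, a Q4 7B model) are rough assumptions:

```bash
# Decode ceiling ~= memory bandwidth / bytes read per token ~= bandwidth / model size
# RTX 3060: ~360 GB/s; Qwen 2.5 7B at Q4_K_M: ~4.7 GB of weights
echo "360 / 4.7" | bc -l   # ~76 tok/s theoretical ceiling; real-world lands below this
```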
The errors most operators hit when running summarization locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle summarization before committing money.