Natural language processing

Text Summarization

Text summarization is a natural language processing task in which a model generates a shorter version of a longer text while preserving its key information. Operators encounter it as a local AI use case for condensing articles, documents, or chat logs. Models like Llama 3.1 8B or Mistral 7B can run summarization on consumer hardware, but output quality and speed depend on context length and available VRAM. Shorter input contexts (e.g., 2K tokens) fit easily on 8 GB GPUs; longer documents (e.g., 8K+ tokens) need more VRAM for the KV cache and may require offloading to system RAM, which lowers tokens/sec.
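To make the link between context length and VRAM concrete, here is a rough back-of-the-envelope sketch. The architecture defaults (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions modeled on a Llama-3.1-8B-style model, the ~5 GB weight figure is the Q4_K_M size quoted elsewhere on this page, and the 0.5 GiB runtime overhead is a placeholder, not a measurement.

```python
# Rough VRAM estimate for summarization at different context lengths.
# All defaults are assumptions (Llama-3.1-8B-style architecture, 16-bit
# KV cache); check the config of the model you actually run.

GIB = 2**30

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens."""
    # Factor 2: one key vector and one value vector per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

def estimate_vram_gib(n_tokens, weights_gib=5.0, overhead_gib=0.5):
    """Weights + KV cache + a flat (assumed) allowance for runtime buffers."""
    return weights_gib + overhead_gib + kv_cache_bytes(n_tokens) / GIB

if __name__ == "__main__":
    for ctx in (2048, 8192, 32768):
        print(f"{ctx:>6} tokens: ~{estimate_vram_gib(ctx):.2f} GiB")
```

Under these assumptions the cache grows linearly with context, so the estimate is most useful for spotting the context length at which a given card runs out of headroom. Real usage also depends on the runtime, batch size, and whatever else is using the GPU.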

Deeper dive

Text summarization comes in two main types: extractive and abstractive. Extractive summarization selects key sentences from the original text, while abstractive summarization generates new sentences that paraphrase the content. Most modern local models (e.g., Llama, Mistral, Phi) summarize abstractively, using their generative capabilities to write the summary. The task is usually framed as a prompt: 'Summarize the following text in 3 sentences.' Output quality depends on model size, quantization, and prompt engineering. For local operators, summarization is sensitive to context window limits, because the window must hold the instruction, the input text, and the generated summary: a model with 8K context can handle a ~6K token article, but a 4K context model may truncate the input. Quantization (e.g., Q4_K_M) reduces VRAM usage but may slightly degrade summary coherence. Operators typically judge summarization quality by comparing outputs for the same input across different models or quantization levels.
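The context-window check described above can be sketched as follows. This is a minimal illustration, not a real tokenizer: the 4-characters-per-token ratio is an assumed English average, the 256-token reserve for the summary is arbitrary, and the helper names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token count; 4 chars/token is an assumed English average."""
    return int(len(text) / chars_per_token) + 1

def build_prompt(article, n_sentences=3):
    """Frame the task as an instruction followed by the text to summarize."""
    return f"Summarize the following text in {n_sentences} sentences.\n\n{article}"

def fits_context(prompt, context_window, reserve_for_summary=256):
    """True if the prompt plus room for the generated summary fits the window."""
    return estimate_tokens(prompt) + reserve_for_summary <= context_window

# ~24,000 characters, roughly 6K tokens by the heuristic above.
article = "word " * 4800
prompt = build_prompt(article)

print("fits 8K window:", fits_context(prompt, 8192))
print("fits 4K window:", fits_context(prompt, 4096))
```

For a real decision, count tokens with the model's own tokenizer, since the ratio varies by language and content.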

Practical example

On an RTX 3090 (24 GB VRAM), Llama 3.1 8B at Q4_K_M (~5 GB) can summarize a 4K-token article in ~10 seconds, generating at ~40 tok/s. On an 8 GB GPU the same model still fits at this context length, but longer contexts or larger quantizations push layers into system RAM, and generation can drop to ~5 tok/s. A 70B model does not fit in 24 GB even at Q4, so it must be partly offloaded, which makes summarizing long documents impractical.
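The timing in this example can be approximated as prompt processing time plus generation time. The sketch below uses the generation speeds quoted above (~40 and ~5 tok/s); the 200-token summary length and the prompt-processing speeds (1000 and 100 tok/s) are illustrative assumptions, not benchmarks.

```python
def summarize_seconds(prompt_tokens, summary_tokens, prompt_tok_per_s, gen_tok_per_s):
    """Prompt processing time plus generation time, in seconds."""
    return prompt_tokens / prompt_tok_per_s + summary_tokens / gen_tok_per_s

# Assumed: 4000-token article, 200-token summary.
# Prompt-processing speeds are placeholders; measure them on your own hardware.
in_vram = summarize_seconds(4000, 200, prompt_tok_per_s=1000, gen_tok_per_s=40)
offloaded = summarize_seconds(4000, 200, prompt_tok_per_s=100, gen_tok_per_s=5)

print(f"fully in VRAM: ~{in_vram:.0f} s")
print(f"offloaded:     ~{offloaded:.0f} s")
```

With these placeholder numbers, the offloaded case is slower in both phases, which is why long inputs suffer more than the generation speed alone suggests.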

Workflow example

In LM Studio, an operator loads a model (e.g., Mistral 7B Q4), pastes a news article, and types 'Summarize this in 2-3 sentences.' The runtime processes the input within the context window; if the article exceeds the window, LM Studio truncates it. In Ollama, the command ollama run llama3.1:8b followed by a summarization prompt works similarly, but the default context window may be smaller than the article, so the operator must raise it (for example by setting OLLAMA_CONTEXT_LENGTH for the Ollama server) to avoid truncation.
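The Ollama workflow can also be scripted against its local HTTP API. The sketch below assumes a default Ollama install listening on localhost:11434 with llama3.1:8b already pulled; it sets the context size per request through the num_ctx option rather than the environment variable. The function names are hypothetical, and the 8192 value is only an example.

```python
import json
import urllib.request

# Assumed default endpoint of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(article, model="llama3.1:8b", num_ctx=8192):
    """Request body for a single, non-streamed summarization call."""
    return {
        "model": model,
        "prompt": f"Summarize this in 2-3 sentences.\n\n{article}",
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # context window for this request
    }

def summarize(article, **kwargs):
    """Send the article to Ollama and return the generated summary text."""
    data = json.dumps(build_payload(article, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, a call such as summarize(open("article.txt").read()) would return the summary as a string; article.txt is a placeholder filename. A larger num_ctx raises VRAM use, so it is worth setting it only as high as the input needs.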

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work