Natural Language Processing (NLP)

Natural Language Processing (NLP) is the field of AI concerned with enabling computers to understand, interpret, and generate human language. In local AI, NLP tasks include text generation, translation, summarization, and sentiment analysis. Operators encounter NLP mainly through large language models (LLMs) such as Llama or Mistral, which split text into tokens and process them with transformer architectures. The practical constraint is VRAM: both the model weights and the context window (the KV cache) must fit in memory. A 70B-parameter model needs roughly 40 GB for weights alone at 4-bit quantization, so full GPU inference calls for 48 GB or more.
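The VRAM constraint can be estimated with back-of-the-envelope arithmetic. The sketch below is illustrative, not a measurement: it assumes an average of about 4.85 bits per weight for a 4-bit K-quant, and a Llama 3.1 8B-style shape (32 layers, 8 key/value heads, head dimension 128) with an FP16 KV cache. The function names are hypothetical, and real runtimes add compute buffers and other overhead on top of these figures.

```python
GIB = 1024 ** 3

def weight_bytes(n_params, bits_per_weight):
    """Approximate memory for model weights at a given average bit width."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value

# Assumed figures (see the paragraph above); adjust for the model actually in use.
weights_8b = weight_bytes(8.0e9, 4.85)
weights_70b = weight_bytes(70.0e9, 4.85)
cache_32k = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                           context_tokens=32 * 1024)

print(f"8B weights:    {weights_8b / GIB:.1f} GiB")
print(f"70B weights:   {weights_70b / GIB:.1f} GiB")
print(f"32K KV cache:  {cache_32k / GIB:.1f} GiB")  # 4.0 GiB under these assumptions
```

The estimate shows why a 70B model does not fit on a single 24 GB card even when quantized, while an 8B model leaves headroom for a long context.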

Deeper dive

NLP has evolved from rule-based systems to statistical methods and now to deep learning, particularly transformers. For local operators, the most relevant NLP tasks are text generation (chatbots, code completion), classification (spam detection), and retrieval-augmented generation (RAG). Models are typically quantized (e.g., Q4_K_M) to fit consumer hardware. The key building blocks are tokenization (splitting text into tokens), embeddings (mapping tokens to vectors), and attention mechanisms (weighting how much each token should influence the others). Operators fine-tune models with LoRA or QLoRA for domain-specific tasks. The field also covers speech-to-text (Whisper) and text-to-speech (Bark), which run locally with moderate VRAM (~4-8 GB, depending on model size).
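The three building blocks can be sketched end to end in a few lines. This is a toy illustration under loud assumptions: the vocabulary and 2-dimensional embedding table are made up, a whitespace split stands in for a real subword tokenizer, and the attention step uses the embeddings directly as queries, keys, and values instead of the learned projections a real transformer applies.

```python
import math

# Hypothetical toy vocabulary and embedding table (illustrative values only).
VOCAB = {"local": 0, "models": 1, "run": 2, "offline": 3}
EMBEDDINGS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, -0.5],
]

def tokenize(text):
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return [VOCAB[word] for word in text.lower().split()]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores of one query against every key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def self_attention(vectors):
    """Replace each vector with an attention-weighted mix of all vectors."""
    outputs = []
    for query in vectors:
        weights = attention_weights(query, vectors)
        mixed = [sum(w * vec[d] for w, vec in zip(weights, vectors))
                 for d in range(len(query))]
        outputs.append(mixed)
    return outputs

token_ids = tokenize("local models run offline")   # text -> token ids
vectors = [EMBEDDINGS[i] for i in token_ids]        # token ids -> vectors
contextual = self_attention(vectors)                # vectors -> context-aware vectors
```

Each row of contextual is a blend of all four input vectors, with more weight on the tokens whose embeddings align with the query.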

Practical example

An operator running Llama 3.1 8B on an RTX 4090 (24 GB VRAM) uses NLP for real-time chat. At Q4 quantization, the weights take about 5 GB, leaving room for a 32K context window (roughly 4 GB of FP16 KV cache plus runtime buffers, so budgeting about 8 GB is safe). Generation runs at ~40 tok/s. A 70B model on the same card would need partial offloading to system RAM, dropping throughput to ~3 tok/s. Tasks such as summarizing a 10-page document require context management when the text exceeds the configured context length; operators often chunk the text and use overlapping sliding windows to stay within VRAM limits.
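Chunking with a sliding window can be sketched as follows. This is a minimal illustration, assuming the text has already been turned into a list of tokens; the function name, chunk size, and overlap are hypothetical, and the whitespace split in the usage line is only a stand-in for the model's real tokenizer.

```python
def sliding_chunks(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Usage: whitespace split as a rough proxy for real tokenization.
words = "a long document would be split into overlapping windows like this".split()
for chunk in sliding_chunks(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

The overlap repeats a few tokens at each boundary so that a sentence cut by one window still appears whole in the next; each chunk can then be summarized separately and the partial summaries combined.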

Workflow example

In Ollama, an operator runs ollama run llama3.1:8b, which downloads the model if needed and opens an interactive session; the Ollama background server performs the inference and also exposes an HTTP API. The model loads into VRAM, and the user sends prompts via the CLI or the API. For RAG, they run ollama pull nomic-embed-text to fetch an embedding model, embed their documents, store the vectors in a vector database such as Chroma, and query it at prompt time. In LM Studio, operators load a model, adjust the context length (e.g., 4096 tokens), and monitor VRAM usage in the UI. For fine-tuning, they use Unsloth or Axolotl with LoRA to adapt a model to domain-specific data (e.g., legal documents).
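Sending a prompt through the API might look like the sketch below. It assumes an Ollama server is already running on its default address (http://localhost:11434) with llama3.1:8b pulled, and it uses the /api/generate endpoint with streaming disabled; the helper names are hypothetical, and only the Python standard library is used.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address

def build_generate_request(prompt, model="llama3.1:8b", base_url=OLLAMA_URL):
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="llama3.1:8b", base_url=OLLAMA_URL, timeout=120):
    """Send the prompt and return the generated text (requires a running server)."""
    request = build_generate_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]

# Example call, left commented out because it needs a live server:
# print(generate("Summarize the idea of tokenization in one sentence."))
```

Keeping the request construction separate from the network call makes it easy to inspect the payload without a server running.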

Reviewed by Fredoline Eruo. See our editorial policy.
