RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
Glossary / Natural language processing / Question Answering
Natural language processing

Question Answering

Question answering (QA) is a natural language processing task where a model receives a question and returns a concise answer, often extracted from or generated based on provided context. In local AI, QA is implemented by prompting a large language model (LLM) with the question and optional context (e.g., a document or knowledge base). The model outputs an answer as text. Operators encounter QA when using retrieval-augmented generation (RAG) pipelines, where the model answers questions based on retrieved documents. QA performance depends on model size, context length, and the quality of retrieved context. For local models, VRAM constraints limit context size, affecting the amount of information the model can consider.

Deeper dive

QA can be divided into two main types: extractive and generative. Extractive QA selects a span of text from the provided context as the answer, common in older models like BERT. Generative QA, used by modern LLMs, produces free-form text based on the question and context. In local AI, generative QA is the norm. Operators often implement QA via RAG: they index documents into a vector database, retrieve relevant chunks for a question, and feed them as context to an LLM. The model then generates an answer. Key considerations include context window size (e.g., 4K, 8K, 128K tokens) and retrieval quality. Smaller models (e.g., 7B parameters) may struggle with complex reasoning, while larger models (e.g., 70B) require more VRAM. Quantization (e.g., Q4_K_M) reduces model size but may slightly degrade answer quality. Latency varies: a 7B Q4 model on an RTX 4090 can answer in ~1-2 seconds, while a 70B model may take 10-20 seconds.

Practical example

An operator runs a local RAG system using Ollama with Llama 3.1 8B (Q4_K_M) on an RTX 3090 (24 GB VRAM). They upload a PDF manual, which is chunked and embedded into a Chroma vector database. When they ask "What is the maximum operating temperature?", the system retrieves the top 3 chunks (total ~1500 tokens) and feeds them as context to the model. The model generates: "The maximum operating temperature is 85°C." The entire pipeline takes ~3 seconds.

Workflow example

In LM Studio, an operator loads a model (e.g., Mistral 7B) and enables the RAG plugin. They point the plugin to a folder of text files. When they type a question in the chat interface, LM Studio embeds the question, retrieves relevant chunks from the files, and prepends them to the prompt. The model then generates an answer. The operator can adjust the number of retrieved chunks and the context length in the settings. In llama.cpp, they might use the --retrieval flag with a vector database server.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides
  • Best GPU for local AI →
  • Best laptop for local AI →
  • Best Mac for local AI →
When it doesn't work
  • CUDA out of memory →
  • Ollama running slowly →
  • ROCm not detected →