Question Answering — AI glossary

Question answering (QA) is a natural language processing task where a model receives a question and returns a concise answer, often extracted from or generated based on provided context. In local AI, QA is implemented by prompting a large language model (LLM) with the question and optional context (e.g., a document or knowledge base). The model outputs an answer as text. Operators encounter QA when using retrieval-augmented generation (RAG) pipelines, where the model answers questions based on retrieved documents. QA performance depends on model size, context length, and the quality of retrieved context. For local models, VRAM constraints limit context size, affecting the amount of information the model can consider.

Deeper dive

QA can be divided into two main types: extractive and generative. Extractive QA selects a span of text from the provided context as the answer, common in older models like BERT. Generative QA, used by modern LLMs, produces free-form text based on the question and context. In local AI, generative QA is the norm. Operators often implement QA via RAG: they index documents into a vector database, retrieve relevant chunks for a question, and feed them as context to an LLM. The model then generates an answer. Key considerations include context window size (e.g., 4K, 8K, 128K tokens) and retrieval quality. Smaller models (e.g., 7B parameters) may struggle with complex reasoning, while larger models (e.g., 70B) require more VRAM. Quantization (e.g., Q4_K_M) reduces model size but may slightly degrade answer quality. Latency varies: a 7B Q4 model on an RTX 4090 can answer in ~1-2 seconds, while a 70B model may take 10-20 seconds.

Practical example

An operator runs a local RAG system using Ollama with Llama 3.1 8B (Q4_K_M) on an RTX 3090 (24 GB VRAM). They upload a PDF manual, which is chunked and embedded into a Chroma vector database. When they ask "What is the maximum operating temperature?", the system retrieves the top 3 chunks (total ~1500 tokens) and feeds them as context to the model. The model generates: "The maximum operating temperature is 85°C." The entire pipeline takes ~3 seconds.

Workflow example

In LM Studio, an operator loads a model (e.g., Mistral 7B) and enables the RAG plugin. They point the plugin to a folder of text files. When they type a question in the chat interface, LM Studio embeds the question, retrieves relevant chunks from the files, and prepends them to the prompt. The model then generates an answer. The operator can adjust the number of retrieved chunks and the context length in the settings. In llama.cpp, they might use the --retrieval flag with a vector database server.