Large language models

Grounding

Grounding connects a language model's output to verifiable external sources (documents, databases, APIs) to reduce hallucination. In local AI, operators implement grounding by supplying the model with retrieved context, often via retrieval-augmented generation (RAG), so that it cites specific passages rather than inventing facts. Grounding doesn't make the model more capable; it constrains the model to answer from supplied material, which matters most when running smaller models (e.g., 7B) that lack broad knowledge.

Deeper dive

Grounding addresses a core weakness of LLMs: they can generate plausible-sounding but incorrect statements. Feeding the model relevant text chunks (e.g., from a PDF or website) as part of the prompt anchors its output to those sources. In practice, grounding is usually implemented via retrieval-augmented generation (RAG): a retriever fetches relevant documents, and the model receives them as context. Operators running local models often use tools like LangChain or custom scripts to chunk documents, embed them, and retrieve the top-k passages for each query. The quality of grounding depends on two things: how accurately the retriever finds the right passages, and how reliably the model follows the instruction to stay within the provided context. Without grounding, even a well-quantized 13B model may hallucinate on niche topics.
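The retrieve-top-k step described above can be sketched in a few lines of standard-library Python. This is a toy illustration, not a production retriever: the embed function here is an assumption that substitutes a bag-of-words count vector for a real embedding model, and the function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy 'embedding': a bag-of-words count vector.

    Assumption: a real pipeline would call an embedding model here instead.
    """
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), i) for i, c in enumerate(chunks)]
    # Sort by descending score; ties keep the original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for _, i in scored[:k]]


# Hypothetical sample chunks, for illustration only.
chunks = [
    "To reset the pump, hold the red button for five seconds.",
    "The warranty covers parts and labor for two years.",
    "Replace the pump filter every six months.",
]
top = retrieve_top_k("How do I reset the pump?", chunks, k=2)
```

Swapping embed for a call to a real embedding model leaves the rest of the logic unchanged, which is why retrieval accuracy depends so heavily on the embedding step.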

Practical example

An operator runs a local 8B model (e.g., Llama 3.1 8B Q4) to answer questions about a 500-page technical manual. Without grounding, the model might invent procedures. With grounding, the operator uses a RAG pipeline: the manual is split into 512-token chunks, embedded with a local embedding model (e.g., all-MiniLM-L6-v2), and stored in a vector database (Chroma). For each query, the top 3 chunks are retrieved and prepended to the prompt. The model is instructed to answer based solely on those chunks, which in this example cuts the hallucination rate on factual questions from roughly 40% to under 10%.
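The chunking step in this example might look roughly like the sketch below. It is a simplification under a stated assumption: whitespace-separated words stand in for tokens, whereas a real pipeline would count tokens with the embedding model's own tokenizer. The chunk_text name and the synthetic manual text are hypothetical.

```python
def chunk_text(text, chunk_size=512):
    """Split text into consecutive chunks of at most chunk_size 'tokens'.

    Assumption: whitespace-separated words approximate tokens here;
    a model tokenizer would give different (usually higher) counts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Synthetic stand-in for the manual's text: 1200 placeholder words.
manual = " ".join(f"word{i}" for i in range(1200))
segments = chunk_text(manual, chunk_size=512)
```

Fixed-size chunks are the simplest option; they can split a procedure across two chunks, so operators sometimes split on headings or paragraphs instead.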

Workflow example

In Ollama, grounding is not built in; operators pair it with a RAG framework. A typical workflow: run ollama pull llama3.1:8b, then pull a separate embedding model. Use a Python script with LangChain to load a PDF, split it into chunks, embed the chunks with OllamaEmbeddings, and store them in Chroma. For each query, the retriever fetches the relevant chunks, and the script constructs a prompt like 'Answer using only the context below: [chunks] Question: ...' and then calls ollama.generate(). The operator can inspect which chunks were retrieved and check that the model's output stays faithful to them.
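The prompt-construction step can be sketched as follows. The helper names, the numbered-chunk layout, and the extra "say so" instruction are assumptions added for illustration, not part of any framework. The ask_local_model function assumes the ollama Python package is installed and an Ollama server is running with the model already pulled; it is defined but not called here.

```python
def build_grounded_prompt(question, chunks):
    """Build a prompt that restricts the model to the retrieved chunks.

    Chunks are numbered so the operator can see which passages were supplied.
    """
    numbered = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


def ask_local_model(prompt, model="llama3.1:8b"):
    """Send the prompt to a local Ollama server and return the generated text.

    Assumption: the `ollama` package is installed and the server is running.
    """
    import ollama  # imported lazily so the prompt builder works without it

    response = ollama.generate(model=model, prompt=prompt)
    return response["response"]


# Hypothetical retrieved chunks, for illustration only.
retrieved = [
    "To reset the pump, hold the red button for five seconds.",
    "Replace the pump filter every six months.",
]
prompt = build_grounded_prompt("How do I reset the pump?", retrieved)
```

Printing the prompt before sending it is a cheap way to confirm which chunks the model actually received when an answer looks wrong.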

Reviewed by Fredoline Eruo. See our editorial policy.
