RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
LangChain for Local AI

14. Simple RAG Pipeline

Chapter 14 of 18 · 20 min
KEY INSIGHT

RAG pipelines separate knowledge storage (vector database) from knowledge application (LLM generation), enabling accurate responses grounded in your documents.

Retrieval-Augmented Generation (RAG) combines document retrieval with LLM generation. The pipeline has four stages: embed the documents, store the embeddings in a vector database, retrieve the chunks most relevant to the user's query, and pass those chunks to the LLM along with the question.
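
Before wiring in real components, the stages listed above can be sketched with the standard library alone. This is a toy illustration, not the course's implementation: embed() is a hypothetical bag-of-words stand-in for an embedding model, the store is a plain Python list rather than a vector database, the sample chunks are invented, and the final LLM call is left out so the script runs without Ollama.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list, k: int = 2) -> list:
    """Return the k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list) -> str:
    """Place the retrieved chunks and the question into one prompt string."""
    context = "\n".join(context_chunks)
    return f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"


# Stages 1 and 2: embed each chunk and keep (text, vector) pairs in a store.
sample_chunks = [  # hypothetical sample data
    "Chapter 3 assignment: build a prompt template with two variables.",
    "Chapter 5 covers output parsers and structured responses.",
    "Office hours are held on Thursdays.",
]
store = [(text, embed(text)) for text in sample_chunks]

# Stage 3: retrieve the closest chunk. Stage 4: build the prompt for the LLM.
question = "What was the assignment from chapter 3?"
top = retrieve(question, store, k=1)
rag_prompt = build_prompt(question, top)
print(rag_prompt)
```

The rest of the chapter replaces each toy piece with a real one: OllamaEmbeddings for embed(), Chroma for the list, and ChatOllama for the generation step that this sketch omits.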

First, set up the embedding and vector store.

from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)

vectorstore = Chroma(
    collection_name="course_content",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

Load and chunk documents, then add to the vector store.

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("./course_notes.txt")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

vectorstore.add_documents(chunks)
print(f"Indexed {len(chunks)} chunks")
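
To see what chunk_size and chunk_overlap control, here is a simplified fixed-window splitter written for illustration. It is not the algorithm RecursiveCharacterTextSplitter uses (that class prefers to break on separators such as blank lines, newlines, and spaces, so real chunk lengths vary); sliding_chunks is a hypothetical helper that only shows how overlap repeats the tail of one chunk at the head of the next.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each piece after the first begins with the last three characters of the previous one. With chunk_size=500 and chunk_overlap=50, the same idea keeps a sentence that straddles a boundary visible in both neighbouring chunks.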

Now build the retrieval and generation chain.

from langchain_ollama import ChatOllama
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

llm = ChatOllama(model="llama3.2", base_url="http://localhost:11434")

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

prompt = PromptTemplate.from_template("""
Use the following context to answer the question.

Context: {context}

Question: {question}

Answer:""")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",  # Stuff all retrieved docs into one prompt
    chain_type_kwargs={"prompt": prompt},  # Use the custom prompt defined above
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What was the assignment from chapter 3?"})
print(result["result"])
print(f"\nSources: {[doc.metadata for doc in result['source_documents']]}")

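The chain_type="stuff" parameter concatenates all retrieved documents into a single prompt. This works well for small retrieval sets (roughly up to 4 documents at this chunk size), as long as the combined text fits in the model's context window. For larger retrieval sets, use map_reduce, which runs the LLM over each document separately and then combines the intermediate answers.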
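
The trade-off between the two strategies can be sketched without a model. The code below is a simplified illustration under stated assumptions, not LangChain's implementation: CountingLLM is a hypothetical stand-in that records prompt sizes instead of generating text, and the stuff() and map_reduce() functions are hand-written approximations of what the corresponding chain types do.

```python
from typing import List


class CountingLLM:
    """Hypothetical stand-in for a model: records the length of every prompt."""

    def __init__(self) -> None:
        self.prompt_lengths: List[int] = []

    def __call__(self, prompt: str) -> str:
        self.prompt_lengths.append(len(prompt))
        return f"partial answer {len(self.prompt_lengths)}"


def stuff(llm, docs: List[str], question: str) -> str:
    """One call: every document is concatenated into a single context."""
    context = "\n\n".join(docs)
    return llm(f"Context: {context}\n\nQuestion: {question}\n\nAnswer:")


def map_reduce(llm, docs: List[str], question: str) -> str:
    """One call per document (map), then one call to combine them (reduce)."""
    partials = [
        llm(f"Context: {doc}\n\nQuestion: {question}\n\nAnswer:") for doc in docs
    ]
    combined = "\n\n".join(partials)
    return llm(
        f"Partial answers:\n{combined}\n\nQuestion: {question}\n\nFinal answer:"
    )


docs = ["x" * 500 for _ in range(6)]  # six hypothetical 500-character chunks
question = "What was the assignment from chapter 3?"

stuff_llm = CountingLLM()
stuff(stuff_llm, docs, question)

mr_llm = CountingLLM()
map_reduce(mr_llm, docs, question)

print("stuff calls:", len(stuff_llm.prompt_lengths),
      "largest prompt:", max(stuff_llm.prompt_lengths))
print("map_reduce calls:", len(mr_llm.prompt_lengths),
      "largest prompt:", max(mr_llm.prompt_lengths))
```

The stuff approach makes a single call whose prompt grows with every retrieved document, while map_reduce keeps each prompt small at the cost of one extra call per document plus a final combining call. On local hardware, that means trading context-window pressure for total generation time.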

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
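
One way to capture the package versions and runtime for this checkpoint is a short script like the one below. It is a sketch using only the standard library; the PACKAGES list is an assumption based on the imports used in this chapter and should be adjusted to match what is actually installed.

```python
import json
import platform
from importlib import metadata

# Assumed distribution names for the imports used in this chapter.
PACKAGES = [
    "langchain",
    "langchain-ollama",
    "langchain-chroma",
    "langchain-community",
    "chromadb",
]


def environment_record(packages: list) -> dict:
    """Collect the Python runtime, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


record = environment_record(PACKAGES)
print(json.dumps(record, indent=2))
```

Saving this output next to the exercise notes, together with the model name and data path, makes it easier to explain a different result on another machine.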

EXERCISE

Create a RAG pipeline using three separate text files as sources. Query it with a question answerable by only one file, then verify the source documents come from that file.
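
For the verification step, a small helper can compare the files named in the returned documents against the one you expect. This is a sketch under two assumptions: the documents were loaded with TextLoader, which records the file path under the "source" metadata key, and the chain was built with return_source_documents=True. The helper names and the fake_result data are hypothetical, so the check can be tried without a running model.

```python
from pathlib import Path
from types import SimpleNamespace


def source_files(result: dict) -> set:
    """Collect the file names recorded in each returned document's metadata."""
    return {
        Path(doc.metadata.get("source", "<unknown>")).name
        for doc in result["source_documents"]
    }


def all_from(result: dict, expected_file: str) -> bool:
    """True when every returned document came from the expected file."""
    return source_files(result) == {Path(expected_file).name}


# Hypothetical result shaped like the dict returned by qa_chain.invoke(...).
fake_result = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "./notes/week1.txt"}),
        SimpleNamespace(metadata={"source": "./notes/week1.txt"}),
    ],
}

print(source_files(fake_result))
print(all_from(fake_result, "week1.txt"))
```

Because the retriever returns k chunks whether or not they are relevant, a mixed set of sources may simply mean the target file produced fewer than k chunks. Lowering k or inspecting which chunk ranks first helps tell that apart from a genuine retrieval miss.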
