HOW-TO · RAG
How to Batch Ingest Documents into a Vector Database
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
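Vector database available (this guide uses Chroma with an on-disk store at /data/chroma_store), documents prepared as .txt files under /data/docs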
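A quick preflight can confirm that the documents are in place before any embedding work starts. The sketch below counts .txt files recursively, which mirrors the `**/*.txt` glob used in the steps of this guide. The function name `count_text_files` is hypothetical; `/data/docs` is the path this guide uses.

```python
from pathlib import Path


def count_text_files(root: str, pattern: str = "*.txt") -> int:
    """Count files under root (searched recursively) whose name matches pattern."""
    base = Path(root)
    if not base.is_dir():
        # A missing directory is reported as zero files rather than an error.
        return 0
    return sum(1 for p in base.rglob(pattern) if p.is_file())


if __name__ == "__main__":
    n = count_text_files("/data/docs")
    print(f"Found {n} .txt files under /data/docs")
```

A result of 0 here means the loader will also produce no documents, so it is worth fixing the path before continuing.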
What this does
A batch ingestion pipeline loads a collection of documents, splits them into chunks, embeds each chunk with a local model served by Ollama, and writes the resulting vectors alongside their text and metadata into a vector database. Batching amortizes per-request embedding overhead and lets each group of chunks be written to the store in one call.
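To make the batching idea concrete, here is a minimal sketch of a helper that slices a list of chunks into fixed-size groups. The name `batched` and the batch size of 3 are illustrative assumptions, not part of any library used in this guide; a real run would pick a size that suits the embedding model and the store.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


if __name__ == "__main__":
    chunks = [f"chunk-{i}" for i in range(7)]  # stand-in for split documents
    for n, batch in enumerate(batched(chunks, batch_size=3), start=1):
        # In a real pipeline, one batch would be embedded and written here.
        print(f"batch {n}: {len(batch)} items")
```

Writing one batch at a time also bounds how much work is lost if a run is interrupted, since earlier batches are already in the store.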
Steps
1. Start Ollama and pull the embedding model.
# Skip "ollama serve" if Ollama already runs as a systemd service.
ollama serve &
ollama pull nomic-embed-text
2. Configure the embedding model and chunking.
from langchain_ollama import OllamaEmbeddings
# Current LangChain releases ship the splitters in langchain_text_splitters.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Embeddings are computed locally by the model pulled in step 1.
embed_model = OllamaEmbeddings(model="nomic-embed-text")

# Load every .txt file under /data/docs, including subdirectories.
loader = DirectoryLoader("/data/docs", glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()

# chunk_size and chunk_overlap are measured in characters.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")
3. Connect to the vector store and ingest.
import chromadb
from langchain_chroma import Chroma

# PersistentClient stores the collection on disk at the given path.
client = chromadb.PersistentClient(path="/data/chroma_store")
vectorstore = Chroma(
    client=client,
    collection_name="knowledge_base",
    embedding_function=embed_model,
)

# add_documents embeds each chunk and returns the IDs it stored.
uuids = vectorstore.add_documents(documents=chunks)
print(f"Indexed {len(uuids)} vectors")
4. Verify the count and query.
collection = client.get_collection("knowledge_base")
print(f"Total vectors: {collection.count()}")

# Fetch the two chunks closest to the query and preview their text.
results = vectorstore.similarity_search("machine learning", k=2)
print(f"Results: {[r.page_content[:60] for r in results]}")
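The interaction between chunk_size=500 and chunk_overlap=50 is easier to see with a toy splitter. The sketch below is a deliberately simplified fixed-width window, not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers paragraph and sentence boundaries, so its real chunk edges will differ). The function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "a" * 1200  # stand-in for a 1200-character document
    print([len(c) for c in sliding_chunks(sample, chunk_size=500, chunk_overlap=50)])
    # prints [500, 500, 300]
```

With these settings each window advances 450 characters, so the last 50 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence cut at a boundary retrievable from either side.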
Verification
python -c "
import chromadb
c = chromadb.PersistentClient(path='/data/chroma_store')
col = c.get_collection('knowledge_base')
print('Vector count:', col.count())
"
# Expected: "Vector count: N" with N greater than 0
Common failures
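- ConnectionError to Ollama. Ollama is not running. Confirm with curl http://localhost:11434/api/tags.
- Duplicate vectors after re-run. Pass stable ids to add_documents so a re-run overwrites existing entries instead of appending new ones, or drop the collection with client.delete_collection("knowledge_base") before re-ingesting. An empty filter such as col.delete(where={}) is not a reliable way to clear a collection.
- 0 chunks produced. The directory loader glob matched nothing. Check with find /data/docs -name '*.txt'; the glob is recursive, so ls /data/docs/*.txt would miss files in subdirectories.
- Chunk size exceeds embedding context. Keep each chunk at 512 tokens or fewer. Note that chunk_size counts characters by default, so a 500-character chunk is already well below that limit.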
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
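For the duplicate-vectors failure, one common approach is to derive each chunk's ID from its content so that re-ingesting the same chunk targets the same entry. The sketch below shows only the ID derivation; the function name `chunk_id` and the choice of SHA-256 over source path plus text are assumptions for illustration, not something the libraries in this guide prescribe.

```python
import hashlib


def chunk_id(source: str, text: str) -> str:
    """Derive a stable hex ID from a chunk's source path and its text."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    first = chunk_id("/data/docs/intro.txt", "Vector stores index embeddings.")
    again = chunk_id("/data/docs/intro.txt", "Vector stores index embeddings.")
    print(first == again)  # prints True: same input, same ID
```

The resulting list would be passed as the ids argument when adding documents, for example ids=[chunk_id(c.metadata.get("source", ""), c.page_content) for c in chunks]. Whether that overwrites rather than duplicates depends on the installed langchain_chroma version writing with upsert semantics, so it is worth confirming by running the ingest twice and comparing the vector count. Identical chunks from the same file produce the same ID, so they may need deduplicating before the write.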
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
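The checkpoint can be captured with a short script rather than by hand. The sketch below records the interpreter, the platform, and the installed versions of the Python packages this guide imports. The function name `runtime_snapshot` is hypothetical, and the distribution names in the example call are assumed to match what pip installed in your environment.

```python
import json
import platform
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional


def runtime_snapshot(packages: Iterable[str]) -> Dict[str, object]:
    """Collect interpreter, OS, and installed package versions."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "executable": sys.executable,  # shows which virtual environment is active
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    snapshot = runtime_snapshot(
        ["chromadb", "langchain-chroma", "langchain-ollama", "langchain-community"]
    )
    print(json.dumps(snapshot, indent=2))
```

Saving this output next to the result of ollama --version and the vector count from the Verification section gives a record that can be compared against a later run.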