What this does

This batch ingestion pipeline loads a directory of documents, splits them into chunks, embeds each chunk with a local model served by Ollama, and writes the resulting vectors, together with their text and metadata, to a Chroma vector store. Processing chunks in batches amortizes per-request embedding overhead and groups writes so that a failed batch can be retried as a unit.

Steps

Start Ollama and pull the embedding model.

ollama serve &
ollama pull nomic-embed-text
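
Before moving on, it can help to confirm from Python that the server is reachable and the model was pulled. The following is a minimal sketch using only the standard library. The helper name ollama_models is hypothetical, the default port 11434 is assumed, and the response shape (a "models" list whose entries carry a "name" field) is an assumption about the /api/tags endpoint that should be checked against the installed Ollama version.

```python
import json
import urllib.error
import urllib.request


def ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the model names Ollama reports, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None
    # Assumption: the body is a JSON object with a "models" list of {"name": ...}.
    return [m.get("name", "") for m in payload.get("models", [])]
```

Calling ollama_models() returns None when nothing answers on that address. Otherwise, look for a name that starts with nomic-embed-text in the returned list.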

Configure the embedding model, load the documents, and split them into chunks.

from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader

embed_model = OllamaEmbeddings(model="nomic-embed-text")
loader = DirectoryLoader("/data/docs", glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")
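
To make chunk_size and chunk_overlap concrete, here is a simplified illustration in plain Python. It is not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break on separators such as paragraphs and spaces, so its chunks are often shorter than chunk_size. The sketch only shows that both settings are measured in characters and that consecutive chunks share chunk_overlap characters. The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])  # [500, 500, 300]
```

With a 1200-character input, each window starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.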

Connect to the vector store and ingest.

import chromadb
from langchain_chroma import Chroma

client = chromadb.PersistentClient(path="/data/chroma_store")
vectorstore = Chroma(
    client=client,
    collection_name="knowledge_base",
    embedding_function=embed_model,
)
uuids = vectorstore.add_documents(documents=chunks)
print(f"Indexed {len(uuids)} vectors")
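
The single add_documents call above sends every chunk in one request. For larger collections, one option is to feed the store in fixed-size batches so that progress is visible and a failure only affects the batch in flight. The sketch below assumes only that the store exposes add_documents(documents=...) and returns a list of ids, as in the step above. The helper names batched and ingest_in_batches and the batch size of 100 are illustrative choices, not part of any library.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest_in_batches(store, chunks, batch_size=100):
    """Call store.add_documents once per batch and collect the returned ids."""
    all_ids = []
    for number, batch in enumerate(batched(chunks, batch_size), start=1):
        ids = store.add_documents(documents=batch)
        all_ids.extend(ids)
        print(f"batch {number}: indexed {len(ids)} chunks")
    return all_ids
```

With the objects from this guide, the call would be uuids = ingest_in_batches(vectorstore, chunks).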

Verify the count and query.

collection = client.get_collection("knowledge_base")
print(f"Total vectors: {collection.count()}")
results = vectorstore.similarity_search("machine learning", k=2)
print(f"Results: {[r.page_content[:60] for r in results]}")

Verification

python -c "
import chromadb
c = chromadb.PersistentClient(path='/data/chroma_store')
col = c.get_collection('knowledge_base')
print('Vector count:', col.count())
"
# Expected: "Vector count: N", where N is greater than 0

Common failures

ConnectionError to Ollama. Ollama not running. Confirm with curl http://localhost:11434/api/tags.
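Duplicate vectors after re-run. Without explicit ids, add_documents assigns fresh random IDs on every run, so re-ingesting appends copies. Pass stable ids (add_documents(documents=chunks, ids=...)) so that re-runs overwrite existing entries, or drop the collection with client.delete_collection("knowledge_base") before re-ingesting.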
0 chunks produced. The DirectoryLoader glob matched no files. The glob **/*.txt is recursive, so check with find /data/docs -name "*.txt".
Chunk size exceeds embedding context. Reduce chunk_size so that each chunk stays at 512 tokens or fewer. Note that chunk_size counts characters by default, not tokens.
Version mismatch. The installed package or runtime differs from the command shown. Check the version first and rerun the smallest verification command.
Local environment drift. Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
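
For the duplicate-vector failure above, one way to get stable ids is to derive them from each chunk's source path, position, and text. This is a sketch under assumptions: the helper names chunk_id and ids_for_chunks are hypothetical, each chunk is expected to expose page_content and a metadata dict with a "source" key (as TextLoader sets it), and the store is assumed to overwrite an entry when an existing id is written again, which should be confirmed against the installed langchain_chroma version.

```python
import hashlib


def chunk_id(source, index, text):
    """Derive a stable id from the source path, chunk position, and chunk text."""
    digest = hashlib.sha256()
    for part in (source, str(index), text):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return digest.hexdigest()


def ids_for_chunks(chunks):
    """Build one id per chunk; expects objects with .page_content and .metadata."""
    ids = []
    for index, chunk in enumerate(chunks):
        source = str(chunk.metadata.get("source", ""))
        ids.append(chunk_id(source, index, chunk.page_content))
    return ids
```

The ids would then be passed as vectorstore.add_documents(documents=chunks, ids=ids_for_chunks(chunks)). Because the position is part of the id, adding or removing a file shifts the positions of later chunks and gives them new ids, so this only prevents duplicates when the input set is unchanged between runs.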

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
