Large language models

Vector Database

A vector database stores and retrieves data as high-dimensional vectors (embeddings) rather than rows or documents. In local AI, it enables semantic search: instead of matching keywords, it finds items whose embeddings are closest to a query embedding, using approximate nearest neighbor (ANN) algorithms. Operators encounter vector databases when building RAG (Retrieval-Augmented Generation) pipelines—they index document chunks as vectors, then retrieve relevant chunks for a language model to answer questions. Popular choices include Chroma, FAISS, and Qdrant, all runnable on local hardware.

Deeper dive

Vector databases are designed for similarity search on embeddings—numerical lists (e.g., 768 or 1536 dimensions) that capture semantic meaning. They index vectors using ANN methods like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to trade a small accuracy loss for massive speed gains over brute-force search. For local operators, the key constraint is memory: storing millions of vectors at 1536 dimensions each can consume gigabytes of RAM. Most vector databases support on-disk storage with memory-mapping to reduce RAM pressure. They integrate with embedding models (e.g., all-MiniLM-L6-v2 or nomic-embed-text-v1.5) that run locally via ONNX or llama.cpp. In a typical RAG workflow, documents are split into chunks, each chunk is embedded, and the embedding is stored in the vector DB. At query time, the query is embedded, the DB returns the top-k nearest chunks, and those chunks are fed into the LLM as context.

Practical example

A local RAG app indexes 10,000 PDF pages. Each page is embedded into a 384-dimensional vector using all-MiniLM-L6-v2 (~0.1 GB RAM for the model). The vector database (Chroma) stores these 10,000 vectors in a SQLite-backed index, consuming ~15 MB on disk. Querying for "budget forecast" returns the top-5 nearest pages in under 50 ms on a CPU, even without GPU acceleration.

Workflow example

In Ollama, you can run ollama pull nomic-embed-text to get an embedding model, then use a Python script with Chroma: chromadb.Client().create_collection("docs") and collection.add(embeddings=..., documents=...). At query time, embed the question with the same model, call collection.query(query_embeddings=[...], n_results=5), and pass the returned documents to ollama run llama3.1 as context. The whole pipeline runs locally with no cloud dependency.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work