COURSE · FND · B012

Vector Stores and Embeddings

Learn vector stores and embeddings through RunLocalAI's practical lens: embeddings, vector stores, ChromaDB and FAISS, hardware fit, runtime settings, verification habits, and local-vs-cloud tradeoffs.

18 chapters · 8h · Foundations track · By Fredoline Eruo
PREREQUISITES
  • B002
  • B011

Course B012: Vector Stores and Embeddings

Why this course exists

Traditional search matches keywords. You search "cat veterinarian" and get pages containing those exact words. But what if you want to find documents about "taking your pet to the animal doctor"? Keyword search fails because it relies on exact string matching.

Vector databases solve this by converting text into numbers—specifically, into points in high-dimensional space. Documents about similar topics cluster together. When you search, your query becomes a point, and the database finds the nearest neighbors.

This course builds the infrastructure for semantic search: converting text to vectors, storing those vectors efficiently, and retrieving relevant results at speed. You will implement this locally without external API dependencies.

What you will know after

You will understand how embeddings represent meaning as vectors. You will index thousands of documents in ChromaDB and FAISS. You will filter by metadata, optimize for speed, and build a working semantic search engine that handles 10,000+ documents with sub-100ms queries.

CHAPTERS
  1. 01 · What is an Embedding? · 20 min

An embedding converts text into a list of several hundred numbers (384 for the model used here) that capture semantic meaning: documents with similar meaning have similar numbers.

When you feed the sentence "The cat sat on the mat" into an embedding model, you do not get back text. You get back a vector: a list of floating-point numbers.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# This produces a vector of 384 numbers
embedding = model.encode("The cat sat on the mat")
print(f"Vector dimension: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
```

Output:

```
Vector dimension: 384
First 10 values: [ 0.02931074 -0.01298371  0.06068995  0.02664177  0.00397757  0.02698916
  0.0092486  -0.04706898  0.02984722 -0.02698026]
```

These numbers mean nothing to humans, but they encode semantic relationships. The sentence "A feline rested on the rug" produces a vector close to the one above: different words, similar meaning, similar numbers.

The distance between two vectors tells you how semantically similar the underlying text is. You measure this with cosine similarity or Euclidean distance:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sentence1 = model.encode("The cat sat on the mat")
sentence2 = model.encode("A feline rested on the rug")
sentence3 = model.encode("Quantum physics is fascinating")

print(f"Cat vs Feline (similar): {cosine_similarity(sentence1, sentence2):.4f}")
print(f"Cat vs Physics (dissimilar): {cosine_similarity(sentence1, sentence3):.4f}")
```

Output:

```
Cat vs Feline (similar): 0.7421
Cat vs Physics (dissimilar): 0.0923
```

The model learned that cats and felines are related concepts. Physics and cats are not.

Embeddings turn the fuzzy problem of "meaning" into the precise problem of "geometric distance." This makes search mathematical instead of lexical.
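The chapter names both cosine similarity and Euclidean distance but only shows the first. The sketch below puts the two side by side on toy 3-dimensional vectors; the values are made up for illustration and are not model output. It also checks a standard identity: for unit-length vectors, squared Euclidean distance equals 2 - 2 × cosine similarity, so both measures rank neighbors the same way once vectors are normalized.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "embeddings" (hypothetical values chosen for illustration)
cat = normalize([0.9, 0.1, 0.2])
feline = normalize([0.8, 0.2, 0.3])
physics = normalize([0.1, 0.9, 0.1])

for name, other in [("feline", feline), ("physics", physics)]:
    cos = cosine_similarity(cat, other)
    dist = euclidean_distance(cat, other)
    # For unit-length vectors: dist**2 == 2 - 2 * cos
    print(f"cat vs {name}: cosine={cos:.4f}  euclidean={dist:.4f}  "
          f"dist^2={dist ** 2:.4f}  2-2cos={2 - 2 * cos:.4f}")
```

Higher cosine similarity and lower Euclidean distance both mean "more similar", which is why some APIs report a similarity and others a distance.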
  2. 02 · Embedding Models Compared · 20 min

Model choice affects search quality more than database choice. MiniLM balances speed and quality for most local use cases.

Embedding models vary along three axes: vector dimension, quality, and speed. Larger dimensions capture more nuance but slow down similarity calculations.

The most common open-source models:

| Model | Dimensions | Quality Score | Speed (docs/sec) |
|-------|------------|---------------|------------------|
| all-MiniLM-L6-v2 | 384 | 68.4 | 14,000 |
| paraphrase-multilingual-MiniLM-L12-v2 | 384 | 70.1 | 10,000 |
| all-mpnet-base-v2 | 768 | 72.7 | 2,800 |
| bge-base-en-v1.5 | 768 | 73.4 | 3,200 |

Quality scores are from the MTEB (Massive Text Embedding Benchmark) leaderboard. Higher scores indicate better retrieval performance.

```python
from sentence_transformers import SentenceTransformer
import time

models = [
    'all-MiniLM-L6-v2',
    'all-mpnet-base-v2',
    'BAAI/bge-base-en-v1.5'
]

# Short synthetic documents; real documents will encode more slowly
test_corpus = ["Sample document number {}".format(i) for i in range(1000)]

for model_name in models:
    model = SentenceTransformer(model_name)
    start = time.time()
    embeddings = model.encode(test_corpus, show_progress_bar=False)
    elapsed = time.time() - start
    docs_per_second = len(test_corpus) / elapsed
    print(f"{model_name}: {docs_per_second:.0f} docs/sec, shape {embeddings.shape}")
```

Expect output like:

```
all-MiniLM-L6-v2: 14235 docs/sec, shape (1000, 384)
all-mpnet-base-v2: 2810 docs/sec, shape (1000, 768)
BAAI/bge-base-en-v1.5: 3150 docs/sec, shape (1000, 768)
```

For local development and most production use cases, `all-MiniLM-L6-v2` is the practical choice. It encodes about 14,000 short documents per second on a modern laptop. If you need maximum quality and can tolerate encoding that is roughly 4-5x slower, `bge-base-en-v1.5` is the better choice.

Multilingual support matters if you index documents in multiple languages. The multilingual model handles English and 50+ other languages in the same vector space, so English "cat" and Spanish "gato" end up close together.

```python
# Multilingual model usage
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

multilingual = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

en_embedding = multilingual.encode("The cat")
es_embedding = multilingual.encode("El gato")
de_embedding = multilingual.encode("Die Katze")

print(f"English-Spanish similarity: {cosine_similarity(en_embedding, es_embedding):.4f}")
print(f"English-German similarity: {cosine_similarity(en_embedding, de_embedding):.4f}")
```
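Dimension also drives storage, not only speed. As a rough sketch, assuming vectors are stored as uncompressed 32-bit floats (the helper name `embedding_storage_mb` is hypothetical), the raw footprint is documents × dimensions × 4 bytes:

```python
def embedding_storage_mb(num_docs: int, dimension: int, bytes_per_value: int = 4) -> float:
    """Raw size of the vectors alone, in megabytes (10^6 bytes).

    Ignores index structures, metadata, and document text.
    """
    return num_docs * dimension * bytes_per_value / 1_000_000

for dim in (384, 768):
    size = embedding_storage_mb(100_000, dim)
    print(f"{dim} dimensions, 100,000 docs: {size:.1f} MB")
```

Doubling the dimension doubles the raw vector storage, which is one reason a 384-dimensional model is a comfortable default on a laptop.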
  3. 03 · ChromaDB Setup · 15 min

ChromaDB stores vectors alongside metadata and provides a Pythonic interface for vector operations without external services.

ChromaDB is a dedicated vector database that runs entirely in-process. No server, no Docker, no API keys. It persists data to disk, using SQLite under the hood.

Install it with pip:

```bash
pip install chromadb==0.4.22
```

The client connects to a persistent database:

```python
import chromadb

# Creates ./chroma_db directory if it doesn't exist
client = chromadb.PersistentClient(path="./chroma_db")

# Alternative: in-memory client (data lost on restart)
client = chromadb.Client()
```

ChromaDB version matters. Version 0.4.x changed the API significantly from 0.3.x. The examples here use 0.4.22.

Common setup error: an older `chromadb` release pinned by, or conflicting with, another dependency. If you see `ImportError: cannot import name 'Client' from 'chromadb'`, check your installed version:

```bash
pip show chromadb
```

Also confirm that the package you installed is `chromadb`, the name used in the pip command above, and not a similarly named package such as `chroma-db` or `chromadb-old`.
  4. 04 · ChromaDB Collections · 20 min

A collection is a container for related documents. Think of it as a table in SQL, but for vectors.

Collections hold documents, their embeddings, and metadata. Each collection has a name and a specific embedding function. All documents in a collection use the same embedding model.

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

# Create a collection
collection = client.create_collection(
    name="articles",
    metadata={"description": "Technical articles about programming"}
)

print(f"Collection ID: {collection.id}")
print(f"Collection name: {collection.name}")
```

If a collection with that name already exists, `create_collection` raises an error. Use `get_or_create_collection` instead:

```python
# Safe: gets existing or creates new
collection = client.get_or_create_collection(
    name="articles",
    metadata={"description": "Technical articles about programming"}
)
```

Listing and inspecting collections:

```python
# List all collections
all_collections = client.list_collections()
for col in all_collections:
    print(f"Name: {col.name}, ID: {col.id}, Count: {col.count()}")
```

Output:

```
Name: articles, ID: 1a2b3c4d..., Count: 0
Name: support_tickets, ID: 5e6f7g8h..., Count: 1523
```

Deleting a collection is permanent and immediate:

```python
client.delete_collection(name="articles")
```
  5. 05 · Adding Documents · 15 min

ChromaDB accepts raw text and generates embeddings automatically when you provide an embedding function.

You can let ChromaDB handle embedding generation or provide pre-computed embeddings. The automatic approach is simpler; the manual approach offers more control.

### Automatic Embedding

ChromaDB expects an embedding function object, not a raw `SentenceTransformer` model, so wrap the model name in `SentenceTransformerEmbeddingFunction`:

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")

# Wrap the model in ChromaDB's embedding-function interface
embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create collection with embedding function
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=embedder
)

# Add documents
collection.add(
    documents=[
        "How to reset a forgotten password",
        "Password reset not working after email change",
        "Contact customer support for account recovery",
        "Set up two-factor authentication"
    ],
    ids=["doc1", "doc2", "doc3", "doc4"],
    metadatas=[
        {"category": "auth", "priority": "high"},
        {"category": "auth", "priority": "high"},
        {"category": "support", "priority": "medium"},
        {"category": "security", "priority": "medium"}
    ]
)

print(f"Collection count: {collection.count()}")
```

### Manual Embedding

When you want to reuse embeddings across systems or use a different embedding model:

```python
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="docs_manual")

# Pre-compute embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
docs = ["First document text", "Second document text"]
embeddings = model.encode(docs)

collection.add(
    documents=docs,
    embeddings=embeddings.tolist(),  # ChromaDB needs list, not numpy array
    ids=["manual1", "manual2"]
)
```

Common failure: passing numpy arrays directly instead of converting to lists. ChromaDB's internal serialization expects Python lists.

```python
ids = ["manual1", "manual2"]

# This fails:
collection.add(ids=ids, embeddings=embeddings)  # numpy array

# This works:
collection.add(ids=ids, embeddings=embeddings.tolist())  # list of lists
```
  6. 06 · Similarity Search · 20 min

Query the collection with natural language and retrieve the most semantically similar documents.

The `query` method takes a query text (or pre-computed query vector) and returns the `n` most similar results.

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")

# Pass the same embedding function the collection was created with,
# so query text is embedded by the same model as the documents
embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_collection(name="docs", embedding_function=embedder)

# Query for semantically similar documents
results = collection.query(
    query_texts=["I cannot log into my account"],
    n_results=3
)

print("Query results:")
for i, (doc, distance, doc_id) in enumerate(zip(
    results['documents'][0],
    results['distances'][0],
    results['ids'][0]
)):
    print(f"\n{i+1}. [ID: {doc_id}, Distance: {distance:.4f}]")
    print(f"   {doc}")
```

Output:

```
Query results:

1. [ID: doc2, Distance: 0.2341]
   Password reset not working after email change

2. [ID: doc1, Distance: 0.3122]
   How to reset a forgotten password

3. [ID: doc3, Distance: 0.5891]
   Contact customer support for account recovery
```

The distance metric depends on how the collection was configured. The default is squared L2 (squared Euclidean distance). Lower distance means more similar.

You can also query with pre-computed embeddings:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode("I cannot log into my account")

results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=3
)
```

Use `include` to choose which fields come back. Metadata is returned only if you stored it during insertion:

```python
results = collection.query(
    query_texts=["security best practices"],
    n_results=2,
    include=["documents", "metadatas", "distances"]
)

for doc, meta, dist in zip(
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
):
    print(f"Document: {doc}")
    print(f"Metadata: {meta}")
    print(f"Distance: {dist:.4f}\n")
```
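To make "returns the `n` most similar results" concrete, here is a minimal brute-force sketch of the same idea in plain Python: score every stored vector by squared L2 distance and keep the closest few. This is only an illustration; ChromaDB itself uses an approximate index rather than scanning every vector, and the names `squared_l2` and `brute_force_query` are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_query(query_vec, stored, n_results=3):
    """Return [(id, distance), ...] for the n_results nearest vectors.

    stored: dict mapping document id -> vector.
    """
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:n_results]

# Toy 2-dimensional vectors (made-up values)
stored = {
    "doc1": [1.0, 0.0],
    "doc2": [0.9, 0.1],
    "doc3": [0.0, 1.0],
}

for doc_id, distance in brute_force_query([1.0, 0.0], stored, n_results=2):
    print(f"{doc_id}: {distance:.4f}")
```

The result shape mirrors what `query` hands back: parallel ids and distances, ordered from nearest to farthest.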
  7. 07 · Metadata Filtering · 20 min

Pre-filter documents by metadata before similarity search to scope results to relevant subsets.

ChromaDB supports `where` filtering to restrict queries to documents matching specific metadata criteria. The filter narrows the candidate set that similarity search ranks.

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )
)

# Add documents with various metadata
collection.add(
    documents=[
        "How to install Python 3.11 on Ubuntu",
        "Python installation guide for Windows",
        "Docker container setup tutorial",
        "Kubernetes deployment best practices",
        "React component lifecycle explained",
        "Building REST APIs with FastAPI"
    ],
    ids=["p1", "p2", "d1", "k1", "r1", "f1"],
    metadatas=[
        {"category": "python", "difficulty": "beginner", "rating": 4.5},
        {"category": "python", "difficulty": "beginner", "rating": 4.2},
        {"category": "devops", "difficulty": "intermediate", "rating": 4.8},
        {"category": "devops", "difficulty": "advanced", "rating": 4.6},
        {"category": "frontend", "difficulty": "intermediate", "rating": 4.3},
        {"category": "backend", "difficulty": "intermediate", "rating": 4.7}
    ]
)

# Filter by single metadata field
results = collection.query(
    query_texts=["containers and deployment"],
    n_results=3,
    where={"category": "devops"}  # Only search devops documents
)

print("DevOps results:")
for doc in results['documents'][0]:
    print(f"  - {doc}")
```

Output:

```
DevOps results:
  - Docker container setup tutorial
  - Kubernetes deployment best practices
```

Comparison operators are `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, and `$lte`. The ordering operators compare numbers, so apply them to a numeric field such as `rating` rather than to a string such as `difficulty`. To combine several conditions, wrap them in `$and` or `$or`:

```python
# Filter by category AND rating (multiple conditions need an explicit $and)
results = collection.query(
    query_texts=["programming tutorials"],
    n_results=3,
    where={
        "$and": [
            {"category": {"$eq": "python"}},
            {"rating": {"$gte": 4.3}}  # numeric comparison: rating >= 4.3
        ]
    }
)

# Filter with OR logic using $or
results = collection.query(
    query_texts=["tutorials"],
    n_results=5,
    where={
        "$or": [
            {"category": {"$eq": "python"}},
            {"category": {"$eq": "frontend"}}
        ]
    }
)
```

Metadata filtering is effective but has limits: resolving the filter adds work on top of the vector search, and that cost can become noticeable at scale. For large-scale filtering (millions of documents), consider segmenting into separate collections per category.
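Writing nested filter dictionaries by hand gets error-prone once there are several conditions. The helper below is a small sketch, not part of ChromaDB: `build_where` is a hypothetical name, and it assumes the filter grammar described above (one operator per field, with multiple fields joined by `$and`). It turns a flat mapping into that shape.

```python
from typing import Any, Dict, Optional

def build_where(conditions: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build a `where` filter from {field: value_or_operator_dict}.

    A plain value means equality; a dict such as {"$gte": 4.5} is passed
    through unchanged. Returns None when there are no conditions, so the
    result can be passed straight to `where=`.
    """
    clauses = []
    for field, cond in conditions.items():
        if isinstance(cond, dict):
            clauses.append({field: cond})
        else:
            clauses.append({field: {"$eq": cond}})

    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}

where = build_where({"category": "python", "rating": {"$gte": 4.3}})
print(where)
```

The returned dictionary can then be supplied as the `where` argument of `collection.query`.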
  8. 08 · FAISS Installation · 20 min

FAISS provides GPU-accelerated nearest-neighbor search on billions of vectors. Install via conda for CUDA support or pip for CPU-only.

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. Unlike ChromaDB, FAISS is a library, not a database. You manage indexing and persistence yourself.

### CPU Installation (pip)

```bash
pip install faiss-cpu
```

### GPU Installation (conda recommended)

FAISS GPU support requires CUDA and works best with conda:

```bash
conda install -c pytorch faiss-gpu cudatoolkit=11.7
```

The conda approach handles CUDA library dependencies that pip cannot manage.

### Verification

```python
import faiss
import numpy as np

print(f"FAISS version: {faiss.__version__}")

# Create a simple index
dimension = 128
index = faiss.IndexFlatL2(dimension)  # L2 distance index

# Generate random test vectors (FAISS expects float32)
vectors = np.random.random((1000, dimension)).astype('float32')

# Add vectors to index
index.add(vectors)

print(f"Index size: {index.ntotal} vectors")
print(f"Is trained: {index.is_trained}")
```

Output:

```
FAISS version: 1.7.4
Index size: 1000 vectors
Is trained: True
```

The flat index stores all vectors exactly: no compression, no approximation. For 1000 vectors, this is fine. For 10 million vectors, you need a different index type.
  9. 09 · FAISS Index Types · 20 min

FAISS offers dozens of index types: flat indexes for accuracy, IVF indexes for speed at scale, HNSW for low-latency approximate search.

Choosing an index means trading off speed, memory, accuracy, and build time.

### IndexFlatL2

Exact nearest neighbor search using brute force. No approximation.

```python
import faiss
import numpy as np

dimension = 384
index = faiss.IndexFlatL2(dimension)

# Add 50,000 vectors
vectors = np.random.random((50000, dimension)).astype('float32')
index.add(vectors)

# Query
query = np.random.random((1, dimension)).astype('float32')
k = 5  # Number of nearest neighbors
distances, indices = index.search(query, k)

print(f"Nearest indices: {indices}")
print(f"Distances: {distances}")
```

Accurate but slow for large datasets. O(n) search time.

### IndexIVFFlat

Inverted file index with clustering. Faster search by limiting candidates to nearby clusters.

```python
import faiss
import numpy as np

dimension = 384
nlist = 100  # Number of clusters

# Create quantizer (exact L2 index that assigns vectors to clusters)
quantizer = faiss.IndexFlatL2(dimension)

# nlist clusters, measure distance with L2
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)

# Must train before adding vectors
training_vectors = np.random.random((10000, dimension)).astype('float32')
index.train(training_vectors)

# Now add vectors
vectors = np.random.random((50000, dimension)).astype('float32')
index.add(vectors)

# Set nprobe (clusters to search)
index.nprobe = 10  # Search 10 nearest clusters

query = np.random.random((1, dimension)).astype('float32')
k = 5
distances, indices = index.search(query, k)
```

IVF reduces the number of vectors compared per query from n to roughly nlist + nprobe × (n / nlist): the query is first compared with the nlist cluster centroids, then with the vectors in the nprobe closest clusters. With 100 clusters and nprobe=10, you scan roughly 10% of the data.

### IndexHNSWFlat

Hierarchical Navigable Small World graph. Excellent speed with good accuracy.

```python
import faiss
import numpy as np

dimension = 384
M = 32  # Number of connections per node

index = faiss.IndexHNSWFlat(dimension, M, faiss.METRIC_L2)
index.hnsw.efConstruction = 40  # Build-time quality
index.hnsw.efSearch = 64  # Search-time quality

vectors = np.random.random((50000, dimension)).astype('float32')
index.add(vectors)

query = np.random.random((1, dimension)).astype('float32')
k = 5
distances, indices = index.search(query, k)
```

HNSW is typically the fastest for in-memory search but uses more RAM than a flat index, because it stores the graph links in addition to the full vectors. Memory usage is approximately `ntotal * (dimension * 4 + M * 2 * 4)` bytes.
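Before choosing an index it helps to estimate whether it fits in RAM. The sketch below encodes two rules of thumb: a flat index stores 4 bytes per dimension per vector, and the FAISS index guidelines give about `d * 4 + M * 2 * 4` bytes per vector for `IndexHNSWFlat`. Treat both as lower bounds, since allocator overhead and upper graph layers add to them; the function names are hypothetical.

```python
def flat_index_bytes(ntotal: int, dimension: int) -> int:
    """Raw float32 storage for a flat index: 4 bytes per dimension."""
    return ntotal * dimension * 4

def hnsw_flat_index_bytes(ntotal: int, dimension: int, M: int) -> int:
    """Approximate IndexHNSWFlat size: full vectors plus 2*M links of 4 bytes each."""
    return ntotal * (dimension * 4 + M * 2 * 4)

n, d, M = 500_000, 384, 32
flat = flat_index_bytes(n, d)
hnsw = hnsw_flat_index_bytes(n, d, M)

print(f"Flat: {flat / 1e6:.0f} MB")
print(f"HNSW: {hnsw / 1e6:.0f} MB ({hnsw / flat:.2f}x flat)")
```

For 384-dimensional vectors the graph links add a modest fraction on top of the vectors; the relative overhead grows as the dimension shrinks or `M` increases.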
  10. 10 · IVF vs HNSW · 15 min

IVF scales better with dataset size; HNSW offers better query latency for in-memory data.

Choose based on your data size and latency requirements.

### Quantitative Comparison

Testing with 500,000 384-dimensional vectors:

```python
import faiss
import numpy as np
import time

dimension = 384
n_vectors = 500000

# Generate test data
np.random.seed(42)
vectors = np.random.random((n_vectors, dimension)).astype('float32')
queries = np.random.random((100, dimension)).astype('float32')

# With 100 queries, elapsed seconds * 10 == milliseconds per query

# IndexFlatL2 (baseline)
print("IndexFlatL2 (exact):")
flat = faiss.IndexFlatL2(dimension)
flat.add(vectors)

start = time.time()
for q in queries:
    flat.search(q.reshape(1, -1), 10)
elapsed = time.time() - start
print(f"  Total time: {elapsed:.2f}s, per query: {elapsed*10:.1f}ms")

# IndexIVFFlat
print("\nIndexIVFFlat (nlist=4096, nprobe=40):")
quantizer = faiss.IndexFlatL2(dimension)
ivf = faiss.IndexIVFFlat(quantizer, dimension, 4096)
ivf.train(vectors[:100000])  # Train on subset
ivf.add(vectors)
ivf.nprobe = 40

start = time.time()
for q in queries:
    ivf.search(q.reshape(1, -1), 10)
elapsed = time.time() - start
print(f"  Total time: {elapsed:.2f}s, per query: {elapsed*10:.1f}ms")

# IndexHNSWFlat
print("\nIndexHNSWFlat (M=32, ef=128):")
hnsw = faiss.IndexHNSWFlat(dimension, 32)
hnsw.hnsw.efSearch = 128
hnsw.add(vectors)

start = time.time()
for q in queries:
    hnsw.search(q.reshape(1, -1), 10)
elapsed = time.time() - start
print(f"  Total time: {elapsed:.2f}s, per query: {elapsed*10:.1f}ms")
```

Typical results on CPU:

| Index | Build Time | Memory | Query Latency |
|-------|------------|--------|---------------|
| FlatL2 | <1s | 768MB | ~500ms |
| IVFFlat | ~10s | 770MB | ~50ms |
| HNSW | ~30s | ~1200MB | ~5ms |

### When to Use Each

**Use FlatL2 when:**
- Dataset < 10,000 vectors
- You need exact results
- Memory is not constrained

**Use IVFFlat when:**
- Dataset is millions of vectors
- Memory is limited
- You can tolerate 1-5% recall loss

**Use HNSW when:**
- Sub-10ms queries are required
- Dataset fits in RAM
- Memory is not severely constrained

### Combining IVF and HNSW

FAISS supports composite indexes: use HNSW as the IVF quantizer to speed up assigning vectors to clusters:

```python
# HNSW-based quantizer for IVF
hnsw_quantizer = faiss.IndexHNSWFlat(dimension, 32)
nlist = 1024
# FAISS constructors take positional arguments, not keywords
index = faiss.IndexIVFFlat(hnsw_quantizer, dimension, nlist)
index.train(vectors)
index.add(vectors)
```
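"1-5% recall loss" is only meaningful if you measure it. A common way is recall@k: for each query, take the exact top-k from a flat index as ground truth and count how many of those ids the approximate index also returned. The sketch below works on plain lists of ids (with FAISS you could pass `indices.tolist()` from each `search` call); `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(exact_ids, approx_ids):
    """Mean fraction of exact neighbors that the approximate search recovered.

    exact_ids, approx_ids: one list of neighbor ids per query, same order.
    """
    if len(exact_ids) != len(approx_ids):
        raise ValueError("Need one result list per query in both inputs")

    total = 0.0
    for exact, approx in zip(exact_ids, approx_ids):
        exact_set = set(exact)
        total += len(exact_set & set(approx)) / len(exact_set)
    return total / len(exact_ids)

# Two queries, k = 4: the first approximate result misses one true neighbor
exact = [[1, 2, 3, 4], [5, 6, 7, 8]]
approx = [[1, 2, 3, 9], [5, 6, 7, 8]]

print(f"recall@4 = {recall_at_k(exact, approx):.3f}")  # 0.875
```

Raising `nprobe` (IVF) or `efSearch` (HNSW) generally trades query time for higher recall, so this number is the one to watch while tuning.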
  11. 11 · FAISS with LangChain · 20 min

LangChain provides unified abstractions over ChromaDB and FAISS, so the same code works with either backend.

LangChain's `VectorStore` abstraction lets you swap backends without changing application code. The import paths below match older LangChain releases; newer releases move these classes to `langchain_community`.

### ChromaDB via LangChain

```python
import chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings import SentenceTransformerEmbeddings

client = chromadb.PersistentClient(path="./chroma_db")
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# LangChain handles collection management
vectorstore = Chroma(
    client=client,  # Your ChromaDB client
    collection_name="langchain_docs",
    embedding_function=embeddings
)

# Add documents
vectorstore.add_texts(
    texts=["Document one content", "Document two content"],
    ids=["id1", "id2"]
)

# Similarity search
results = vectorstore.similarity_search("query text", k=3)
```

### FAISS via LangChain

```python
from langchain.vectorstores import FAISS
from langchain.embeddings import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Embed the texts and build the index in one step
vectorstore = FAISS.from_texts(
    texts=["Document one content", "Document two content"],
    embedding=embeddings,
    metadatas=[{"source": "doc1"}, {"source": "doc2"}]
)

# Save to disk
vectorstore.save_local("./faiss_index")

# Load from disk
loaded = FAISS.load_local("./faiss_index", embeddings)
```

Newer LangChain releases may refuse to load the pickled docstore unless you also pass `allow_dangerous_deserialization=True`; only do so for index files you created yourself.

### Unified Interface

Both backends share the same interface:

```python
# Same code works for ChromaDB and FAISS
def search_vectorstore(vectorstore, query, k=5):
    """Search both ChromaDB and FAISS with identical code."""
    return vectorstore.similarity_search(query, k=k)

# Works with either backend
# (chroma_store and faiss_store are vector stores built as shown above)
chroma_results = search_vectorstore(chroma_store, "password reset")
faiss_results = search_vectorstore(faiss_store, "password reset")
```

LangChain also provides `as_retriever()` for integration with chains:

```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5, "filter": {"category": "support"}}
)

# Use in a chain
from langchain.chains import RetrievalQA

# llm: any LangChain-compatible LLM you have already constructed
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever
)
```

The abstraction is convenient but adds a dependency. For production systems where you need specific features (like ChromaDB's metadata filtering), use the native client directly.
  12. 12 · Semantic Search Engine · 15 min

A semantic search engine combines embedding generation, vector storage, and query handling into a cohesive system.

This chapter builds the core architecture. Later chapters add optimization and persistence.

```python
from sentence_transformers import SentenceTransformer
from chromadb.utils import embedding_functions
import chromadb
from typing import List, Dict, Optional

class SemanticSearchEngine:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        # Direct handle on the model (used for dimension lookup and manual encoding)
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()

        self.client = chromadb.PersistentClient(path="./search_index")
        self.collection = self.client.get_or_create_collection(
            name="documents",
            # ChromaDB needs its own embedding-function wrapper, not the raw model
            embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(
                model_name=model_name
            ),
            metadata={"hnsw:space": "cosine"}
        )

    def index_documents(
        self,
        documents: List[str],
        ids: Optional[List[str]] = None,
        metadatas: Optional[List[Dict]] = None
    ) -> int:
        """Index a batch of documents."""
        if ids is None:
            # Positional ids: pass explicit ids when indexing more than one batch
            ids = [f"doc_{i}" for i in range(len(documents))]

        self.collection.add(
            documents=documents,
            ids=ids,
            metadatas=metadatas
        )
        return len(documents)

    def search(
        self,
        query: str,
        top_k: int = 5,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """Search for semantically similar documents."""
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k,
            where=filters,
            include=["documents", "metadatas", "distances"]
        )

        return [
            {
                "id": results['ids'][0][i],
                "document": results['documents'][0][i],
                "metadata": results['metadatas'][0][i],
                "distance": results['distances'][0][i]
            }
            for i in range(len(results['ids'][0]))
        ]

    def count(self) -> int:
        return self.collection.count()


# Usage example
engine = SemanticSearchEngine()

# Index documents
docs = [
    "Python list comprehensions allow concise list creation",
    "Java streams provide functional programming patterns",
    "Docker containers package applications with dependencies",
    "FastAPI makes building REST APIs simple and fast"
]
metas = [
    {"category": "python", "difficulty": "beginner"},
    {"category": "java", "difficulty": "intermediate"},
    {"category": "devops", "difficulty": "intermediate"},
    {"category": "backend", "difficulty": "beginner"}
]

engine.index_documents(docs, metadatas=metas)

# Search
results = engine.search("functional programming in Java", top_k=2)
for r in results:
    print(f"[{r['id']}] {r['document']} (dist: {r['distance']:.3f})")
```

Output:

```
[doc_1] Java streams provide functional programming patterns (dist: 0.182)
[doc_0] Python list comprehensions allow concise list creation (dist: 0.456)
```
  13. 13 · Indexing Strategies · 20 min

How you chunk documents before indexing determines search granularity: too large loses precision, too small loses context.

### Chunking Strategies

The most common approach: split documents into overlapping chunks.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= text_length:
            break  # Last chunk reached; avoid emitting a duplicate tail
        start = end - overlap  # Move back by overlap

    return chunks

# Example usage
long_document = """
Python was created by Guido van Rossum and first released in 1991.
It emphasizes code readability with its notable use of significant whitespace.
Python supports multiple programming styles, including structured,
procedural, reflective, object-oriented, and functional programming.
It has a large standard library referred to as the "batteries included"
philosophy of the Python community.
"""

chunks = chunk_text(long_document, chunk_size=150, overlap=30)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
```

Each chunk after the first begins with the last 30 characters of the previous chunk, so text near a boundary appears in both neighboring chunks.

### Choosing Chunk Size

| Use Case | Chunk Size | Reasoning |
|----------|------------|-----------|
| FAQ / Short answers | 100-200 chars | Each chunk is a complete answer |
| Technical docs | 300-500 chars | Capture individual concepts |
| Long articles | 500-1000 chars | Balance context and specificity |
| Books / Papers | 1000-2000 chars | Maintain paragraph-level context |

### Chunk Metadata

Store chunk context in metadata for filtering and display:

```python
from typing import Dict

def index_document_with_chunks(
    engine: SemanticSearchEngine,
    doc_id: str,
    text: str,
    metadata: Dict,
    chunk_size: int = 500
):
    chunks = chunk_text(text, chunk_size)

    ids = [f"{doc_id}_chunk_{i}" for i in range(len(chunks))]
    chunk_metadatas = [
        {
            **metadata,
            "parent_id": doc_id,
            "chunk_index": i,
            "total_chunks": len(chunks),
            "chunk_text": chunk[:200]  # First 200 chars for preview
        }
        for i, chunk in enumerate(chunks)
    ]

    engine.index_documents(chunks, ids=ids, metadatas=chunk_metadatas)
```

Later, when retrieving results, you can reassemble chunks by `parent_id` to reconstruct the full document context.
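Reassembling by `parent_id` can be sketched in a few lines. The example assumes search results shaped like the dictionaries returned by `SemanticSearchEngine.search` (keys `id`, `document`, `metadata`, `distance`) and metadata carrying `parent_id` and `chunk_index` as stored above; `group_chunks_by_parent` is a hypothetical helper, and the sample results are made up.

```python
from collections import defaultdict
from typing import Dict, List

def group_chunks_by_parent(results: List[Dict]) -> Dict[str, List[Dict]]:
    """Group chunk-level hits by parent document, ordered by chunk position."""
    grouped = defaultdict(list)
    for hit in results:
        grouped[hit["metadata"]["parent_id"]].append(hit)

    # Restore reading order inside each document
    for hits in grouped.values():
        hits.sort(key=lambda h: h["metadata"]["chunk_index"])
    return dict(grouped)

# Made-up search results, in relevance order
results = [
    {"id": "guide_chunk_2", "document": "third part",
     "metadata": {"parent_id": "guide", "chunk_index": 2}, "distance": 0.21},
    {"id": "faq_chunk_0", "document": "only part",
     "metadata": {"parent_id": "faq", "chunk_index": 0}, "distance": 0.30},
    {"id": "guide_chunk_0", "document": "first part",
     "metadata": {"parent_id": "guide", "chunk_index": 0}, "distance": 0.34},
]

grouped = group_chunks_by_parent(results)
for parent_id, hits in grouped.items():
    print(parent_id, [h["metadata"]["chunk_index"] for h in hits])
```

Parents keep the order in which their best chunk appeared, so the most relevant document still comes first. Joining the chunk texts directly would repeat the overlapping characters, so trim the overlap or fetch the original document if you need clean text.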
  14. 14 · Batch Ingestion · 20 min

Process documents in batches of 100-500 to balance memory usage and throughput during large-scale indexing.

Loading thousands of documents one at a time is slow. Loading all at once exhausts memory. Batch processing finds the balance.

```python
from typing import List, Dict, Generator
import time

def batch_generator(items: List, batch_size: int) -> Generator[List, None, None]:
    """Yield batches of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def ingest_documents(
    engine: SemanticSearchEngine,
    documents: List[Dict],
    batch_size: int = 100
) -> Dict:
    """Ingest documents in batches with progress reporting."""
    total = len(documents)
    start_time = time.time()

    texts = [doc["text"] for doc in documents]
    ids = [doc["id"] for doc in documents]
    metadatas = [doc.get("metadata", {}) for doc in documents]

    processed = 0
    # Batch over index positions so texts, ids, and metadata stay aligned
    for batch in batch_generator(range(total), batch_size):
        batch_texts = [texts[i] for i in batch]
        batch_ids = [ids[i] for i in batch]
        batch_metas = [metadatas[i] for i in batch]

        engine.index_documents(
            batch_texts,
            ids=batch_ids,
            metadatas=batch_metas
        )

        processed += len(batch)
        elapsed = time.time() - start_time
        rate = processed / elapsed if elapsed > 0 else 0
        print(f"Progress: {processed}/{total} ({100*processed/total:.1f}%) "
              f"- {rate:.1f} docs/sec")

    return {
        "total": total,
        "elapsed": time.time() - start_time
    }
```

### Parallel Embedding Generation

For CPU-bound embedding generation, use multiprocessing. Load the model once per worker process rather than once per batch, and on platforms that spawn worker processes, call `parallel_encode` from under an `if __name__ == "__main__":` guard:

```python
from sentence_transformers import SentenceTransformer
from multiprocessing import Pool, cpu_count
from typing import List, Optional

_worker_model = None  # One model instance per worker process

def _init_worker(model_name: str) -> None:
    """Load the model once when a worker process starts."""
    global _worker_model
    _worker_model = SentenceTransformer(model_name)

def encode_batch(texts: List[str]) -> List:
    """Encode a batch of texts (worker function)."""
    return _worker_model.encode(texts).tolist()

def parallel_encode(
    texts: List[str],
    model_name: str,
    num_workers: Optional[int] = None,
    batch_size: int = 100
) -> List:
    """Encode texts in parallel using multiple processes."""
    num_workers = num_workers or max(1, cpu_count() - 1)
    batches = list(batch_generator(texts, batch_size))

    with Pool(num_workers, initializer=_init_worker, initargs=(model_name,)) as pool:
        results = pool.map(encode_batch, batches)

    # Flatten results
    return [embedding for batch in results for embedding in batch]
```

### Progress Tracking

For long-running ingestion, track progress in a file:

```python
import json
from pathlib import Path

def ingest_with_checkpoint(
    engine: SemanticSearchEngine,
    documents: List[Dict],
    checkpoint_file: str = ".ingestion_checkpoint.json",
    batch_size: int = 100
):
    """Ingest documents with checkpointing for recovery."""
    checkpoint_path = Path(checkpoint_file)

    # Load checkpoint if exists
    if checkpoint_path.exists():
        with open(checkpoint_path) as f:
            checkpoint = json.load(f)
        start_index = checkpoint["indexed_count"]
        print(f"Resuming from checkpoint: {start_index} already indexed")
    else:
        start_index = 0

    # Process remaining documents
    for i in range(start_index, len(documents), batch_size):
        batch = documents[i:i + batch_size]

        engine.index_documents(
            [d["text"] for d in batch],
            ids=[d["id"] for d in batch],
            metadatas=[d.get("metadata", {}) for d in batch]
        )

        # Save checkpoint
        with open(checkpoint_path, 'w') as f:
            json.dump({"indexed_count": i + len(batch)}, f)
```
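One weakness of writing the checkpoint in place is that a crash halfway through the write can leave a truncated JSON file, which then breaks the resume step. A common remedy, sketched here under the assumption that the checkpoint holds a single `indexed_count` field as above, is to write to a temporary file in the same directory and then swap it in with `os.replace`, which replaces the target in a single step. The function names are hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint_atomic(path: str, indexed_count: int) -> None:
    """Write the checkpoint to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"indexed_count": indexed_count}, f)
            f.flush()
            os.fsync(f.fileno())  # Make sure the bytes reach the disk first
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file; the old checkpoint stays intact
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path: str) -> int:
    """Return the saved count, or 0 when no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["indexed_count"]
    except FileNotFoundError:
        return 0

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "checkpoint.json")
    print(load_checkpoint(demo_path))   # 0
    save_checkpoint_atomic(demo_path, 300)
    print(load_checkpoint(demo_path))   # 300
```

A reader of the checkpoint therefore sees either the previous complete file or the new complete file, never a half-written one.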
  15. Embedding Caching

Cache embeddings to avoid recomputing them when documents haven't changed. Re-indexing then only pays for documents whose text is new or modified.

Embedding computation is the expensive part. If you re-index the same documents, regenerating embeddings wastes time.

### Hash-Based Cache

```python
import hashlib
import json
from pathlib import Path

import numpy as np


class EmbeddingCache:
    def __init__(self, cache_dir: str = ".embedding_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.index_file = self.cache_dir / "index.json"
        self._load_index()

    def _load_index(self):
        if self.index_file.exists():
            with open(self.index_file) as f:
                self.index = json.load(f)
        else:
            self.index = {}

    def _save_index(self):
        with open(self.index_file, 'w') as f:
            json.dump(self.index, f, indent=2)

    def _compute_hash(self, text: str) -> str:
        """Hash document text to detect changes."""
        return hashlib.sha256(text.encode()).hexdigest()

    def _get_cache_path(self, doc_id: str) -> Path:
        """Get path for cached embedding."""
        return self.cache_dir / f"{doc_id}.npy"

    def get(self, doc_id: str, text: str, embedding_model) -> list:
        """Get embedding from cache or compute and cache it."""
        text_hash = self._compute_hash(text)
        cache_path = self._get_cache_path(doc_id)

        # Cache hit: same ID, same text, and the vector file still exists
        entry = self.index.get(doc_id)
        if entry and entry["hash"] == text_hash and cache_path.exists():
            return np.load(cache_path).tolist()

        # Cache miss: compute embedding
        embedding = embedding_model.encode([text])[0].tolist()

        # Save to cache
        np.save(cache_path, np.array(embedding))
        self.index[doc_id] = {"hash": text_hash}
        self._save_index()

        return embedding

    def invalidate(self, doc_id: str):
        """Remove document from cache."""
        if doc_id in self.index:
            del self.index[doc_id]
        cache_path = self._get_cache_path(doc_id)
        if cache_path.exists():
            cache_path.unlink()
        self._save_index()

    def clear(self):
        """Clear entire cache."""
        for path in self.cache_dir.glob("*.npy"):
            path.unlink()
        self.index = {}
        self._save_index()
```

### Usage with Search Engine

```python
from typing import Dict, Optional


class CachedSemanticSearchEngine(SemanticSearchEngine):
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        super().__init__(model_name)
        self.cache = EmbeddingCache()

    def index_documents(self, documents, ids=None, metadatas=None):
        if ids is None:
            ids = [f"doc_{i}" for i in range(len(documents))]

        # Get embeddings (from cache or computed)
        embeddings = [
            self.cache.get(doc_id, doc, self.model)
            for doc_id, doc in zip(ids, documents)
        ]

        # upsert rather than add, so re-indexing an existing ID replaces its record
        self.collection.upsert(
            documents=documents,
            embeddings=embeddings,
            ids=ids,
            metadatas=metadatas
        )
        return len(documents)

    def reindex_document(self, doc_id: str, document: str, metadata: Optional[Dict] = None):
        """Re-index a single document, refreshing its cache entry."""
        self.cache.invalidate(doc_id)
        return self.index_documents(
            [document],
            ids=[doc_id],
            metadatas=[metadata] if metadata is not None else None
        )
```

### Cache Statistics

```python
from typing import Dict


def cache_stats(cache: EmbeddingCache) -> Dict:
    """Get statistics about cache usage."""
    cache_files = list(cache.cache_dir.glob("*.npy"))
    total_size = sum(f.stat().st_size for f in cache_files)

    return {
        "cached_documents": len(cache_files),
        "total_size_mb": total_size / (1024 * 1024),
        "index_entries": len(cache.index)
    }
```

20 min
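The cache in this chapter detects changes from the document text alone, so switching to a different embedding model would still count as a hit and return vectors produced by the old model. One possible guard, sketched here as an assumption rather than as part of the course code, is to fold the model name into the hash. The function name `cache_key` is hypothetical.

```python
import hashlib


def cache_key(model_name: str, text: str) -> str:
    """Hash the model name together with the text.

    A NUL byte separates the two fields so that ("ab", "c") and ("a", "bc")
    cannot collide by simple concatenation.
    """
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    digest.update(b"\x00")
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()
```

Storing this value in place of the plain text hash would make every entry miss after a model change, which forces a full recompute but avoids mixing vectors from two models in one collection.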
  16. Performance Optimization

The main bottlenecks are embedding computation and vector search. Optimize embeddings first, then database operations.

### Profiling

Before optimizing, measure where time goes:

```python
import time
from functools import wraps


def profile(func):
    """Decorator to time function execution."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # perf_counter is monotonic and higher resolution than time.time()
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__}: {elapsed:.3f}s")
        return result
    return wrapper


class ProfilingSearchEngine(SemanticSearchEngine):
    @profile
    def index_documents(self, documents, ids=None, metadatas=None):
        return super().index_documents(documents, ids, metadatas)

    @profile
    def search(self, query, top_k=5, filters=None):
        return super().search(query, top_k, filters)
```

### Optimizing Embedding Speed

```python
# 1. Encode in batches instead of one text at a time
batch_embeddings = model.encode(docs, batch_size=256, show_progress_bar=True)

# 2. Pre-encode common queries
COMMON_QUERIES = [
    "How do I reset my password?",
    "Where can I find my invoice?",
    "How do I contact support?"
]
query_cache = {q: model.encode(q) for q in COMMON_QUERIES}


def cached_query(query):
    if query in query_cache:
        return query_cache[query]
    return model.encode(query)


# 3. Use half-precision for storage (half the memory of float32,
#    at the cost of a small loss in precision)
embeddings_fp16 = batch_embeddings.astype('float16')
```

### Optimizing ChromaDB Queries

```python
from chromadb.utils import embedding_functions

# 1. Limit result set to what you need
results = collection.query(
    query_texts=[query],
    n_results=10,  # Don't request more than you need
)

# 2. Only include fields you use
results = collection.query(
    query_texts=[query],
    n_results=5,
    include=["documents"]  # Skip metadata and distances if not needed
)

# 3. Tune the HNSW index for large collections
#    (higher ef values raise recall but cost build and query time)
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.create_collection(
    name="large_collection",
    embedding_function=embedding_fn,  # Chroma expects its own wrapper, not a raw SentenceTransformer
    metadata={"hnsw:construction_ef": 100, "hnsw:search_ef": 100}
)
```

### Optimizing FAISS

```python
import faiss

# 1. Use IVFPQ for compression with large datasets
dimension = 384  # all-MiniLM-L6-v2 output size; must be divisible by m
nlist = 100      # Number of clusters
m = 16           # Number of subquantizers
quantizer = faiss.IndexFlatL2(dimension)  # Coarse quantizer that assigns vectors to clusters
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, 8)  # 8 bits per subquantizer code
# IVF indexes must be trained (index.train) on representative vectors before index.add

# 2. Tune nprobe for IVF indexes
# FAISS defaults to nprobe = 1; start near sqrt(nlist) and adjust for recall
index.nprobe = 10  # About sqrt(nlist)
index.nprobe = 50  # Higher recall, lower speed

# 3. Use index_cpu_to_gpu for GPU acceleration
if faiss.get_num_gpus() > 0:
    res = faiss.StandardGpuResources()
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
```

### Memory Optimization

```python
# Monitor memory usage
import gc
import os

import psutil


def get_memory_mb():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024


print(f"Memory before: {get_memory_mb():.1f} MB")

# Drop references to the ChromaDB objects, then ask the garbage collector to run.
# The operating system may not reclaim all of the memory immediately.
del collection
del client
gc.collect()

print(f"Memory after: {get_memory_mb():.1f} MB")
```

20 min
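The FAISS advice above says to adjust `nprobe` "based on recall requirements" without defining how to measure recall. A common approach is to run the same queries against an exact index (for example `faiss.IndexFlatL2`) and the approximate one, then compare the returned ID lists. The sketch below assumes both results are available as plain lists of IDs, such as the `I` array from `index.search(queries, k)` converted with `.tolist()`. The function name `recall_at_k` is hypothetical.

```python
from typing import Sequence


def recall_at_k(
    exact_ids: Sequence[Sequence[int]],
    approx_ids: Sequence[Sequence[int]],
    k: int,
) -> float:
    """Fraction of the true top-k neighbors that the approximate search returned.

    exact_ids[i] and approx_ids[i] are the result IDs for query i, best first.
    FAISS pads missing results with -1; those never match a true neighbor.
    """
    if len(exact_ids) != len(approx_ids):
        raise ValueError("Both result sets must cover the same queries")

    hits = 0
    total = 0
    for exact, approx in zip(exact_ids, approx_ids):
        truth = set(exact[:k])
        hits += len(truth & set(approx[:k]))
        total += len(truth)

    return hits / total if total else 0.0
```

Sweeping `nprobe` over a few values and recording both this number and the query latency gives a concrete basis for choosing a setting instead of guessing.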
  17. ChromaDB Persistence

ChromaDB persists automatically, but understanding the underlying storage format helps with backup and migration.

### Automatic Persistence

When you use `PersistentClient`, ChromaDB saves to disk after each write operation:

```python
import chromadb

# This automatically persists to ./my_database
client = chromadb.PersistentClient(path="./my_database")
# get_or_create_collection so the script also works on a second run
collection = client.get_or_create_collection("docs")

# Add data
collection.add(documents=["Test"], ids=["1"])

# Data is persisted immediately
# On restart, data is available
```

### Understanding the Storage Format

The exact layout depends on the ChromaDB version. In recent releases (0.4 and later) it looks roughly like this:

```
./my_database/
├── chroma.sqlite3                          # SQLite database
└── 062d4c40-1d54-4feb-bf7b-282f6c3e2e2a/   # One folder per HNSW index
    ├── data_level0.bin
    ├── header.bin
    ├── length.bin
    └── link_lists.bin
```

The SQLite database stores documents and metadata. The UUID-named folders store the HNSW vector indexes.

### Manual Backup

```python
import shutil
from datetime import datetime


def backup_chroma(client_path: str, backup_path: str = None):
    """Create a timestamped backup of ChromaDB data.

    Stop any process that is writing to the database first, so the
    copy does not capture a half-written state.
    """
    if backup_path is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup_path = f"./backup_{timestamp}"

    shutil.copytree(client_path, backup_path)
    print(f"Backup created at: {backup_path}")
    return backup_path


# Create backup
backup = backup_chroma("./my_database")
```

### Restoration

```python
import os
import shutil

import chromadb


def restore_chroma(backup_path: str, restore_path: str):
    """Restore ChromaDB from backup."""
    # Remove existing database if present
    if os.path.exists(restore_path):
        shutil.rmtree(restore_path)

    shutil.copytree(backup_path, restore_path)
    return chromadb.PersistentClient(path=restore_path)


# Restore from backup
client = restore_chroma("./backup_20240115_143022", "./my_database")
```

### Incremental Export

For large databases, export incrementally:

```python
import json


def export_collection_to_json(collection, output_file: str, batch_size: int = 1000):
    """Export collection to JSON for migration or backup."""
    total = collection.count()
    exported = 0

    with open(output_file, 'w') as f:
        f.write('[\n')

        while exported < total:
            results = collection.get(
                include=["documents", "metadatas", "embeddings"],
                limit=batch_size,
                offset=exported
            )
            if not results['ids']:
                break  # Nothing more to read; avoid looping forever

            for i in range(len(results['ids'])):
                record = {
                    "id": results['ids'][i],
                    "document": results['documents'][i],
                    "metadata": results['metadatas'][i],
                    # Some versions return NumPy arrays, which json cannot serialize
                    "embedding": [float(x) for x in results['embeddings'][i]]
                }
                json.dump(record, f)
                f.write(',\n' if exported + i + 1 < total else '\n')

            exported += len(results['ids'])
            print(f"Exported {exported}/{total}")

        f.write(']')


# Export
collection = client.get_collection("docs")
export_collection_to_json(collection, "./export.json")
```

### Import from Export

```python
import json


def import_collection_from_json(filepath: str, collection):
    """Import collection from JSON export."""
    with open(filepath) as f:
        records = json.load(f)

    for i in range(0, len(records), 100):
        batch = records[i:i + 100]

        collection.add(
            documents=[r["document"] for r in batch],
            embeddings=[r["embedding"] for r in batch],
            ids=[r["id"] for r in batch],
            metadatas=[r["metadata"] for r in batch]
        )
        print(f"Imported {min(i + 100, len(records))}/{len(records)}")
```

20 min
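The import function in this chapter reads the whole export with `json.load`, which holds every record in memory at once. For exports too large for that, one alternative is JSON Lines, with one record per line, which can be written and read as a stream. This is a sketch under that assumption, not the course's format. The helpers `write_jsonl` and `read_jsonl_batches` are hypothetical names.

```python
import json
from typing import Dict, Iterable, Iterator, List


def write_jsonl(records: Iterable[Dict], path: str) -> int:
    """Write one JSON object per line and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_jsonl_batches(path: str, batch_size: int = 100) -> Iterator[List[Dict]]:
    """Yield records in batches without loading the whole file."""
    batch: List[Dict] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # Tolerate blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # Final partial batch
```

Each yielded batch has the same record shape as the JSON export, so it can be passed to `collection.add` the same way the import loop above does.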
  18. Search Engine Project

Build a complete document Q&A system around semantic search, covering the retrieval stage of retrieval-augmented generation.

This final chapter integrates everything: document indexing, metadata filtering, semantic search, and presenting results.

```python
from sentence_transformers import SentenceTransformer
import chromadb
from typing import List, Dict, Optional
import hashlib
import json
from pathlib import Path


class DocumentQASystem:
    """
    Complete document Q&A system with semantic search.

    Supports:
    - Batch document ingestion
    - Metadata filtering
    - Semantic similarity search
    - Result ranking and confidence scoring
    """

    def __init__(
        self,
        persist_directory: str = "./qa_index",
        model_name: str = "all-MiniLM-L6-v2"
    ):
        self.model_name = model_name
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()

        # Initialize ChromaDB. Embeddings are computed with self.model and passed
        # in explicitly, because a raw SentenceTransformer is not a valid
        # ChromaDB embedding_function.
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(
            name="documents",
            metadata={"hnsw:space": "cosine"}
        )

        # Initialize cache
        self.cache_dir = Path(persist_directory) / "cache"
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.cache_index = self.cache_dir / "index.json"
        self._load_cache_index()

    def _load_cache_index(self):
        if self.cache_index.exists():
            with open(self.cache_index) as f:
                self.cache = json.load(f)
        else:
            self.cache = {}

    def _save_cache_index(self):
        with open(self.cache_index, 'w') as f:
            json.dump(self.cache, f)

    def ingest(
        self,
        documents: List[str],
        metadatas: Optional[List[Dict]] = None,
        batch_size: int = 100
    ) -> int:
        """Ingest documents with progress reporting."""
        ids = [self._generate_id(doc) for doc in documents]

        total_indexed = 0
        for i in range(0, len(documents), batch_size):
            batch_docs = documents[i:i + batch_size]
            batch_ids = ids[i:i + batch_size]
            # Pass None when no metadata was given; some versions reject empty dicts
            batch_metas = metadatas[i:i + batch_size] if metadatas is not None else None

            # upsert: IDs are content hashes, so re-ingesting the same text is harmless
            self.collection.upsert(
                documents=batch_docs,
                embeddings=self.model.encode(batch_docs).tolist(),
                ids=batch_ids,
                metadatas=batch_metas
            )

            total_indexed += len(batch_docs)
            print(f"Indexed {total_indexed}/{len(documents)} documents")

        return total_indexed

    def _generate_id(self, text: str) -> str:
        """Generate deterministic ID from content hash."""
        return hashlib.sha256(text.encode()).hexdigest()[:16]

    def search(
        self,
        query: str,
        top_k: int = 5,
        filters: Optional[Dict] = None,
        min_score: float = 0.0
    ) -> List[Dict]:
        """Search for relevant documents."""
        results = self.collection.query(
            query_embeddings=self.model.encode([query]).tolist(),
            n_results=top_k,
            where=filters,
            include=["documents", "metadatas", "distances"]
        )

        documents = []
        for i in range(len(results['ids'][0])):
            distance = results['distances'][0][i]
            # The collection uses cosine space, where distance = 1 - cosine similarity.
            # Clamp at 0 so the score stays in the 0-1 range (higher is better).
            similarity = max(0.0, 1.0 - distance)

            if similarity >= min_score:
                documents.append({
                    "id": results['ids'][0][i],
                    "content": results['documents'][0][i],
                    "metadata": results['metadatas'][0][i],
                    "similarity": similarity,
                    "distance": distance
                })

        return documents

    def ask(
        self,
        question: str,
        context_docs: int = 3,
        filters: Optional[Dict] = None
    ) -> Dict:
        """
        Answer a question by finding relevant documents.

        Returns the most relevant documents and suggests an answer
        based on retrieved context.
        """
        relevant_docs = self.search(
            question,
            top_k=context_docs,
            filters=filters,
            min_score=0.1
        )

        if not relevant_docs:
            return {
                "question": question,
                "answer": "No relevant documents found.",
                "sources": [],
                "total_found": 0
            }

        # Build context from top documents
        context = "\n\n".join([
            f"[Source {i+1}]: {doc['content']}"
            for i, doc in enumerate(relevant_docs)
        ])

        # Format response
        return {
            "question": question,
            "answer": f"Based on {len(relevant_docs)} relevant source(s):\n\n{context}",
            "sources": [
                {
                    "content": doc['content'][:200] + "..." if len(doc['content']) > 200 else doc['content'],
                    "metadata": doc['metadata'],
                    "confidence": f"{doc['similarity']:.2%}"
                }
                for doc in relevant_docs
            ],
            "total_found": len(relevant_docs)
        }

    def stats(self) -> Dict:
        """Get index statistics."""
        return {
            "total_documents": self.collection.count(),
            "embedding_dimension": self.dimension,
            "model": self.model_name,
            "collection_name": self.collection.name
        }


# Demo usage
if __name__ == "__main__":
    # Initialize system
    qa = DocumentQASystem(persist_directory="./demo_qa")

    # Sample documents
    documents = [
        ("Python was created by Guido van Rossum in 1991.", {"topic": "python", "year": 1991}),
        ("Python supports multiple programming styles including OOP.", {"topic": "python", "concept": "styles"}),
        ("FastAPI is a modern Python web framework for building APIs.", {"topic": "fastapi", "category": "framework"}),
        ("ChromaDB is a vector database for AI applications.", {"topic": "chromadb", "category": "database"}),
        ("FAISS is a library for efficient similarity search.", {"topic": "faiss", "category": "library"}),
        ("Embeddings convert text to numerical vectors.", {"topic": "embeddings", "concept": "vectors"}),
        ("Docker containers package applications with their dependencies.", {"topic": "docker", "category": "devops"}),
        ("Kubernetes automates deployment and scaling of containers.", {"topic": "kubernetes", "category": "devops"}),
    ]

    # Ingest documents
    print("Ingesting documents...")
    texts = [d[0] for d in documents]
    metas = [d[1] for d in documents]
    qa.ingest(texts, metas)

    # Show stats
    print(f"\nIndex stats: {qa.stats()}")

    # Run queries
    print("\n" + "="*60)
    print("QUERY 1: 'Tell me about Python programming'")
    print("="*60)
    result = qa.ask("Tell me about Python programming")
    print(result["answer"])
    print(f"\nConfidence scores: {[s['confidence'] for s in result['sources']]}")

    print("\n" + "="*60)
    print("QUERY 2: 'What is vector database technology?'")
    print("="*60)
    result = qa.ask("What is vector database technology?")
    print(result["answer"])

    print("\n" + "="*60)
    print("QUERY 3: Filter by category='devops'")
    print("="*60)
    # The sample documents tag devops under "category", not "topic"
    result = qa.ask("deployment and scaling", filters={"category": {"$eq": "devops"}})
    print(result["answer"])
```

25 min
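The `ask` method in this chapter returns the retrieved passages themselves as the "answer". To complete the generation half of retrieval-augmented generation, those passages would be placed into a prompt for a language model. The sketch below shows only that prompt-assembly step, under the assumption that each document is a dict with a `"content"` key, as returned by `search`. The function name `build_prompt`, the instruction wording, and the character budget are illustrative choices, and no model call is shown.

```python
from typing import Dict, List


def build_prompt(question: str, docs: List[Dict], max_context_chars: int = 2000) -> str:
    """Assemble a grounded prompt from retrieved documents.

    Documents are added in ranked order until the character budget is
    reached, so the most relevant sources are kept when space runs out.
    """
    parts: List[str] = []
    used = 0
    for i, doc in enumerate(docs, start=1):
        entry = f"[Source {i}]: {doc['content']}"
        if used + len(entry) > max_context_chars:
            break
        parts.append(entry)
        used += len(entry)

    context = "\n\n".join(parts) if parts else "(no sources retrieved)"
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [Source N]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The resulting string can be sent to whichever local model you run. A character budget is a rough stand-in for a token budget and would need adjusting to the model's context window.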