12. Semantic Search Engine

Chapter 12 of 18 · 15 min

KEY INSIGHT

A semantic search engine combines embedding generation, vector storage, and query handling into a cohesive system. This chapter builds the core architecture. Later chapters add optimization and persistence.

```python
from typing import Dict, List, Optional

import chromadb
from sentence_transformers import SentenceTransformer


class SemanticSearchEngine:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()
        self.client = chromadb.PersistentClient(path="./search_index")
        # A raw SentenceTransformer is not a Chroma embedding function,
        # so embeddings are computed explicitly in _embed() and passed in.
        self.collection = self.client.get_or_create_collection(
            name="documents",
            metadata={"hnsw:space": "cosine"},
        )

    def _embed(self, texts: List[str]) -> List[List[float]]:
        """Encode texts into plain lists of floats."""
        return self.model.encode(texts).tolist()

    def index_documents(
        self,
        documents: List[str],
        ids: Optional[List[str]] = None,
        metadatas: Optional[List[Dict]] = None,
    ) -> int:
        """Index a batch of documents."""
        if ids is None:
            ids = [f"doc_{i}" for i in range(len(documents))]
        self.collection.add(
            documents=documents,
            embeddings=self._embed(documents),
            ids=ids,
            metadatas=metadatas,
        )
        return len(documents)

    def search(
        self,
        query: str,
        top_k: int = 5,
        filters: Optional[Dict] = None,
    ) -> List[Dict]:
        """Search for semantically similar documents."""
        results = self.collection.query(
            query_embeddings=self._embed([query]),
            n_results=top_k,
            where=filters,
            include=["documents", "metadatas", "distances"],
        )
        # Results are nested per query; index 0 is our single query.
        return [
            {
                "id": results["ids"][0][i],
                "document": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "distance": results["distances"][0][i],
            }
            for i in range(len(results["ids"][0]))
        ]

    def count(self) -> int:
        return self.collection.count()


# Usage example
engine = SemanticSearchEngine()

# Index documents
docs = [
    "Python list comprehensions allow concise list creation",
    "Java streams provide functional programming patterns",
    "Docker containers package applications with dependencies",
    "FastAPI makes building REST APIs simple and fast",
]
metas = [
    {"category": "python", "difficulty": "beginner"},
    {"category": "java", "difficulty": "intermediate"},
    {"category": "devops", "difficulty": "intermediate"},
    {"category": "backend", "difficulty": "beginner"},
]
engine.index_documents(docs, metadatas=metas)

# Search
results = engine.search("functional programming in Java", top_k=2)
for r in results:
    print(f"[{r['id']}] {r['document']} (dist: {r['distance']:.3f})")
```

Output:

```
[doc_1] Java streams provide functional programming patterns (dist: 0.182)
[doc_0] Python list comprehensions allow concise list creation (dist: 0.456)
```
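The `distance` values in the output come from the collection's `"hnsw:space": "cosine"` setting, where a lower value means a closer match. As a rough illustration of that ranking, the sketch below computes cosine distance as 1 minus cosine similarity over hand-made three-dimensional vectors. The toy vectors and the `cosine_distance` and `rank` helpers are hypothetical stand-ins for illustration only, not the model's real embeddings or Chroma's internal index.

```python
import math
from typing import Dict, List, Tuple


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank(
    query: List[float],
    vectors: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Brute-force ranking: score every vector, keep the top_k smallest distances."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


# Hypothetical toy vectors (real embeddings have hundreds of dimensions).
toy_vectors = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.0, 1.0, 0.0],
    "doc_2": [1.0, 1.0, 0.0],
}

# The query points in the same direction as doc_1, so doc_1 ranks first.
ranked = rank([0.0, 2.0, 0.0], toy_vectors)
for doc_id, dist in ranked:
    print(f"[{doc_id}] dist: {dist:.3f}")
```

Cosine distance ignores vector length, which is why the query `[0.0, 2.0, 0.0]` still has distance 0 to `doc_1`. A vector store replaces the brute-force loop with an approximate index, but the ordering rule is the same: smaller distance first.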

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
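One possible way to capture those details is a small helper that prints them as JSON next to the observed output. This is a minimal sketch using only the standard library; the `package_version` and `environment_record` names are hypothetical, and the package list and data path are assumptions taken from this chapter's example.

```python
import json
import platform
import sys
from importlib import metadata
from typing import Dict, Iterable


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_record(packages: Iterable[str], data_path: str) -> Dict[str, object]:
    """Collect the runtime details worth noting beside an exercise result."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "data_path": data_path,
        "packages": {name: package_version(name) for name in packages},
    }


# Assumed values: the two packages and the index path used in this chapter.
record = environment_record(["chromadb", "sentence-transformers"], "./search_index")
print(json.dumps(record, indent=2))
```

Paste the printed record alongside the search output so that a later run with a different model or library version can be compared against it.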

EXERCISE

Extend the SemanticSearchEngine class with a delete_document(id) method and a clear_index() method. Test all three operations: add, search, delete.