Semantic Search

Semantic search retrieves results based on meaning rather than exact keyword matches. Instead of looking for literal word occurrences, it converts both the query and the documents into vector embeddings using a neural network (often a BERT-style model), then finds the closest vectors via cosine similarity or dot product. This allows a query like "cheap electric cars" to return results about "affordable EVs" even if those exact words are absent. Operators encounter semantic search when setting up retrieval-augmented generation (RAG) pipelines, where a local embedding model (e.g., all-MiniLM-L6-v2) indexes documents and a vector store (e.g., the Chroma database or the FAISS library) performs the similarity search.
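As a rough illustration of the similarity step, the sketch below scores two documents against a query using cosine similarity. The three-dimensional vectors are hand-written, hypothetical values standing in for real embeddings, which typically have hundreds of dimensions; the function names are likewise illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

# Hypothetical embeddings, invented for illustration only.
query = [0.9, 0.8, 0.1]  # stands in for "cheap electric cars"
docs = {
    "affordable EVs": [0.8, 0.9, 0.2],
    "luxury yachts": [0.1, 0.2, 0.9],
}

scores = {title: cosine(query, vec) for title, vec in docs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))

# On unit-length vectors, the plain dot product equals cosine similarity.
unit_dot = dot(normalize(query), normalize(docs[best]))
```

Because the two measures coincide on unit-length vectors, many setups normalize embeddings once at indexing time and use the cheaper dot product at query time.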

Deeper dive

Semantic search relies on embedding models that map text to dense vectors in a high-dimensional space. These models are trained with objectives such as natural language inference or contrastive learning so that semantically similar texts land near each other. The process has four steps: (1) embed all documents offline, (2) embed the query at runtime, (3) compute similarity scores between the query vector and each document vector (cosine similarity and dot product produce the same ranking when the vectors are normalized to unit length), and (4) return the top-k results. Key operator choices: model size and embedding dimension (e.g., 384-dim vs 768-dim) trade speed and memory against accuracy; quantization (e.g., int8) reduces memory but may lower retrieval accuracy. Local embedding models run on CPU or GPU; on a consumer GPU, a 384-dim model can embed roughly 1,000 docs/sec, depending on document length and batch size. For RAG, the retrieval step is typically combined with a generative model (e.g., Llama 3) that answers based on the retrieved context. Semantic search differs from lexical search such as BM25, which scores documents by term frequency and inverse document frequency over exact token matches; hybrid approaches combine both for robustness.
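The four steps above can be sketched as a small brute-force index. This is a minimal illustration, not how Chroma or FAISS work internally: TinyIndex and FAKE_VECTORS are hypothetical names, and the embedder is a lookup table of invented vectors rather than a real model.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class TinyIndex:
    """Brute-force vector index: stores unit vectors, scores by dot product."""

    def __init__(self, embed):
        self.embed = embed  # callable: text -> list of floats
        self.items = []     # (text, unit_vector) pairs

    def add(self, texts):
        # Step 1: embed documents once, offline.
        for text in texts:
            self.items.append((text, normalize(self.embed(text))))

    def search(self, query, k=2):
        # Step 2: embed the query at runtime.
        q = normalize(self.embed(query))
        # Step 3: score every stored document against the query.
        scored = [
            (sum(a * b for a, b in zip(q, vec)), text)
            for text, vec in self.items
        ]
        # Step 4: return the k highest-scoring (score, text) pairs, best first.
        return heapq.nlargest(k, scored)

# Stand-in embedder: hypothetical vectors, not output from a real model.
FAKE_VECTORS = {
    "reset a forgotten password": [0.9, 0.1, 0.0],
    "recover account access": [0.6, 0.6, 0.1],
    "quarterly sales report": [0.0, 0.2, 0.9],
    "I can't log in": [0.85, 0.2, 0.05],
}

index = TinyIndex(FAKE_VECTORS.__getitem__)
index.add([
    "reset a forgotten password",
    "recover account access",
    "quarterly sales report",
])
results = index.search("I can't log in", k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Scoring every document is fine for a few thousand vectors; dedicated vector stores add approximate nearest-neighbor structures so that step 3 does not have to touch every document.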

Practical example

An operator building a local RAG system for a codebase uses sentence-transformers/all-MiniLM-L6-v2 (384-dim, ~80 MB) to embed 10,000 code comments. With ChromaDB, the index takes ~150 MB of RAM. A query like "how to handle authentication errors" retrieves the top 5 relevant comments in ~50 ms on a CPU. Switching to a larger model like BAAI/bge-large-en-v1.5 (1024-dim, ~1.3 GB) improves accuracy but increases query latency to ~200 ms and index memory to ~500 MB.
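To put the memory figures in context, the arithmetic below estimates the size of the raw embedding vectors alone for the 10,000-comment example. It assumes 4-byte float32 values (and 1-byte int8 values for the quantized case); the function name is illustrative. It deliberately ignores everything else an index holds, such as the stored text, metadata, and search structures, which is presumably where most of the quoted RAM goes.

```python
def raw_vector_bytes(num_docs, dim, bytes_per_value=4):
    """Bytes needed to store the embedding vectors alone (no index overhead)."""
    return num_docs * dim * bytes_per_value

small = raw_vector_bytes(10_000, 384)         # 384-dim, float32
large = raw_vector_bytes(10_000, 1024)        # 1024-dim, float32
small_int8 = raw_vector_bytes(10_000, 384, 1) # 384-dim, int8

for label, size in [("384-dim float32", small),
                    ("1024-dim float32", large),
                    ("384-dim int8", small_int8)]:
    print(f"{label}: {size / 1e6:.2f} MB")
```

The raw vectors grow linearly with dimension (1024 / 384 is roughly 2.7x), so the dimension of the chosen model sets a floor on index memory before any overhead is counted.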

Workflow example

In a typical RAG workflow with Ollama and LangChain, the operator runs ollama pull nomic-embed-text to get a local embedding model. The code calls embeddings = OllamaEmbeddings(model="nomic-embed-text") to embed documents, then stores them in Chroma. At query time, retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4}) fetches the four most similar chunks. The retrieved context is passed to a chat model like llama3.1:8b for answer generation. The operator can adjust k to balance recall and precision; to filter by a similarity threshold, the retriever can instead be created with search_type="similarity_score_threshold" and a score_threshold entry in search_kwargs.
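The pieces described above might be wired together roughly as follows. This is a sketch under several assumptions: the langchain-ollama and langchain-chroma packages are installed, an Ollama server is running locally with both models pulled, and import paths match a recent LangChain release (they have moved between versions). The helper names build_prompt and answer, and the prompt wording, are illustrative and not part of any library.

```python
def build_prompt(question, chunks):
    """Join retrieved chunks into a single grounded prompt (illustrative wording)."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def answer(question, texts, k=4):
    """Embed texts, retrieve the k most similar chunks, and ask the chat model."""
    # Imported here so build_prompt stays usable without these packages.
    # Assumed packages: langchain-ollama and langchain-chroma.
    from langchain_chroma import Chroma
    from langchain_ollama import ChatOllama, OllamaEmbeddings

    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    # In-memory Chroma collection built from the raw texts.
    vectorstore = Chroma.from_texts(texts, embedding=embeddings)
    retriever = vectorstore.as_retriever(
        search_type="similarity", search_kwargs={"k": k}
    )
    docs = retriever.invoke(question)

    llm = ChatOllama(model="llama3.1:8b")
    reply = llm.invoke(build_prompt(question, [d.page_content for d in docs]))
    return reply.content

# Example call (requires a running Ollama server):
# print(answer("How are auth errors handled?", ["chunk one", "chunk two"]))
```

Keeping prompt construction in its own function makes it easy to inspect exactly what context reaches the chat model when tuning k.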

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work