RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Large language models

Semantic Search

Semantic search retrieves results by meaning rather than by exact keyword match. Instead of looking for literal word occurrences, it converts both the query and the documents into vector embeddings with a neural network (often a BERT-style encoder), then ranks documents by how close their vectors are to the query vector, measured by cosine similarity or dot product. This lets a query like "cheap electric cars" return results about "affordable EVs" even if those exact words are absent. Operators encounter semantic search when setting up retrieval-augmented generation (RAG) pipelines, where a local embedding model (e.g., all-MiniLM-L6-v2) indexes documents and a vector store (e.g., the Chroma database or a FAISS index) performs the similarity search.
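To make the similarity step concrete, here is a minimal sketch using only the Python standard library. The three-dimensional vectors are hand-picked, hypothetical stand-ins for real embeddings (an actual model emits hundreds of dimensions), and the document labels are illustrative only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Hypothetical toy "embeddings", chosen by hand for illustration.
query = [0.9, 0.1, 0.3]      # stands in for "cheap electric cars"
doc_close = [0.8, 0.2, 0.4]  # stands in for "affordable EVs"
doc_far = [0.1, 0.9, 0.2]    # stands in for an unrelated document

sim_close = cosine_similarity(query, doc_close)
sim_far = cosine_similarity(query, doc_far)

# On unit-length vectors, the plain dot product equals cosine similarity.
dot_on_unit = dot(normalize(query), normalize(doc_close))

print(f"close: {sim_close:.3f}  far: {sim_far:.3f}")
```

The vector pointing in nearly the same direction as the query scores higher, which is all "closest vector" means here. Normalizing once at indexing time lets a store use the cheaper dot product and still get cosine rankings.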

Deeper dive

Semantic search relies on embedding models that map text to dense vectors in a high-dimensional space. These models are trained with objectives such as natural language inference or contrastive learning so that semantically similar texts land near each other.

The process has four steps: (1) embed all documents offline, (2) embed the query at runtime, (3) compute similarity scores (cosine or dot product) between the query vector and each document vector, (4) return the top-k results. When vectors are normalized to unit length, cosine similarity and dot product produce the same ranking.

Key operator choices: the embedding model and its output dimension (e.g., 384-dim vs 768-dim) trade speed and memory against accuracy; quantization (e.g., int8) reduces memory but may lower retrieval accuracy. Local embedding models run on CPU or GPU; on a consumer GPU, a small 384-dim model can embed on the order of 1,000 short documents per second, though throughput depends on document length, batch size, and hardware.

For RAG, the retrieval step is typically paired with a generative model (e.g., Llama 3) that answers from the retrieved context. Semantic search differs from lexical search such as BM25, which scores documents on term statistics (term frequency, inverse document frequency, and document length); hybrid approaches combine both for robustness.
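The four steps can be sketched end to end in plain Python. Note the loud assumption: embed() below is a toy stand-in, not a neural model. It maps a handful of hard-coded words onto hypothetical "concept axes" so the example runs without any dependencies; in a real pipeline this function would call an embedding model such as all-MiniLM-L6-v2.

```python
import math

# ASSUMPTION: a hand-written word-to-concept table standing in for a trained
# embedding model. Axes: 0 = low price, 1 = electric, 2 = vehicle, 3 = baking.
CONCEPT_AXES = {
    "cheap": [0], "affordable": [0],
    "electric": [1], "evs": [1, 2],
    "cars": [2], "vehicles": [2],
    "sourdough": [3], "bread": [3], "baking": [3],
}
DIM = 4

def embed(text):
    """Toy embedding: count concept hits, then scale to unit length."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        for axis in CONCEPT_AXES.get(word, []):
            vec[axis] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def top_k(query, docs, doc_vectors, k):
    """Steps 2-4: embed the query, score every document, keep the best k."""
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), doc)  # dot product on unit vectors
        for doc, vec in zip(docs, doc_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

docs = [
    "affordable evs for commuters",
    "sourdough bread baking",
    "luxury vehicles review",
]
doc_vectors = [embed(d) for d in docs]  # step 1: done once, offline

results = top_k("cheap electric cars", docs, doc_vectors, k=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

The query shares no words with the top result, yet it ranks first because both map to the same axes. This brute-force scan over every document is fine at small scale; vector stores replace it with an approximate nearest-neighbor index when collections grow.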

Practical example

An operator building a local RAG system for a codebase uses sentence-transformers/all-MiniLM-L6-v2 (384-dim, ~80 MB) to embed 10,000 code comments. With ChromaDB, the index takes ~150 MB of RAM. A query like "how to handle authentication errors" retrieves the top 5 relevant comments in ~50 ms on a CPU. Switching to a larger model such as BAAI/bge-large-en-v1.5 (1024-dim, ~1.3 GB) improves accuracy but raises query latency to ~200 ms and index memory to ~500 MB.
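As a sanity check on index sizing, the raw vector storage for this example is simple arithmetic. The sketch below assumes float32 values (4 bytes each) and counts the vectors alone; the helper name is hypothetical. The larger RAM figures quoted above presumably also cover the stored text, metadata, and index structures, which this calculation does not model.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed for the vectors alone (float32 by default)."""
    return num_vectors * dim * bytes_per_value

NUM_COMMENTS = 10_000

# 384-dim vs 1024-dim vectors stored as float32.
minilm_mb = raw_vector_bytes(NUM_COMMENTS, 384) / 1e6
bge_large_mb = raw_vector_bytes(NUM_COMMENTS, 1024) / 1e6

# The same 384-dim vectors stored as int8 (1 byte per value).
minilm_int8_mb = raw_vector_bytes(NUM_COMMENTS, 384, bytes_per_value=1) / 1e6

print(f"384-dim float32:  {minilm_mb:.2f} MB")
print(f"1024-dim float32: {bge_large_mb:.2f} MB")
print(f"384-dim int8:     {minilm_int8_mb:.2f} MB")
```

Raw vector storage grows linearly with both document count and dimension, so moving from 384 to 1024 dimensions multiplies it by about 2.7, and int8 storage cuts it to a quarter of float32.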

Workflow example

In a typical RAG workflow with Ollama and LangChain, the operator runs ollama pull nomic-embed-text to download a local embedding model. The code creates embeddings = OllamaEmbeddings(model="nomic-embed-text"), uses it to embed the documents, and stores the vectors in Chroma. At query time, retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4}) fetches the four most similar chunks. The retrieved context is passed to a chat model such as llama3.1:8b for answer generation. The operator can adjust k to balance recall and precision; to also apply a similarity cutoff, switch to search_type="similarity_score_threshold" and set score_threshold in search_kwargs.
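The effect of k and a score cutoff can be illustrated without LangChain or Ollama installed. The sketch below is not LangChain's implementation; select_chunks, the chunk names, and the scores are all hypothetical, and it only shows the selection logic an operator is tuning.

```python
def select_chunks(scored_chunks, k, score_threshold=None):
    """Pick chunks to pass to the chat model.

    scored_chunks: list of (chunk_text, similarity) pairs, higher = more similar.
    k: maximum number of chunks to return.
    score_threshold: optional minimum similarity a chunk must reach.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    if score_threshold is not None:
        ranked = [pair for pair in ranked if pair[1] >= score_threshold]
    return ranked[:k]

# Hypothetical similarity scores for the query
# "how to handle authentication errors".
scored = [
    ("reset password flow", 0.82),
    ("auth error handling", 0.91),
    ("logging setup", 0.40),
    ("token refresh on 401", 0.77),
    ("css grid layout", 0.12),
]

loose = select_chunks(scored, k=4)                         # top-k only
strict = select_chunks(scored, k=4, score_threshold=0.75)  # top-k plus cutoff

print([chunk for chunk, _ in loose])
print([chunk for chunk, _ in strict])
```

With k alone, the weakly related "logging setup" chunk still reaches the model because it is fourth best; adding the cutoff drops it and returns three chunks instead of four. A larger k favors recall, while a higher threshold favors precision at the risk of returning nothing for hard queries.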

Reviewed by Fredoline Eruo. See our editorial policy.
