RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Frameworks & tools

LlamaIndex

LlamaIndex is a data framework for building retrieval-augmented generation (RAG) applications. It provides tools to ingest, index, and query external data (documents, databases, APIs) alongside a local LLM. Operators use it to connect a local model to their own data without sending it to a cloud service. LlamaIndex handles chunking documents, creating vector embeddings, and managing a query engine that retrieves relevant context before prompting the LLM. It runs entirely locally when paired with a local embedding model and a local LLM, making it suitable for private data workflows.

Deeper dive

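LlamaIndex structures the RAG pipeline into components: readers (load data from PDFs, websites, etc.), transformations (split text into chunks, which the framework calls nodes, and generate embeddings), indexes (organize nodes and, for vector indexes, store their embeddings in a vector store), and retrievers (fetch relevant chunks at query time). It supports multiple index types, including the vector store index (semantic search over embeddings), the summary index (keeps nodes in sequence and, by default, passes all of them to the LLM), and the keyword table index (looks up nodes by extracted keywords). BM25 and hybrid search are retrieval strategies rather than index types: BM25 is available as a separate retriever, and hybrid retrieval combines keyword and vector results. Operators can configure chunk size (e.g., 512 tokens) and overlap (e.g., 20 tokens) to balance retrieval granularity against context window usage; smaller chunks give more precise matches but carry less surrounding context per hit. LlamaIndex integrates with local LLMs via llama.cpp, Ollama, or Hugging Face Transformers, and with local embedding models like BAAI/bge-small-en-v1.5. It also offers a query engine for single-turn retrieval and a chat engine for multi-turn conversations. The framework abstracts away boilerplate but exposes knobs for advanced tuning, such as top-k retrieval (how many chunks to return), similarity cutoff (a minimum score below which chunks are dropped), and reranking (reordering retrieved chunks with a second model).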
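To make the chunking and retrieval knobs concrete, here is a minimal sketch in plain Python. It is not LlamaIndex's implementation: the function names chunk_tokens and retrieve are hypothetical, tokens are stand-in list items, and the embedding vectors are toy values. It only illustrates what chunk size, overlap, top-k, and similarity cutoff control.

```python
import math


def chunk_tokens(tokens, chunk_size=512, overlap=20):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunk_vecs, top_k=5, similarity_cutoff=0.0):
    """Return (chunk_index, score) pairs: best top_k at or above the cutoff."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored = [pair for pair in scored if pair[0] >= similarity_cutoff]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:top_k]]


# Toy usage: 1000 placeholder tokens, then a 2-D "embedding" lookup.
chunks = chunk_tokens(list(range(1000)), chunk_size=512, overlap=20)
hits = retrieve([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                top_k=2, similarity_cutoff=0.5)
```

Raising the cutoff trades recall for precision: fewer chunks reach the LLM, but the ones that do are closer matches. Raising top-k does the opposite and consumes more of the context window.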

Practical example

An operator has a 500-page technical manual in PDF and wants to ask questions using a local LLM (e.g., Llama 3.1 8B). With LlamaIndex, they write a Python script that loads the PDF, chunks it into 512-token segments, generates embeddings using a local model like BAAI/bge-small-en-v1.5 (runs on CPU or GPU), and stores them in a local vector store such as ChromaDB. At query time, the five highest-scoring chunks are retrieved and passed to the LLM as context. The whole pipeline runs on a single RTX 3060 12GB, with embedding generation taking roughly 2 seconds per 1,000 chunks and query latency of about 3 seconds.
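The sizing in this example can be checked with a little arithmetic. The sketch below is illustrative only: count_chunks and context_tokens are hypothetical helper names, and tokens_per_page is a placeholder assumption (real manuals vary widely), so measure your own document rather than trusting the chunk total.

```python
import math


def count_chunks(total_tokens, chunk_size=512, overlap=0):
    """Number of fixed-size windows needed to cover total_tokens."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap  # tokens of new text added per extra chunk
    return 1 + math.ceil((total_tokens - chunk_size) / step)


def context_tokens(top_k=5, chunk_size=512):
    """Upper bound on retrieved-context tokens added to each prompt."""
    return top_k * chunk_size


# ASSUMPTION: 400 tokens per page is a placeholder, not a measured value.
tokens_per_page = 400
manual_chunks = count_chunks(500 * tokens_per_page, chunk_size=512)

# Five 512-token chunks add at most this many tokens to every query.
budget = context_tokens(top_k=5, chunk_size=512)  # 2560
```

The retrieved context is only part of the prompt: the question, the prompt template, and the generated answer also have to fit inside the context window configured in the runtime serving the model.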

Workflow example

In a typical workflow, an operator runs pip install llama-index and writes a short script: from llama_index.core import VectorStoreIndex, SimpleDirectoryReader; documents = SimpleDirectoryReader('./docs').load_data(); index = VectorStoreIndex.from_documents(documents); query_engine = index.as_query_engine(); response = query_engine.query('What is the torque spec for bolt A?'). Out of the box the framework defaults to OpenAI's hosted models for both embeddings and generation (e.g., gpt-3.5-turbo as the LLM, depending on version), so this script sends data to the cloud unless the defaults are overridden. To stay local, operators install the integration packages (pip install llama-index-llms-ollama llama-index-embeddings-huggingface) and set, before building the index, Settings.llm = Ollama(model='llama3.1:8b') and Settings.embed_model = HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5'). The index lives in memory by default; calling index.storage_context.persist(persist_dir='./storage') writes it to disk for reuse.
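Laid out as a script, the workflow might look like the sketch below. It assumes llama-index, llama-index-llms-ollama, and llama-index-embeddings-huggingface are installed and that an Ollama server is running with llama3.1:8b pulled. The helper names (configure_local_models, load_or_build_index, ask), the ./docs and ./storage paths, and the request_timeout value are illustrative choices, not part of the original example.

```python
import os


def configure_local_models():
    """Point LlamaIndex at a local LLM and a local embedding model."""
    # Imported lazily so this module loads even before the packages are installed.
    from llama_index.core import Settings
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.ollama import Ollama

    # ASSUMPTION: a generous timeout, since local generation can be slow.
    Settings.llm = Ollama(model="llama3.1:8b", request_timeout=120.0)
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    Settings.chunk_size = 512
    Settings.chunk_overlap = 20


def load_or_build_index(docs_dir="./docs", persist_dir="./storage"):
    """Reload a persisted index if one exists, otherwise build and save it."""
    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if os.path.isdir(persist_dir):
        # Reuse embeddings computed on an earlier run.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context)

    documents = SimpleDirectoryReader(docs_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)
    return index


def ask(question, top_k=5):
    """Retrieve the top_k chunks and let the local LLM answer from them."""
    configure_local_models()
    index = load_or_build_index()
    query_engine = index.as_query_engine(similarity_top_k=top_k)
    return query_engine.query(question)
```

With the prerequisites in place, print(ask('What is the torque spec for bolt A?')) runs the full pipeline. Setting the models before loading or building the index matters: an index must be queried with the same embedding model that built it, so changing the embedding model means deleting the persisted directory and re-indexing.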

Reviewed by Fredoline Eruo. See our editorial policy.
