HOW-TO · RAG

How to Create Context-Aware Chunks with Parent Document

advanced · 25 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

RAG pipeline with hierarchical retrieval support, LangChain installed

What this does

Parent document chunking creates a two-tier retrieval strategy: small, focused child chunks for precise embedding matching, and larger parent chunks that supply full context when a match is confirmed. This approach prevents the common problem of splitting critical context across disconnected pieces, while still allowing fine-grained semantic search. The parent document acts as a boundary that keeps semantically related content together during retrieval.

Steps

  1. Load your source document as a LangChain Document object using the appropriate loader.
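  2. Split each document into small child chunks (e.g., 200–400 tokens) using RecursiveCharacterTextSplitter, storing the parent document ID as metadata on each child chunk. The splitter measures chunk size in characters by default, so either supply a token-based length function or convert the token target to an approximate character count.
  3. Create a mapping dictionary, keyed by parent ID, that holds each parent document's full text and metadata so that any child chunk can be resolved to its parent through the ID it carries.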
  4. Embed and index the child chunks into your vector store, preserving the parent ID in each chunk's metadata.
  5. During retrieval, query the vector store to fetch the top-k child chunks based on embedding similarity.
  6. Resolve the parent document IDs from the retrieved child chunks and return the parent documents as the final context to your LLM.
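
The steps above can be sketched end to end without any external dependencies. This is a minimal illustration, not the LangChain implementation: split_text stands in for RecursiveCharacterTextSplitter, embed is a toy bag-of-words vector standing in for a real embedding model, the in-memory index list stands in for a vector store, and the sample documents and IDs are hypothetical.

```python
import math
import re
from collections import Counter

def split_text(text, chunk_size=80):
    """Greedy sentence packer; a stand-in for a real text splitter."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def embed(text):
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Steps 1 and 3: parent documents keyed by ID (hypothetical sample data).
parents = {
    "doc-deploy": "Deployment errors usually come from missing environment variables. "
                  "Check the release log before retrying. Rollbacks restore the previous image.",
    "doc-billing": "Invoices are generated monthly. Refunds take five business days. "
                   "Contact support for disputed charges.",
}

# Steps 2 and 4: split into children, tag each with parent_id, and index them.
index = []
for parent_id, text in parents.items():
    for i, chunk in enumerate(split_text(text)):
        index.append({
            "text": chunk,
            "metadata": {"parent_id": parent_id, "chunk": i},
            "vector": embed(chunk),
        })

def retrieve_parents(query, k=3):
    # Step 5: rank children by similarity and keep the top k with a nonzero score.
    q = embed(query)
    scored = sorted(((cosine(q, c["vector"]), c) for c in index),
                    key=lambda pair: pair[0], reverse=True)
    top = [child for score, child in scored[:k] if score > 0]
    # Step 6: resolve parent IDs in rank order, dropping duplicates.
    ordered_ids = []
    for child in top:
        pid = child["metadata"]["parent_id"]
        if pid not in ordered_ids:
            ordered_ids.append(pid)
    return [(pid, parents[pid]) for pid in ordered_ids]

results = retrieve_parents("deployment errors release log")
for pid, text in results:
    print(pid, len(text))
```

In this sample, two child chunks from the same document match the query and resolve to a single parent, so the LLM would receive the whole deployment document rather than two fragments.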

Verification

Run a retrieval query and confirm that the returned context includes full parent document text rather than isolated snippets. Verify metadata by checking that each child chunk contains a parent_id field matching the source document identifier. Log the number of unique parent documents returned versus the number of child chunks retrieved; you should see consolidation (fewer parents than children) when semantically related children cluster under the same parent.

Expected output: Retrieved 3 parent documents for query "deployment errors" with parent_id fields present in returned metadata.
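
One way to automate these checks is a small helper that validates parent_id metadata and reports the children-to-parents ratio. The sketch below assumes retrieved children are plain dicts with a metadata field and that parents live in a dictionary keyed by ID; the function name summarize_retrieval and the sample data are hypothetical.

```python
def summarize_retrieval(query, children, parent_store):
    """Check parent_id on every child and report how many parents they resolve to."""
    parent_ids = []
    for child in children:
        pid = child.get("metadata", {}).get("parent_id")
        if pid is None:
            raise ValueError("child chunk is missing parent_id metadata")
        if pid not in parent_store:
            raise KeyError(f"parent_id {pid!r} has no entry in the parent store")
        parent_ids.append(pid)
    unique = sorted(set(parent_ids))
    message = (f'Retrieved {len(unique)} parent documents for query "{query}" '
               f"from {len(children)} child chunks")
    return unique, message

# Hypothetical retrieval result: five children spread over three parents.
parent_store = {"p1": "Full text of parent 1",
                "p2": "Full text of parent 2",
                "p3": "Full text of parent 3"}
children = [{"text": f"snippet {n}", "metadata": {"parent_id": pid}}
            for n, pid in enumerate(["p1", "p1", "p2", "p3", "p1"])]

unique, message = summarize_retrieval("deployment errors", children, parent_store)
print(message)
```

A parent count equal to the child count on every query is not necessarily a bug, but it suggests the children rarely cluster and the parent tier is adding little consolidation.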

Common failures

  1. Chunk boundaries break semantic units: Headers, table rows, or list items split across chunks cause parent chunks to contain misaligned content. Use semantic splitters that respect structural boundaries.
  2. Parent ID not persisted through indexing: If the vector store integration strips metadata, parent IDs are lost. Verify metadata fields are stored and returned by performing a direct vector store lookup.
  3. Over-retrieval of parent documents: Fetching too many child chunks inflates the number of parent documents and can exceed the LLM's context limit. Lower k where possible, and implement a deduplication step that collapses multiple children from the same parent into a single context entry.
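
The deduplication step from the third failure can be sketched as follows. This is an illustration under stated assumptions: children arrive sorted best-first, parents live in a dictionary keyed by ID, and max_chars is a rough character budget standing in for a real token count. The function name collapse_to_parents is hypothetical.

```python
def collapse_to_parents(children, parent_store, max_chars=2000):
    """Collapse ranked children into unique parents, stopping at a size budget."""
    context, seen, used = [], set(), 0
    for child in children:  # assumed to be sorted best match first
        pid = child["metadata"]["parent_id"]
        if pid in seen:
            continue  # a higher-ranked child already resolved to this parent
        seen.add(pid)
        text = parent_store[pid]
        if used + len(text) > max_chars:
            break  # stop rather than skip, so rank order is never violated
        context.append(text)
        used += len(text)
    return context

# Hypothetical data: four children, three parents of 10 characters each.
parent_store = {"a": "A" * 10, "b": "B" * 10, "c": "C" * 10}
children = [{"metadata": {"parent_id": pid}} for pid in ["a", "a", "b", "c"]]

context = collapse_to_parents(children, parent_store, max_chars=25)
print(len(context))
```

Stopping at the first parent that does not fit keeps the context strictly in rank order; skipping it and trying smaller, lower-ranked parents is a reasonable alternative when filling the budget matters more than ordering.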

Related guides

  • use-semantic-chunking-embedding-similarity
  • build-bm25-vector-hybrid-retrieval