13. Indexing Strategies

Chapter 13 of 18 · 20 min

KEY INSIGHT

How you chunk documents before indexing determines search granularity: chunks that are too large lose precision, and chunks that are too small lose context.

### Chunking Strategies

The most common approach is to split documents into fixed-size chunks that overlap, so text cut at a chunk boundary still appears intact in the neighboring chunk.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks of at most chunk_size characters."""
    if not 0 <= overlap < chunk_size:
        # With overlap >= chunk_size the start index would never advance
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= text_length:
            break  # Final chunk reached; skip a tail that only repeats the overlap
        start = end - overlap  # Step back so consecutive chunks share `overlap` chars

    return chunks


# Example usage
long_document = """
Python was created by Guido van Rossum and first released in 1991.
It emphasizes code readability with its notable use of significant whitespace.
Python supports multiple programming styles, including structured, procedural,
reflective, object-oriented, and functional programming. It has a large
standard library referred to as the "batteries included" philosophy
of the Python community.
"""

chunks = chunk_text(long_document, chunk_size=150, overlap=30)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
```

Output (illustrative; because the split is by character count, actual chunk boundaries fall at fixed offsets and can land mid-sentence or mid-word):

```
Chunk 0: Python was created by Guido van Rossum and first released in 1991. It empha...
Chunk 1: Python supports multiple programming styles, including structured...
Chunk 2: Python has a large standard library referred to as the "batteries i...
```

### Choosing Chunk Size

| Use Case | Chunk Size | Reasoning |
|----------|------------|-----------|
| FAQ / Short answers | 100-200 chars | Each chunk is a complete answer |
| Technical docs | 300-500 chars | Capture individual concepts |
| Long articles | 500-1000 chars | Balance context and specificity |
| Books / Papers | 1000-2000 chars | Maintain paragraph-level context |

### Chunk Metadata

Store chunk context in metadata for filtering and display:

```python
from typing import Dict

# SemanticSearchEngine is assumed to be defined elsewhere in the course;
# chunk_text is the function from the Chunking Strategies section.


def index_document_with_chunks(
    engine: SemanticSearchEngine,
    doc_id: str,
    text: str,
    metadata: Dict,
    chunk_size: int = 500
) -> None:
    """Chunk a document and index each chunk with metadata linking it to its parent."""
    chunks = chunk_text(text, chunk_size)

    # Stable per-chunk IDs derived from the parent document ID
    ids = [f"{doc_id}_chunk_{i}" for i in range(len(chunks))]

    chunk_metadatas = [
        {
            **metadata,
            "parent_id": doc_id,
            "chunk_index": i,
            "total_chunks": len(chunks),
            "chunk_text": chunk[:200]  # First 200 chars for preview
        }
        for i, chunk in enumerate(chunks)
    ]

    engine.index_documents(chunks, ids=ids, metadatas=chunk_metadatas)
```

Later, when retrieving results, you can group chunks by `parent_id` and order them by `chunk_index` to reconstruct the full document context.
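The reassembly step can be sketched as follows. This is a minimal illustration, not the engine's own API: it assumes each search hit is a dict with a `"text"` key and a `"metadata"` key carrying the `parent_id` and `chunk_index` fields stored at indexing time. The helper names `group_chunks_by_parent` and `merge_overlapping` are hypothetical, and the result shape of your engine may differ.

```python
from collections import defaultdict
from typing import Dict, List


def group_chunks_by_parent(hits: List[Dict]) -> Dict[str, List[Dict]]:
    """Group hits by metadata["parent_id"], sorted by metadata["chunk_index"].

    Assumes a hypothetical hit shape: {"text": str, "metadata": {...}}.
    """
    grouped: Dict[str, List[Dict]] = defaultdict(list)
    for hit in hits:
        grouped[hit["metadata"]["parent_id"]].append(hit)
    for parent_hits in grouped.values():
        parent_hits.sort(key=lambda h: h["metadata"]["chunk_index"])
    return dict(grouped)


def merge_overlapping(chunks: List[str], overlap: int) -> str:
    """Rejoin consecutive chunks, dropping the overlap each one repeats.

    Only valid when the chunks are contiguous (no missing chunk_index)
    and were produced with the same fixed character overlap.
    """
    if not chunks:
        return ""
    return chunks[0] + "".join(chunk[overlap:] for chunk in chunks[1:])


# Example: hits arrive in relevance order, mixed across two documents.
# These chunks correspond to "abcdefghij" split with chunk_size=4, overlap=1.
hits = [
    {"text": "ghij", "metadata": {"parent_id": "doc1", "chunk_index": 2}},
    {"text": "abcd", "metadata": {"parent_id": "doc1", "chunk_index": 0}},
    {"text": "xyz", "metadata": {"parent_id": "doc2", "chunk_index": 0}},
    {"text": "defg", "metadata": {"parent_id": "doc1", "chunk_index": 1}},
]

grouped = group_chunks_by_parent(hits)
doc1_text = merge_overlapping([h["text"] for h in grouped["doc1"]], overlap=1)
print(doc1_text)
```

A search usually returns only some of a document's chunks, so in practice you would either fetch the missing siblings by `parent_id` before merging or display the retrieved chunks separately in `chunk_index` order.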

EXERCISE

Take a Wikipedia article (or any long text). Chunk it using three different chunk sizes (100, 500, and 1000 characters) and index each version separately. Run the same query against each index and compare which chunk size returns the most relevant result.