How to Implement Dynamic Chunk Sizing Based on Content
Requires a custom chunking pipeline and a diverse set of document types to test against.
What this does
Dynamic chunk sizing adjusts chunk boundaries and sizes based on the structural and semantic characteristics of each document section. Dense narrative text gets larger chunks to preserve flow, while structured content such as tables, lists, and code blocks is isolated into dedicated chunks. The chunking strategy adapts to each content type instead of applying one uniform rule to every document.
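One way to make the per-type behavior explicit is a small policy table that the rest of the pipeline consults. The sketch below is illustrative only: `ChunkPolicy`, `POLICIES`, and `policy_for` are hypothetical names, the 800-token prose cap comes from this guide, and keeping lists whole is an assumption, since the guide does not give a size target for lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkPolicy:
    max_tokens: Optional[int]  # None means "no size cap, keep each unit whole"
    split_unit: str            # the boundary a split has to respect


# Hypothetical policy table; only the prose, table, and code rules come from the guide.
POLICIES = {
    "prose": ChunkPolicy(max_tokens=800, split_unit="sentence"),
    "table": ChunkPolicy(max_tokens=None, split_unit="whole"),
    "code": ChunkPolicy(max_tokens=None, split_unit="function"),
    "list": ChunkPolicy(max_tokens=None, split_unit="whole"),  # assumption
}


def policy_for(content_type: str) -> ChunkPolicy:
    # Unknown types fall back to the prose policy (an assumption of this sketch).
    return POLICIES.get(content_type, POLICIES["prose"])
```

Keeping the targets in data rather than in branching logic makes it easier to tune them per corpus without touching the chunking code.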
Steps
- Classify each document section by content type: prose, heading with short body, table, code block, or list.
- Define size targets per type: prose chunks up to 800 tokens, tables treated as single units regardless of size, and code blocks split at the function level rather than the line level.
- Implement content-type detectors using regex patterns for Markdown headings, HTML tags, or code fence markers.
- For prose sections, accumulate whole sentences into a chunk until adding the next sentence would exceed the target size, then cut at that sentence boundary.
- For tables, extract the full table content and assign it its own chunk, preserving column headers and row structure.
- For code blocks, split at function or class boundaries using a language-aware parser or pattern matching, then assign each unit a chunk.
- Evaluate chunk statistics: average tokens per type, coverage of source content (no data loss), and boundary quality for each type (no cuts in the middle of a sentence, table row, or function).
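The steps above can be sketched end to end for Markdown input. This is a minimal sketch under stated assumptions, not a definitive implementation: all function names are hypothetical, tokens are approximated by whitespace-separated words rather than a real tokenizer, fenced code is assumed to be Python (anything that fails to parse is kept whole), and a heading is simply kept with the prose that follows it.

```python
import ast
import re
from typing import List, Tuple

FENCE = re.compile(r"^\s*`{3}")
HEADING = re.compile(r"^#{1,6}\s")
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokenizer tokens.
    return len(text.split())

def segment(markdown: str) -> List[Tuple[str, str]]:
    """Group consecutive lines into (content_type, text) sections."""
    sections: List[Tuple[str, str]] = []
    kind, buf, in_fence = None, [], False

    def flush() -> None:
        nonlocal kind, buf
        if buf:
            sections.append((kind, "\n".join(buf)))
        kind, buf = None, []

    for line in markdown.splitlines():
        if FENCE.match(line):  # fence markers toggle code mode and are dropped
            flush()
            in_fence = not in_fence
            continue
        if in_fence:
            line_kind = "code"
        elif not line.strip():
            if kind != "prose":  # blank lines end tables and lists; prose keeps growing
                flush()
            continue
        elif TABLE_ROW.match(line):
            line_kind = "table"
        elif LIST_ITEM.match(line):
            line_kind = "list"
        else:
            if HEADING.match(line):  # a heading starts a new prose section
                flush()
            line_kind = "prose"
        if line_kind != kind:
            flush()
            kind = line_kind
        buf.append(line)
    flush()
    return sections

def chunk_prose(text: str, max_tokens: int = 800) -> List[str]:
    """Pack whole sentences into chunks; a single oversized sentence stays whole."""
    chunks, current, size = [], [], 0
    for sentence in SENTENCE_END.split(text.strip()):
        n = count_tokens(sentence)
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def chunk_python(source: str) -> List[str]:
    """Cut Python source at top-level def/class lines without dropping any line."""
    lines = source.splitlines()
    defs = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    starts = [min([n.lineno] + [d.lineno for d in n.decorator_list]) - 1
              for n in ast.parse(source).body if isinstance(n, defs)]
    bounds = sorted({0, *starts, len(lines)})
    pieces = ["\n".join(lines[a:b]).strip("\n") for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if p.strip()]

def chunk_document(markdown: str, max_prose_tokens: int = 800) -> List[Tuple[str, str]]:
    """Route each section to the strategy for its content type."""
    chunks: List[Tuple[str, str]] = []
    for kind, text in segment(markdown):
        if kind == "prose":
            parts = chunk_prose(text, max_prose_tokens)
        elif kind == "code":
            try:
                parts = chunk_python(text)
            except SyntaxError:  # not valid Python: keep the block whole
                parts = [text]
        else:  # tables and lists stay atomic
            parts = [text]
        chunks.extend((kind, part) for part in parts)
    return chunks
```

Two design choices are worth noting. Module-level code that precedes the first function becomes its own chunk, and statements between two functions stay attached to the earlier one, so no source lines are lost. A sentence longer than the prose target is emitted as one oversized chunk rather than being cut mid-sentence.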
Verification
Run your pipeline on a mixed document containing prose paragraphs, a Markdown table, and a Python code block. Verify that the prose chunks do not exceed 800 tokens, that the table is returned as a single chunk with headers intact, and that the code block is split at the function level (e.g., two functions produce two chunks).
Example output (exact figures depend on your test document): Prose chunks: avg 620 tokens, max 798. Table chunks: 1 chunk with 6 rows. Code chunks: 2 chunks (function 'validate_config': 45 lines, function 'run_pipeline': 82 lines).
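The checks described above can be automated against whatever your pipeline emits. The sketch below assumes chunks arrive as `(content_type, text)` tuples, which is a hypothetical shape, and that the chunker drops code fence marker lines. The helper names are illustrative, token counts are approximated by whitespace-separated words, and the code check only recognizes Python `def` lines.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]  # (content_type, text), a hypothetical chunk shape
FENCE = "`" * 3
SEPARATOR = re.compile(r"^\s*\|[\s:|-]+\|\s*$")  # Markdown header separator row


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokenizer tokens.
    return len(text.split())


def chunk_stats(chunks: List[Chunk]) -> Dict[str, Dict[str, float]]:
    """Count, average, and maximum token size per content type."""
    sizes = defaultdict(list)
    for kind, text in chunks:
        sizes[kind].append(count_tokens(text))
    return {k: {"count": len(v), "avg": mean(v), "max": max(v)} for k, v in sizes.items()}


def coverage_ok(source: str, chunks: List[Chunk]) -> bool:
    """True if the chunks reproduce every source token in order (fence lines excluded)."""
    kept = [ln for ln in source.splitlines() if not ln.lstrip().startswith(FENCE)]
    return " ".join(kept).split() == " ".join(t for _, t in chunks).split()


def find_problems(chunks: List[Chunk], max_prose_tokens: int = 800) -> List[str]:
    """Return a human-readable message for each chunk that breaks a rule."""
    problems = []
    for i, (kind, text) in enumerate(chunks):
        lines = text.splitlines()
        if kind == "prose" and count_tokens(text) > max_prose_tokens:
            problems.append(f"chunk {i}: prose exceeds {max_prose_tokens} tokens")
        if kind == "table" and (len(lines) < 2 or not SEPARATOR.match(lines[1])):
            problems.append(f"chunk {i}: table is missing its header separator row")
        if kind == "code" and sum(ln.startswith(("def ", "async def ")) for ln in lines) > 1:
            problems.append(f"chunk {i}: code chunk holds more than one top-level function")
    return problems
```

An empty list from `find_problems` together with `coverage_ok(...)` returning `True` corresponds to the pass conditions described in this section; `chunk_stats` gives the per-type averages and maxima to compare against your targets.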
Common failures
- Content type detector misclassifies mixed content: A paragraph containing a table or inline code may be misclassified. Implement a two-pass approach where structural markers are identified first, then content type is assigned to the remaining text.
- Chunk size targets create fragmentation: Aggressive size limits on prose sections produce many tiny chunks, and a hard cut forces a mid-sentence split whenever a single sentence exceeds the limit. Let an oversized sentence stand as its own chunk instead of cutting it, and enforce a minimum chunk size by merging short chunks with the preceding chunk.
- Table extraction loses formatting: Storing or splitting table chunks without their column delimiters makes the table unreadable. Always store table chunks in a delimited string format (CSV or pipe-separated) that preserves row and column relationships.
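Two of the mitigations above are small enough to sketch directly. The helpers below are hypothetical: `merge_short_chunks` uses an assumed minimum of 50 whitespace-delimited tokens, and `table_to_csv` handles only simple pipe tables, so cells containing escaped pipes are not supported.

```python
import csv
import io
import re
from typing import List

# Matches a Markdown header separator row such as |---|:---:|
SEPARATOR_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(?:\|\s*:?-+:?\s*)*\|?\s*$")


def merge_short_chunks(chunks: List[str], min_tokens: int = 50) -> List[str]:
    """Append any chunk shorter than min_tokens to the chunk before it."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)  # the first chunk is kept even if it is short
    return merged


def table_to_csv(markdown_table: str) -> str:
    """Serialize a simple Markdown pipe table as CSV, header row first."""
    rows = []
    for i, line in enumerate(markdown_table.strip().splitlines()):
        if i == 1 and SEPARATOR_ROW.match(line):
            continue  # the separator row carries no data
        rows.append([cell.strip() for cell in line.strip().strip("|").split("|")])
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Merging can push the preceding chunk past the prose size target, so it is worth choosing the minimum with that trade-off in mind. Using the `csv` module rather than a manual join means cells that contain commas or quotes are escaped correctly.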
Related guides
- use-semantic-chunking-embedding-similarity
- create-context-aware-chunks-parent