Chunking
Split documents into smaller pieces for effective vector search.
Chunking divides content into smaller pieces before it is embedded and stored in a vector database. The strategy you choose determines where chunk boundaries fall, which affects search quality and retrieval accuracy.
```python
from kern.knowledge.chunking.semantic_chunking import SemanticChunking
from kern.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(
    chunking_strategy=SemanticChunking(),
)
```
Why Chunking Matters
Consider processing a recipe book with different strategies:
| Strategy | Result |
|---|---|
| Fixed Size (5000 chars) | May split recipes mid-instruction |
| Semantic | Keeps complete recipes together based on meaning |
| Document | Each page becomes a chunk |
The right strategy returns complete, relevant results. The wrong one returns fragments.
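To make the first two rows of the table above concrete, here is a minimal sketch in plain Python that does not use the library. The `fixed_size_chunks` and `paragraph_chunks` helpers and the sample text are hypothetical, and splitting on blank lines is only a rough stand-in for semantic chunking, which relies on embeddings rather than whitespace.

```python
# Illustrative only: plain-Python stand-ins, not the library's chunkers.

def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut text every chunk_size characters, ignoring content boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def paragraph_chunks(text: str) -> list[str]:
    """Split on blank lines so each paragraph (here, a recipe) stays whole."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


recipes = (
    "Pancakes: mix flour, milk, and eggs. Fry until golden.\n\n"
    "Omelette: beat eggs with salt. Cook on low heat and fold."
)

fixed = fixed_size_chunks(recipes, chunk_size=40)
whole = paragraph_chunks(recipes)

print(repr(fixed[0]))  # stops partway through the pancake instructions
print(len(whole))      # one chunk per recipe
```

The fixed-size version cuts the first recipe in the middle of a step, so a search hit on that chunk would return an incomplete instruction. The boundary-aware version returns each recipe intact.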
Available Strategies
| Strategy | Description |
|---|---|
| Fixed Size | Split into uniform chunks by character count |
| Semantic | Split at natural breakpoints based on meaning |
| Recursive | Split using multiple separators hierarchically |
| Document | Preserve document structure (sections, pages) |
| Markdown | Split by heading structure |
| CSV Row | Each row becomes a chunk |
| Agentic | AI determines optimal boundaries |
| Code | Split at function and class boundaries using AST analysis |
| Custom | Build your own strategy |
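The idea behind the Recursive strategy listed above, splitting hierarchically with multiple separators, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `recursive_split` is a hypothetical helper, and it drops the separator wherever a chunk boundary falls.

```python
# Illustrative only: a simplified recursive splitter, not the library's code.

def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the coarsest separator first, recursing into finer
    separators only for pieces that are still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, *rest = separators
    chunks: list[str] = []
    current = ""
    for piece in text.split(head):
        candidate = current + head + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece is too large on its own: try the next separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


text = "Intro paragraph.\n\nFirst step. Second step. Third step.\n\nDone."
chunks = recursive_split(text, ["\n\n", ". ", " "], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

Paragraph breaks are tried first, so short paragraphs stay whole. Only the middle paragraph, which exceeds 30 characters, is split again at sentence boundaries.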
Using with Readers
Pass a chunking strategy to any reader:
```python
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.chunking.fixed_size_chunking import FixedSizeChunking
from kern.knowledge.reader.pdf_reader import PDFReader
from kern.vectordb.pgvector import PgVector

# Reader that splits each PDF into 3000-character chunks
reader = PDFReader(
    chunking_strategy=FixedSizeChunking(chunk_size=3000),
)

# db_url is assumed to be defined elsewhere as your PostgreSQL connection string
knowledge = Knowledge(
    vector_db=PgVector(table_name="docs", db_url=db_url),
)

# Read, chunk, embed, and store everything under documents/
knowledge.insert(path="documents/", reader=reader)
```
Choosing a Strategy
| Content Type | Recommended Strategy | Why |
|---|---|---|
| General text | Semantic | Maintains meaning and context |
| Structured docs | Document | Preserves sections and hierarchy |
| Markdown files | Markdown | Respects heading structure |
| CSV/tabular data | CSV Row | Each row is a logical unit |
| Source code | Code | Splits at function and class boundaries |
| Mixed content | Recursive | Handles multiple separator types |
| Need consistency | Fixed Size | Predictable chunk dimensions |
Each reader has a sensible default, but you can override it based on your content and retrieval needs.
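When a pipeline ingests several file types, the table above can be encoded as a small lookup. The sketch below is a hypothetical helper, not part of the library: the extension mapping and the `Recursive` fallback are assumptions chosen for illustration, and it returns strategy names as strings instead of constructing chunking objects.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the strategy names in the table.
# The extensions chosen here are assumptions for illustration.
STRATEGY_BY_EXTENSION = {
    ".md": "Markdown",
    ".csv": "CSV Row",
    ".py": "Code",
    ".pdf": "Document",
    ".txt": "Semantic",
}


def recommend_strategy(path: str, default: str = "Recursive") -> str:
    """Return a recommended strategy name for a file, based on its suffix."""
    suffix = Path(path).suffix.lower()
    return STRATEGY_BY_EXTENSION.get(suffix, default)


print(recommend_strategy("notes/README.md"))  # Markdown
print(recommend_strategy("data/orders.CSV"))  # CSV Row
print(recommend_strategy("archive.bin"))      # Recursive (fallback)
```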
Configuration
Most strategies accept configuration options:
```python
# Fixed size with overlap
FixedSizeChunking(
    chunk_size=5000,  # Characters per chunk
    overlap=200,      # Overlap between chunks
)

# Semantic with threshold
SemanticChunking(
    similarity_threshold=0.7,  # Lower = more splits
)

# Recursive with custom separators
RecursiveChunking(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=4000,
)
```
Chunk Size Guidelines
| Chunk Size | Trade-off |
|---|---|
| Small (1000-3000 chars) | More precise retrieval, may lose context |
| Default (5000 chars) | Balanced precision and context |
| Large (8000+ chars) | More context, less targeted results |
Smaller chunks work better for specific questions. Larger chunks work better when context matters.
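Chunk size also determines how many chunks are embedded and stored. The sketch below estimates the count for fixed-size chunking, assuming a sliding window in which each chunk starts `chunk_size - overlap` characters after the previous one. `estimate_chunk_count` is a hypothetical helper, and the library's actual boundaries may differ.

```python
def estimate_chunk_count(doc_length: int, chunk_size: int, overlap: int = 0) -> int:
    """Estimate the number of fixed-size chunks for a document of doc_length
    characters, assuming a sliding window with the given overlap."""
    if chunk_size <= 0 or overlap < 0 or overlap >= chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    if doc_length <= 0:
        return 0
    if doc_length <= chunk_size:
        return 1
    step = chunk_size - overlap
    # The first chunk covers chunk_size characters; every further chunk
    # advances the window by `step` characters (integer ceiling division).
    return 1 + (doc_length - chunk_size + step - 1) // step


for size in (2000, 5000, 8000):
    print(size, estimate_chunk_count(100_000, chunk_size=size, overlap=200))
# 2000 56
# 5000 21
# 8000 13
```

For a 100,000-character document, dropping from 8000 to 2000 characters per chunk roughly quadruples the number of vectors to embed, store, and search.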