Chunking

Split documents into smaller pieces for effective vector search.

Chunking divides content into smaller pieces before it is embedded and stored in a vector database. The strategy you choose affects search quality and retrieval accuracy.

from kern.knowledge.chunking.semantic_chunking import SemanticChunking
from kern.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(
    chunking_strategy=SemanticChunking(),
)

Why Chunking Matters

Consider processing a recipe book with different strategies:

| Strategy | Result |
| --- | --- |
| Fixed Size (5000 chars) | May split recipes mid-instruction |
| Semantic | Keeps complete recipes together based on meaning |
| Document | Each page becomes a chunk |

The right strategy returns complete, relevant results. The wrong one returns fragments.
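To make the difference concrete, here is a minimal sketch in plain Python. It does not use the library's chunkers: `fixed_size_chunks` and `paragraph_chunks` are hypothetical helpers, and splitting on blank lines is only a simple stand-in for boundary-aware strategies such as semantic chunking.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut text every chunk_size characters, ignoring content boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def paragraph_chunks(text: str) -> list[str]:
    """Split on blank lines so each paragraph (here, each recipe) stays whole."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


# Toy "recipe book": two recipes separated by a blank line.
recipes = (
    "Pancakes: mix flour, milk, and eggs. Fry in butter until golden.\n\n"
    "Omelette: whisk eggs with salt. Cook gently, then fold."
)

fixed = fixed_size_chunks(recipes, chunk_size=40)
by_paragraph = paragraph_chunks(recipes)

print(fixed[0])         # First fixed-size chunk stops partway through the instructions
print(by_paragraph[0])  # First paragraph chunk holds the complete pancake recipe
```

The fixed-size version cuts the pancake recipe in the middle of its instructions, while the boundary-aware version returns one complete recipe per chunk.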

Using with Readers

Pass a chunking strategy to any reader:

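from kern.knowledge.knowledge import Knowledge
from kern.knowledge.chunking.fixed_size_chunking import FixedSizeChunking
from kern.knowledge.reader.pdf_reader import PDFReader
from kern.vectordb.pgvector import PgVector

# Placeholder connection string; replace with your own database URL
db_url = "postgresql+psycopg://user:password@localhost:5432/dbname"

# The reader applies the chunking strategy to every document it parses
reader = PDFReader(
    chunking_strategy=FixedSizeChunking(chunk_size=3000),
)

knowledge = Knowledge(
    vector_db=PgVector(table_name="docs", db_url=db_url),
)

# Read, chunk, embed, and store everything under documents/
knowledge.insert(path="documents/", reader=reader)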

Choosing a Strategy

| Content Type | Recommended Strategy | Why |
| --- | --- | --- |
| General text | Semantic | Maintains meaning and context |
| Structured docs | Document | Preserves sections and hierarchy |
| Markdown files | Markdown | Respects heading structure |
| CSV/tabular data | CSV Row | Each row is a logical unit |
| Source code | Code | Splits at function and class boundaries |
| Mixed content | Recursive | Handles multiple separator types |
| Need consistency | Fixed Size | Predictable chunk dimensions |

Each reader has a sensible default, but you can override it based on your content and retrieval needs.
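One way to apply the table above in an ingestion script is a small lookup from file extension to strategy name. The sketch below is plain Python. `STRATEGY_BY_EXTENSION` and `recommend_strategy` are hypothetical names that are not part of the library, and the extension-to-strategy pairs are assumptions drawn from the table.

```python
from pathlib import Path

# Assumed mapping from file extension to the strategy names in the table above.
STRATEGY_BY_EXTENSION = {
    ".md": "Markdown",
    ".csv": "CSV Row",
    ".py": "Code",
    ".txt": "Semantic",
}


def recommend_strategy(path: str, default: str = "Recursive") -> str:
    """Return a strategy name for a file, falling back to a general-purpose default."""
    return STRATEGY_BY_EXTENSION.get(Path(path).suffix.lower(), default)


for name in ["notes/README.md", "exports/data.CSV", "report.pdf"]:
    print(name, "->", recommend_strategy(name))
```

Unknown extensions fall back to Recursive here because the table recommends it for mixed content. A different default may suit your corpus better.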

Configuration

Most strategies accept configuration options:

# Fixed size with overlap
FixedSizeChunking(
    chunk_size=5000,  # Characters per chunk
    overlap=200,      # Overlap between chunks
)

# Semantic with threshold
SemanticChunking(
    similarity_threshold=0.7,  # Lower = more splits
)

# Recursive with custom separators
RecursiveChunking(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=4000,
)
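To illustrate what an ordered separator list is for, the following plain-Python sketch splits on the coarsest separator first and falls back to finer ones only when a piece is still too large. It is an illustration of the general technique, not the library's `RecursiveChunking` implementation, and `recursive_split` is a hypothetical helper.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, *rest = separators
    chunks: list[str] = []
    current = ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate  # Piece still fits: keep growing the chunk.
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too large: retry it with the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Intro paragraph.\n\nFirst step. Second step.\nThird step."
chunks = recursive_split(text, ["\n\n", "\n", ". ", " "], chunk_size=25)
for chunk in chunks:
    print(repr(chunk))
```

Ordering the separators from paragraph breaks down to single spaces means a chunk is broken at the largest natural boundary that still fits within `chunk_size`.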

Chunk Size Guidelines

| Chunk Size | Trade-off |
| --- | --- |
| Small (1000-3000 chars) | More precise retrieval, may lose context |
| Default (5000 chars) | Balanced precision and context |
| Large (8000+ chars) | More context, less targeted results |

Smaller chunks work better for specific, fact-lookup questions. Larger chunks work better when the answer depends on surrounding context.
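Chunk size and overlap also determine how many chunks a document produces, and therefore how many embeddings are stored. The sketch below is plain Python rather than the library's `FixedSizeChunking`. `chunk_with_overlap` and `estimate_chunk_count` are hypothetical helpers, and the 50,000-character document is an arbitrary example.

```python
import math


def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size chunks where each chunk repeats the last `overlap` chars of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This chunk already reaches the end of the text.
        start += step
    return chunks


def estimate_chunk_count(n_chars: int, chunk_size: int, overlap: int) -> int:
    """Closed-form count matching chunk_with_overlap (assumes 0 <= overlap < chunk_size)."""
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    return 1 + math.ceil((n_chars - chunk_size) / (chunk_size - overlap))


document = "abcdefghij" * 5000  # 50,000 characters of filler text
for size in (2000, 5000, 8000):
    count = len(chunk_with_overlap(document, chunk_size=size, overlap=200))
    print(f"chunk_size={size}: {count} chunks")
```

Halving the chunk size roughly doubles the number of chunks to embed and search, which is worth weighing alongside retrieval precision.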
