Chunking

Split documents into smaller pieces for effective vector search.

Chunking divides content into smaller pieces before it is embedded and stored in a vector database. The strategy you choose decides where each piece begins and ends, so it directly affects search quality and retrieval accuracy.

from kern.knowledge.chunking.semantic_chunking import SemanticChunking
from kern.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(
    chunking_strategy=SemanticChunking(),
)

Why Chunking Matters

Consider processing a recipe book with different strategies:

Strategy	Result
Fixed Size (5000 chars)	May split recipes mid-instruction
Semantic	Keeps complete recipes together based on meaning
Document	Each page becomes a chunk

The right strategy returns complete, relevant results. The wrong one returns fragments.
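To make the recipe example concrete, here is a minimal sketch in plain Python. It does not use the library: `fixed_size_chunks` and `paragraph_chunks` are hypothetical helpers, and splitting on blank lines is only a simple stand-in for a meaning-aware strategy, not how semantic chunking is implemented.

```python
# Two short "recipes" separated by a blank line.
recipes = (
    "Pancakes\n"
    "Mix flour, milk, and eggs. Fry each side for two minutes.\n"
    "\n"
    "Tomato Soup\n"
    "Simmer tomatoes with onion for twenty minutes. Blend until smooth."
)


def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut the text every chunk_size characters, ignoring its content."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def paragraph_chunks(text: str) -> list[str]:
    """Split on blank lines so each recipe stays in one piece."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


fixed = fixed_size_chunks(recipes, chunk_size=40)
by_recipe = paragraph_chunks(recipes)

# Fixed-size chunks cut through sentences; paragraph chunks keep recipes whole.
for chunk in fixed:
    print("fixed:", repr(chunk))
for chunk in by_recipe:
    print("recipe:", repr(chunk))
```

The fixed-size output contains pieces that start or end mid-instruction, while the blank-line split returns one chunk per recipe.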

Available Strategies

Fixed Size

Split into uniform chunks by character count

Semantic

Split at natural breakpoints based on meaning

Recursive

Split using multiple separators hierarchically

Document

Preserve document structure (sections, pages)

Markdown

Split by heading structure

CSV Row

Each row becomes a chunk

Agentic

AI determines optimal boundaries

Code

Split at function and class boundaries using AST analysis

Custom

Build your own strategy
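The Code strategy listed above splits at function and class boundaries using AST analysis. As a rough illustration of that idea only, the sketch below uses Python's standard `ast` module on Python source. The helper `chunk_python_source` is hypothetical and is not the library's implementation, which may handle other languages, nested definitions, and module-level statements differently.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno is inclusive (Python 3.8+).
            start = node.lineno - 1
            # Keep decorators attached to the definition they decorate.
            if node.decorator_list:
                start = min(d.lineno for d in node.decorator_list) - 1
            chunks.append("\n".join(lines[start:node.end_lineno]))
    # Other module-level statements (imports, constants) are skipped in this sketch.
    return chunks


sample = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

code_chunks = chunk_python_source(sample)
for chunk in code_chunks:
    print(chunk)
    print("---")
```

Each chunk holds a complete definition, so a retrieved result never contains half a function body.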

Using with Readers

Pass a chunking strategy to any reader:

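import os

from kern.knowledge.knowledge import Knowledge
from kern.knowledge.chunking.fixed_size_chunking import FixedSizeChunking
from kern.knowledge.reader.pdf_reader import PDFReader
from kern.vectordb.pgvector import PgVector

# Assumption: the database connection string is supplied via the
# DATABASE_URL environment variable. Use whatever source fits your setup.
db_url = os.environ["DATABASE_URL"]

# The reader splits each PDF into 3000-character chunks.
reader = PDFReader(
    chunking_strategy=FixedSizeChunking(chunk_size=3000),
)

# Chunks are embedded and stored in the "docs" table.
knowledge = Knowledge(
    vector_db=PgVector(table_name="docs", db_url=db_url),
)

knowledge.insert(path="documents/", reader=reader)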

Choosing a Strategy

Content Type	Recommended Strategy	Why
General text	Semantic	Maintains meaning and context
Structured docs	Document	Preserves sections and hierarchy
Markdown files	Markdown	Respects heading structure
CSV/tabular data	CSV Row	Each row is a logical unit
Source code	Code	Splits at function and class boundaries
Mixed content	Recursive	Handles multiple separator types
Need consistency	Fixed Size	Predictable chunk dimensions

Each reader has a sensible default, but you can override it based on your content and retrieval needs.
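If you ingest many file types, the table above can be encoded as a simple lookup. The sketch below is an illustration only: `recommend_strategy` and the extension mapping are hypothetical, the returned values are the strategy names from the table rather than library objects, and pairing `.pdf` with Document or `.txt` with Semantic is an assumption about typical content.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a strategy name from the table.
STRATEGY_BY_EXTENSION = {
    ".md": "Markdown",
    ".csv": "CSV Row",
    ".py": "Code",
    ".pdf": "Document",   # assumption: PDFs are structured documents
    ".txt": "Semantic",   # assumption: plain text is general prose
}


def recommend_strategy(path: str, default: str = "Recursive") -> str:
    """Pick a strategy name by extension, falling back to Recursive for mixed content."""
    suffix = Path(path).suffix.lower()
    return STRATEGY_BY_EXTENSION.get(suffix, default)


for name in ("notes/README.md", "data.CSV", "archive.bin"):
    print(name, "->", recommend_strategy(name))
```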

Configuration

Most strategies accept configuration options:

# Fixed size with overlap
FixedSizeChunking(
    chunk_size=5000,       # Characters per chunk
    overlap=200,           # Overlap between chunks
)

# Semantic with threshold
SemanticChunking(
    similarity_threshold=0.7,  # Lower = more splits
)

# Recursive with custom separators
RecursiveChunking(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=4000,
)
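To show what an ordered separator list means in practice, here is a minimal sketch of hierarchical splitting in plain Python. `recursive_split` is a hypothetical helper written for this page, not the library's `RecursiveChunking`; it assumes the separators are tried in order, coarsest first, and that a separator falling on a chunk boundary is dropped.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text into pieces of at most chunk_size characters.

    Tries the first separator; any piece that is still too large is split
    again with the remaining separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # Separator not present: move on to the next, finer one.
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # Keep merging small pieces.
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too large on its own: split it with finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is here.\n\n"
    "Third."
)
chunks = recursive_split(text, ["\n\n", "\n", ". ", " "], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

Here the first paragraph is too long for one chunk, so it falls through to the sentence separator, while the shorter paragraphs are kept whole.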

Chunk Size Guidelines

Chunk Size	Trade-off
Small (1000-3000 chars)	More precise retrieval, may lose context
Default (5000 chars)	Balanced precision and context
Large (8000+ chars)	More context, less targeted results

Smaller chunks work better for specific questions. Larger chunks work better when context matters.
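The effect of chunk size and overlap on the number of stored chunks can be estimated with a short calculation. The sketch below is an assumption about how a character-based splitter with overlap typically behaves (each chunk starts `chunk_size - overlap` characters after the previous one); `fixed_size_with_overlap` and `expected_chunk_count` are hypothetical helpers, not library functions.

```python
import math


def fixed_size_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slice text into chunk_size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text.
    return chunks


def expected_chunk_count(length: int, chunk_size: int, overlap: int) -> int:
    """Closed-form count matching fixed_size_with_overlap."""
    if length <= chunk_size:
        return 1 if length else 0
    return math.ceil((length - overlap) / (chunk_size - overlap))


document = "x" * 20_000  # Placeholder for a 20,000-character document.
counts = {}
for size in (2000, 5000, 8000):
    counts[size] = len(fixed_size_with_overlap(document, chunk_size=size, overlap=200))
    print(size, counts[size], expected_chunk_count(len(document), size, 200))
```

With a 200-character overlap, the 20,000-character placeholder produces 11 chunks at 2000 characters, 5 at 5000, and 3 at 8000, so smaller sizes mean more, narrower results to search over.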

Next Steps

Semantic Chunking

Split content by meaning

Fixed Size Chunking

Uniform chunk sizes

Readers

Configure readers with chunking

Search & Retrieval

How chunking affects search