Document Chunking

Document chunking splits a document into smaller chunks along its structure, such as paragraphs and sections. It follows natural document boundaries rather than cutting at fixed character counts, which helps preserve semantic meaning and context when processing large documents.
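To make the idea of boundary-aware splitting concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `chunk_by_paragraphs` is hypothetical, and it assumes that blank lines mark paragraph boundaries and that sizes are measured in characters.

```python
def chunk_by_paragraphs(text: str, chunk_size: int = 5000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most chunk_size characters."""
    # Assumption: a blank line separates two paragraphs.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            # The paragraph still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A paragraph longer than chunk_size is kept whole in this sketch.
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\nSecond paragraph with details.\n\nClosing paragraph."
for chunk in chunk_by_paragraphs(sample, chunk_size=50):
    print(repr(chunk))
```

With `chunk_size=50`, the first two paragraphs fit into one chunk and the third starts a new one, so no paragraph is cut in the middle. A fixed-size splitter would break the text at character 50 regardless of where the sentence ends.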

Create a Python file named document_chunking.py

import asyncio
from kern.agent import Agent
from kern.knowledge.chunking.document import DocumentChunking
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.pdf_reader import PDFReader
from kern.vectordb.pgvector import PgVector

# Connection string for the PgVector container started in the "Run PgVector" step
db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

# Knowledge base backed by a PgVector table
knowledge = Knowledge(
    vector_db=PgVector(table_name="recipes_document_chunking", db_url=db_url),
)

# Read the PDF and insert it, using document chunking to split it
asyncio.run(knowledge.ainsert(
    url="https://kern-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    reader=PDFReader(
        name="Document Chunking Reader",
        chunking_strategy=DocumentChunking(),
    ),
))

# Agent that searches the knowledge base when answering
agent = Agent(
    knowledge=knowledge,
    search_knowledge=True,
)

agent.print_response("How to make Thai curry?", markdown=True)

Set up your virtual environment

macOS/Linux:

uv venv --python 3.12
source .venv/bin/activate

Windows:

uv venv --python 3.12
.venv\Scripts\activate

Install dependencies

uv pip install -U kern-ai sqlalchemy psycopg pgvector

Run PgVector

docker run -d \
  -e POSTGRES_DB=ai \
  -e POSTGRES_USER=ai \
  -e POSTGRES_PASSWORD=ai \
  -e PGDATA=/var/lib/postgresql/data/pgdata \
  -v pgvolume:/var/lib/postgresql/data \
  -p 5532:5432 \
  --name pgvector \
  kern/pgvector:16

Run the script

python document_chunking.py

Document Chunking Params

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `chunk_size` | `int` | `5000` | The maximum size of each chunk. |
| `overlap` | `int` | `0` | The number of characters to overlap between chunks. |
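The effect of `overlap` can be illustrated with a small standalone sketch. This is an assumption about how overlap typically works, not the library's actual code: the helper `add_overlap` is hypothetical, and it simply prefixes each chunk with the last `overlap` characters of the chunk before it.

```python
def add_overlap(chunks: list[str], overlap: int = 0) -> list[str]:
    """Prefix every chunk except the first with the tail of the previous chunk."""
    if overlap <= 0:
        # The default of 0 leaves the chunks untouched.
        return list(chunks)
    result = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        # Repeat the last `overlap` characters of the previous chunk.
        result.append(prev[-overlap:] + cur)
    return result


chunks = ["abcdefghij", "klmnopqrst", "uvwxyz"]
print(add_overlap(chunks, overlap=3))
# → ['abcdefghij', 'hijklmnopqrst', 'rstuvwxyz']
```

A non-zero overlap repeats a little context at each boundary, which can help retrieval when an answer straddles two chunks, at the cost of storing some text twice. In the script above, both values would presumably be passed as keyword arguments, for example `DocumentChunking(chunk_size=2000, overlap=200)`.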