Document Chunking
Document chunking splits a document into smaller chunks along its structure, such as paragraphs and sections. It looks for natural document boundaries rather than cutting at fixed character counts, which makes it useful for processing large documents while preserving semantic meaning and context.
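To make the idea of splitting on natural boundaries concrete, here is a minimal, self-contained sketch. It is not the library's implementation: the function name `chunk_by_paragraphs` is hypothetical, and it assumes that a blank line marks a paragraph boundary and that size is measured in characters.

```python
import re


def chunk_by_paragraphs(text: str, chunk_size: int = 5000) -> list[str]:
    """Pack whole paragraphs into chunks of at most chunk_size characters.

    Illustrative only: a hypothetical helper, not the library's algorithm.
    """
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= chunk_size:
            # The paragraph still fits, so keep growing the current chunk.
            current = candidate
        else:
            # Close the current chunk at the paragraph boundary and start a new one.
            # A single paragraph longer than chunk_size is kept whole in this sketch.
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\nSecond paragraph with more detail.\n\nClosing note."
for chunk in chunk_by_paragraphs(sample, chunk_size=40):
    print(repr(chunk))
```

Because a chunk is only ever closed at a paragraph boundary, no paragraph is cut in the middle, which is the property that fixed-size splitting does not give you.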
Create a Python file
```python
import asyncio

from kern.agent import Agent
from kern.knowledge.chunking.document import DocumentChunking
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.pdf_reader import PDFReader
from kern.vectordb.pgvector import PgVector

# Connection string for the PgVector container started in the "Run PgVector" step.
db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

# Knowledge base backed by a PgVector table.
knowledge = Knowledge(
    vector_db=PgVector(table_name="recipes_document_chunking", db_url=db_url),
)

# Read the PDF and split it with the document chunking strategy.
asyncio.run(knowledge.ainsert(
    url="https://kern-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    reader=PDFReader(
        name="Document Chunking Reader",
        chunking_strategy=DocumentChunking(),
    ),
))

# Agent that searches the knowledge base when answering.
agent = Agent(
    knowledge=knowledge,
    search_knowledge=True,
)

agent.print_response("How to make Thai curry?", markdown=True)
```

Set up your virtual environment
macOS/Linux:

```bash
uv venv --python 3.12
source .venv/bin/activate
```

Windows:

```bash
uv venv --python 3.12
.venv\Scripts\activate
```

Install dependencies
```bash
uv pip install -U kern-ai sqlalchemy psycopg pgvector
```

Run PgVector
```bash
docker run -d \
  -e POSTGRES_DB=ai \
  -e POSTGRES_USER=ai \
  -e POSTGRES_PASSWORD=ai \
  -e PGDATA=/var/lib/postgresql/data/pgdata \
  -v pgvolume:/var/lib/postgresql/data \
  -p 5532:5432 \
  --name pgvector \
  kern/pgvector:16
```

Run the script
```bash
python document_chunking.py
```

Document Chunking Params
| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | `int` | `5000` | The maximum size of each chunk. |
| `overlap` | `int` | `0` | The number of characters to overlap between chunks. |
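To illustrate what `overlap` means, the sketch below repeats the last few characters of each chunk at the start of the next one. This is an assumption about how overlap is applied, not the library's code, and `add_overlap` is a hypothetical helper written only for this example.

```python
def add_overlap(chunks: list[str], overlap: int = 0) -> list[str]:
    """Prefix each chunk after the first with the tail of the previous chunk.

    Illustrative only: a hypothetical helper, not the library's algorithm.
    """
    if overlap <= 0:
        # The default of 0 leaves the chunks untouched.
        return list(chunks)
    result = chunks[:1]
    for previous, current in zip(chunks, chunks[1:]):
        # Copy the last `overlap` characters of the previous chunk forward.
        result.append(previous[-overlap:] + current)
    return result


chunks = ["Boil the rice.", "Fry the paste.", "Add coconut milk."]
print(add_overlap(chunks, overlap=0))
print(add_overlap(chunks, overlap=6))
```

With `overlap=0` the chunks come back unchanged; with a positive value, neighbouring chunks share some text, which can help a retriever keep context that would otherwise be split across a boundary. In the script above, the values would presumably be passed as keyword arguments, for example `DocumentChunking(chunk_size=1000, overlap=100)`; the parameter names come from the table, while the keyword-argument form and the sample values are assumptions.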