Markdown Chunking

Markdown chunking splits Markdown documents into smaller chunks of at most a specified size, with optional overlap between consecutive chunks. This is useful when you want to process large documents in smaller, manageable pieces.
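To make the idea concrete, here is a minimal sketch of size-limited chunking with overlap. It is an illustration only and not the library's `MarkdownChunking` implementation: the function name `split_markdown`, the choice to split in front of headings, and the hard split for oversized sections are all assumptions made for this example.

```python
import re


def split_markdown(text: str, chunk_size: int = 5000, overlap: int = 0) -> list[str]:
    """Illustrative sketch only; the real MarkdownChunking may behave differently."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Split in front of every ATX heading ("# ", "## ", ...) so that each
    # section starts with its own heading line.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for section in sections:
        # Close the current chunk if adding this section would exceed the limit.
        if current and len(current) + len(section) > chunk_size:
            chunks.append(current)
            # Carry the tail of the closed chunk forward as overlap.
            current = current[-overlap:] if overlap else ""
        current += section
        # A single oversized section is hard-split by character count.
        while len(current) > chunk_size:
            chunks.append(current[:chunk_size])
            current = current[chunk_size - overlap:]
    if current.strip():
        chunks.append(current)
    return chunks


sample = "# A\n" + "a" * 10 + "\n## B\n" + "b" * 10 + "\n"
for chunk in split_markdown(sample, chunk_size=20, overlap=0):
    print(repr(chunk))
```

With `overlap=0` the chunks concatenate back to the original text. With a positive overlap, each chunk after the first begins with the last `overlap` characters of the chunk before it. The sketch treats any line starting with `#` as a heading, including comment lines inside code blocks, which a production splitter would need to handle.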

Create a Python file named markdown_chunking.py

import asyncio

from kern.agent import Agent
from kern.knowledge.chunking.markdown import MarkdownChunking
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.markdown_reader import MarkdownReader
from kern.vectordb.pgvector import PgVector

# Connection string for the PgVector container started in the "Run PgVector" step.
db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

# Knowledge base backed by a PgVector table.
knowledge = Knowledge(
    vector_db=PgVector(table_name="recipes_markdown_chunking", db_url=db_url),
)

# Read the README and insert it, chunked with the Markdown chunking strategy.
asyncio.run(knowledge.ainsert(
    url="https://github.com/kern-ai/kern/blob/main/README.md",
    reader=MarkdownReader(
        name="Markdown Chunking Reader",
        chunking_strategy=MarkdownChunking(),
    ),
))

# Agent that can search the knowledge base when answering.
agent = Agent(
    knowledge=knowledge,
    search_knowledge=True,
)

agent.print_response("What is Kern?", markdown=True)

Set up your virtual environment

macOS/Linux:

uv venv --python 3.12
source .venv/bin/activate

Windows:

uv venv --python 3.12
.venv\Scripts\activate

Install dependencies

uv pip install -U kern-ai sqlalchemy psycopg pgvector

Run PgVector

docker run -d \
  -e POSTGRES_DB=ai \
  -e POSTGRES_USER=ai \
  -e POSTGRES_PASSWORD=ai \
  -e PGDATA=/var/lib/postgresql/data/pgdata \
  -v pgvolume:/var/lib/postgresql/data \
  -p 5532:5432 \
  --name pgvector \
  kern/pgvector:16
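The `-p 5532:5432` flag publishes the container's port 5432 on host port 5532, which is why the script's `db_url` points at `localhost:5532`. The user, password, and database in that URL likewise have to match the `POSTGRES_*` values above. As a quick sanity check, the following sketch breaks the connection string into its parts using only the standard library. The helper name `describe_db_url` is hypothetical and not part of any library used here.

```python
from urllib.parse import urlsplit

# Same connection string as in the script.
db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"


def describe_db_url(url: str) -> dict:
    """Split a database URL into the pieces that must match the container."""
    parts = urlsplit(url)
    return {
        "driver": parts.scheme,                  # dialect+driver prefix
        "user": parts.username,                  # should match POSTGRES_USER
        "password": parts.password,              # should match POSTGRES_PASSWORD
        "host": parts.hostname,
        "port": parts.port,                      # host side of -p 5532:5432
        "database": parts.path.lstrip("/"),      # should match POSTGRES_DB
    }


settings = describe_db_url(db_url)
for key, value in settings.items():
    print(f"{key}: {value}")
```

If you change the published port or the credentials in the `docker run` command, update `db_url` in the script to match.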

Run the script

python markdown_chunking.py

Markdown Chunking Params

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `chunk_size` | `int` | `5000` | The maximum size of each chunk. |
| `overlap` | `int` | `0` | The number of characters to overlap between chunks. |
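To see how the two parameters interact, the sketch below computes chunk boundaries under a simplifying assumption: a plain sliding window measured in characters, where each new chunk starts `chunk_size - overlap` characters after the previous one. The library may place boundaries differently, so treat this as a rough estimate; `chunk_spans` is a hypothetical helper written for this example.

```python
def chunk_spans(text_length: int, chunk_size: int = 5000, overlap: int = 0) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for a plain sliding window.

    Assumption for illustration: boundaries depend only on character counts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new chunk advances
    spans: list[tuple[int, int]] = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break  # the last chunk reached the end of the text
        start += step
    return spans


print(chunk_spans(12000))               # [(0, 5000), (5000, 10000), (10000, 12000)]
print(chunk_spans(12000, overlap=500))  # [(0, 5000), (4500, 9500), (9000, 12000)]
```

Under this assumption, a larger `overlap` repeats more context across neighbouring chunks at the cost of storing more total text. Based on the table above, the values would presumably be passed as keyword arguments, for example `MarkdownChunking(chunk_size=5000, overlap=500)`.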