Markdown Chunking

Markdown chunking splits a Markdown document into chunks no larger than a specified size, with optional overlap between consecutive chunks. This is useful when a large document has to be processed in smaller, manageable pieces.
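
To make the size and overlap idea concrete, here is a minimal sketch of fixed-size splitting with overlap in plain Python. It is not the library's `MarkdownChunking` implementation; the function name `chunk_text` is hypothetical, and measuring `chunk_size` in characters is an assumption. The parameter names and defaults mirror the ones documented at the end of this page.

```python
def chunk_text(text: str, chunk_size: int = 5000, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters. Illustrative only:
    this is a plain sliding window, not kern's MarkdownChunking.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
    # ['abcd', 'defg', 'ghij']
```

With `overlap=1`, each chunk starts with the last character of the previous one, which is the behaviour the overlap setting is meant to provide.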

Create a Python file

import asyncio

from kern.agent import Agent
from kern.knowledge.chunking.markdown import MarkdownChunking
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.markdown_reader import MarkdownReader
from kern.vectordb.pgvector import PgVector

# Connection string for the PgVector container from the "Run PgVector" step
db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

knowledge = Knowledge(
    vector_db=PgVector(table_name="recipes_markdown_chunking", db_url=db_url),
)

# Read the README and insert it into the knowledge base,
# using MarkdownChunking (default settings) as the chunking strategy
asyncio.run(knowledge.ainsert(
    url="https://github.com/kern-ai/kern/blob/main/README.md",
    reader=MarkdownReader(
        name="Markdown Chunking Reader",
        chunking_strategy=MarkdownChunking(),
    ),
))

# Agent that searches the knowledge base when answering
agent = Agent(
    knowledge=knowledge,
    search_knowledge=True,
)

agent.print_response("What is Kern?", markdown=True)

Set up your virtual environment

macOS/Linux:

uv venv --python 3.12
source .venv/bin/activate

Windows:

uv venv --python 3.12
.venv\Scripts\activate

Install dependencies

uv pip install -U kern-ai sqlalchemy psycopg pgvector

Run PgVector

docker run -d \
  -e POSTGRES_DB=ai \
  -e POSTGRES_USER=ai \
  -e POSTGRES_PASSWORD=ai \
  -e PGDATA=/var/lib/postgresql/data/pgdata \
  -v pgvolume:/var/lib/postgresql/data \
  -p 5532:5432 \
  --name pgvector \
  kern/pgvector:16
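
Before running the script, it can help to confirm that the container is accepting connections on the published port. The sketch below uses only the Python standard library; the helper name `port_is_open` is hypothetical, and the default host and port are taken from the `-p 5532:5432` mapping above. It checks TCP reachability only, not that Postgres has finished starting or that the credentials are valid.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 5532, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open()` with no arguments checks `localhost:5532`; a result of `False` usually means the container is not running or the port mapping differs.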

Run the script

python markdown_chunking.py

Markdown Chunking Params

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | int | 5000 | The maximum size of each chunk. |
| overlap | int | 0 | The number of characters to overlap between chunks. |
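
When choosing values for these parameters, a rough estimate of how many chunks a document will produce can be useful. The sketch below assumes a plain sliding window over characters, where each chunk advances by `chunk_size - overlap`; the function name `estimated_chunk_count` is hypothetical, and the chunk boundaries actually produced by `MarkdownChunking` may differ.

```python
def estimated_chunk_count(doc_length: int, chunk_size: int = 5000, overlap: int = 0) -> int:
    """Estimate the number of chunks for a document of doc_length characters.

    Assumes a simple sliding window; this is an approximation, not the
    library's own logic.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    if doc_length <= 0:
        return 0
    if doc_length <= chunk_size:
        return 1
    step = chunk_size - overlap
    remaining = doc_length - chunk_size
    # One full first chunk, plus ceil(remaining / step) further chunks
    return 1 + -(-remaining // step)


if __name__ == "__main__":
    print(estimated_chunk_count(12_000))                # 3
    print(estimated_chunk_count(12_000, overlap=2000))  # 4
```

A larger overlap shortens the step between chunks, so the same document yields more chunks and more rows in the vector table.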