Semantic Chunking
Semantic chunking splits a document into smaller chunks by using embeddings to measure the semantic similarity between text segments. It uses the Chonkie library to find natural breakpoints where the meaning changes significantly, as determined by a configurable similarity threshold. Compared with fixed-size chunking, this better preserves context and meaning: semantically related content stays together in the same chunk, and splits fall at meaningful topic transitions.
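The role of the similarity threshold can be illustrated with a minimal sketch. This is not Chonkie's implementation: the helper names (`cosine`, `split_by_similarity`) are hypothetical, and the 2-D vectors are made-up stand-ins for real embeddings. It only shows the general idea of starting a new chunk when the similarity between neighbouring sentences drops below the threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_by_similarity(sentences, vectors, threshold=0.5):
    """Start a new chunk whenever neighbouring sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([sentences[i]])  # topic shift: open a new chunk
        else:
            chunks[-1].append(sentences[i])  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


# Toy vectors standing in for embeddings (assumed values, for illustration only).
sentences = [
    "Boil the rice.",
    "Rinse the rice first.",
    "Curry paste needs chilies.",
    "Add coconut milk to the paste.",
]
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

chunks = split_by_similarity(sentences, vectors, threshold=0.5)
for chunk in chunks:
    print(chunk)
```

With these toy vectors, the two rice sentences end up in one chunk and the two curry sentences in another. Lowering the threshold far enough merges everything into a single chunk, which matches the documented behaviour that lower values produce fewer, larger chunks.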
Semantic chunking supports three embedder configurations: Kern Embeddings uses a Kern Embedder, Chonkie Embeddings uses one of Chonkie's built-in embeddings handlers, and AutoEmbeddings uses Chonkie's AutoEmbeddings to select a handler automatically from the model string.
Create a Python file
Kern Embeddings (the same Kern embedder is used for chunking and for the vector database):

```python
from kern.agent import Agent
from kern.knowledge.chunking.semantic import SemanticChunking
from kern.knowledge.embedder.openai import OpenAIEmbedder
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.pdf_reader import PDFReader
from kern.vectordb.pgvector import PgVector

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

embedder = OpenAIEmbedder(id="text-embedding-3-small")

knowledge = Knowledge(
    vector_db=PgVector(
        table_name="recipes_semantic_chunking", db_url=db_url, embedder=embedder
    ),
)
knowledge.insert(
    url="https://kern-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    reader=PDFReader(
        name="Semantic Chunking Reader",
        chunking_strategy=SemanticChunking(
            embedder=embedder,  # Use same Kern embedder for chunking
            chunk_size=500,
            similarity_threshold=0.5,
            similarity_window=3,
            min_sentences_per_chunk=1,
            min_characters_per_sentence=24,
            delimiters=[". ", "! ", "? ", "\n"],
            include_delimiters="prev",
            skip_window=0,
            filter_window=5,
            filter_polyorder=3,
            filter_tolerance=0.2,
        ),
    ),
)

agent = Agent(
    knowledge=knowledge,
    search_knowledge=True,
)

agent.print_response("How to make Thai curry?", markdown=True)
```

Chonkie Embeddings (a Chonkie embedder for chunking, a Kern embedder for the vector database):

```python
from kern.agent import Agent
from kern.knowledge.chunking.semantic import SemanticChunking
from kern.knowledge.embedder.openai import OpenAIEmbedder
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.pdf_reader import PDFReader
from kern.vectordb.pgvector import PgVector
from chonkie.embeddings import OpenAIEmbeddings

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

agno_embedder = OpenAIEmbedder(id="text-embedding-3-small")  # For vector database
chonkie_embedder = OpenAIEmbeddings(
    model="text-embedding-3-small"
)  # For semantic chunking

knowledge = Knowledge(
    vector_db=PgVector(
        table_name="recipes_semantic_chunking", db_url=db_url, embedder=agno_embedder
    ),
)
knowledge.insert(
    url="https://kern-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    reader=PDFReader(
        name="Semantic Chunking Reader",
        chunking_strategy=SemanticChunking(
            embedder=chonkie_embedder,  # Use Chonkie embedder for chunking
            chunk_size=500,
            similarity_threshold=0.5,
            similarity_window=3,
            min_sentences_per_chunk=1,
            min_characters_per_sentence=24,
            delimiters=[". ", "! ", "? ", "\n"],
            include_delimiters="prev",
            skip_window=0,
            filter_window=5,
            filter_polyorder=3,
            filter_tolerance=0.2,
        ),
    ),
)

agent = Agent(
    knowledge=knowledge,
    search_knowledge=True,
)

agent.print_response("How to make Thai curry?", markdown=True)
```

AutoEmbeddings (a model string resolved by Chonkie's AutoEmbeddings):

```python
from kern.agent import Agent
from kern.knowledge.chunking.semantic import SemanticChunking
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.pdf_reader import PDFReader
from kern.vectordb.pgvector import PgVector

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

knowledge = Knowledge(
    vector_db=PgVector(table_name="recipes_semantic_chunking", db_url=db_url),
)
knowledge.insert(
    url="https://kern-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    reader=PDFReader(
        name="Semantic Chunking Reader",
        chunking_strategy=SemanticChunking(
            embedder="text-embedding-3-small",  # String model ID uses Chonkie's AutoEmbeddings
            chunk_size=500,
            similarity_threshold=0.5,
            similarity_window=3,
            min_sentences_per_chunk=1,
            min_characters_per_sentence=24,
            delimiters=[". ", "! ", "? ", "\n"],
            include_delimiters="prev",
            skip_window=0,
            filter_window=5,
            filter_polyorder=3,
            filter_tolerance=0.2,
        ),
    ),
)

agent = Agent(
    knowledge=knowledge,
    search_knowledge=True,
)

agent.print_response("How to make Thai curry?", markdown=True)
```

Set up your virtual environment
Mac/Linux:

```shell
uv venv --python 3.12
source .venv/bin/activate
```

Windows:

```shell
uv venv --python 3.12
.venv\Scripts\activate
```

Install dependencies
```shell
uv pip install -U kern-ai sqlalchemy psycopg pgvector chonkie openai
```

Set OpenAI Key
Set your OPENAI_API_KEY as an environment variable. You can get one from OpenAI.
Mac/Linux:

```shell
export OPENAI_API_KEY=sk-***
```

Windows:

```shell
setx OPENAI_API_KEY sk-***
```

Run PgVector
```shell
docker run -d \
  -e POSTGRES_DB=ai \
  -e POSTGRES_USER=ai \
  -e POSTGRES_PASSWORD=ai \
  -e PGDATA=/var/lib/postgresql/data/pgdata \
  -v pgvolume:/var/lib/postgresql/data \
  -p 5532:5432 \
  --name pgvector \
  kern/pgvector:16
```

Run the script
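Before running the script, it can help to confirm that the PgVector container is accepting connections on the port used in `db_url`. The following is a minimal sketch using only the Python standard library; the helper names `host_and_port` and `is_reachable` are hypothetical and not part of Kern. An open TCP port only suggests the container is up, it does not prove that Postgres has finished initializing.

```python
import socket
from urllib.parse import urlsplit


def host_and_port(db_url, default_port=5432):
    """Extract the host and port from a SQLAlchemy-style database URL."""
    parts = urlsplit(db_url)
    return parts.hostname, parts.port or default_port


def is_reachable(db_url, timeout=2.0):
    """Return True if a TCP connection to the database host and port succeeds."""
    host, port = host_and_port(db_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Same URL as in the example scripts above.
db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"
print(host_and_port(db_url))
print("reachable" if is_reachable(db_url) else "not reachable")
```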
```shell
python semantic_chunking.py
```

Semantic Chunking Params
| Parameter | Type | Default | Description |
|---|---|---|---|
| embedder | Union[str, Embedder, BaseEmbeddings] | OpenAIEmbedder | The embedder configuration. Can be a Kern Embedder (e.g., OpenAIEmbedder, GeminiEmbedder), a Chonkie BaseEmbeddings instance (e.g., OpenAIEmbeddings), or a string model identifier (e.g., "text-embedding-3-small") for Chonkie's AutoEmbeddings. |
| chunk_size | int | 5000 | Maximum number of tokens allowed per chunk. |
| similarity_threshold | float | 0.5 | Similarity threshold for grouping sentences (0-1). Lower values create larger groups (fewer chunks). |
| similarity_window | int | 3 | Number of sentences to consider when calculating similarity. |
| min_sentences_per_chunk | int | 1 | Minimum number of sentences per chunk. |
| min_characters_per_sentence | int | 24 | Minimum number of characters per sentence. |
| delimiters | List[str] | [". ", "! ", "? ", "\n"] | Delimiters used to split text into sentences. |
| include_delimiters | Literal["prev", "next", None] | "prev" | Whether to keep delimiters in the chunk text, and whether each delimiter is attached to the previous or the next sentence. |
| skip_window | int | 0 | Number of groups to skip when looking for similar content to merge. 0 (default) uses standard semantic grouping; higher values enable merging of non-consecutive, semantically similar groups. |
| filter_window | int | 5 | Window length for the Savitzky-Golay filter used in boundary detection. |
| filter_polyorder | int | 3 | Polynomial order for the Savitzky-Golay filter. |
| filter_tolerance | float | 0.2 | Tolerance for the Savitzky-Golay filter boundary detection. |
| chunker_params | Dict[str, Any] | None | Additional parameters to pass directly to Chonkie's SemanticChunker. |
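
The effect of `delimiters` and `include_delimiters` is easiest to see on a small input. The sketch below is an assumption-laden illustration, not Chonkie's actual splitter: `split_sentences` is a hypothetical helper that attaches each delimiter to the previous piece (`"prev"`), to the next piece (`"next"`), or drops it (`None`).

```python
def split_sentences(text, delimiters=(". ", "! ", "? ", "\n"), include_delimiters="prev"):
    """Split text on delimiters, attaching each one to the previous or next piece."""
    pieces = []
    current = ""
    i = 0
    while i < len(text):
        # Find the first delimiter that matches at position i, if any.
        match = next((d for d in delimiters if text.startswith(d, i)), None)
        if match is None:
            current += text[i]
            i += 1
            continue
        if include_delimiters == "prev":
            pieces.append(current + match)  # delimiter stays with the sentence it ends
            current = ""
        elif include_delimiters == "next":
            pieces.append(current)
            current = match  # delimiter starts the following sentence
        else:
            pieces.append(current)  # delimiter is dropped
            current = ""
        i += len(match)
    if current:
        pieces.append(current)
    return [p for p in pieces if p]


text = "Heat the oil. Add garlic! Stir well."
print(split_sentences(text, include_delimiters="prev"))
print(split_sentences(text, include_delimiters="next"))
print(split_sentences(text, include_delimiters=None))
```

Note that the final period is not followed by a space, so it does not match the `". "` delimiter and simply remains part of the last sentence.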