Code Chunking
Code chunking splits source code along its structure, using Abstract Syntax Trees (ASTs) to produce contextually relevant segments. It relies on the Chonkie library to identify natural boundaries such as functions, classes, and blocks. Because splits happen at these structural boundaries, related code stays together in the same chunk, which preserves code semantics better than fixed-size chunking.
Chunk sizes are measured with a tokenizer: pass either the name of a built-in tokenizer (for example "gpt2" or "character") or a custom Tokenizer instance.
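To make the idea of structural boundaries concrete, here is a minimal sketch that splits a small Python source string at its top-level statements. It uses Python's standard `ast` module purely for illustration; it is not how Chonkie works internally, and the `split_top_level` helper and the sample `SOURCE` are hypothetical names introduced for this example.

```python
import ast

# Hypothetical sample source used only for this illustration.
SOURCE = '''\
import os

def greet(name):
    return f"Hello, {name}"

class Counter:
    def __init__(self):
        self.n = 0

    def bump(self):
        self.n += 1
        return self.n
'''


def split_top_level(source: str) -> list[str]:
    """Return one chunk per top-level statement, using AST node line spans."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno and end_lineno are 1-based and inclusive (Python 3.8+).
        # Decorators sit above the def/class line, so include them if present.
        decorator_lines = [d.lineno for d in getattr(node, "decorator_list", [])]
        start = min([node.lineno] + decorator_lines)
        chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
    return chunks


chunks = split_top_level(SOURCE)
for index, chunk in enumerate(chunks):
    print(f"--- chunk {index} ---")
    print(chunk)
```

Each chunk holds a complete import, function, or class, so a method is never separated from the class that defines it. A fixed-size splitter applied to the same text could cut in the middle of `Counter`.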
Create a Python file
Built-in tokenizer:

    from kern.agent import Agent
    from kern.knowledge.chunking.code import CodeChunking
    from kern.knowledge.knowledge import Knowledge
    from kern.knowledge.reader.text_reader import TextReader
    from kern.vectordb.pgvector import PgVector

    db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

    knowledge = Knowledge(
        vector_db=PgVector(table_name="python_code_chunking", db_url=db_url),
    )

    knowledge.insert(
        url="https://raw.githubusercontent.com/kern-agi/kern/main/libs/kern/kern/session/workflow.py",
        reader=TextReader(
            chunking_strategy=CodeChunking(
                tokenizer="gpt2",
                chunk_size=500,
                language="python",
            ),
        ),
    )

    agent = Agent(knowledge=knowledge, search_knowledge=True)
    agent.print_response("How does the Workflow class work?", markdown=True)

Custom tokenizer:

    from typing import Sequence

    from kern.agent import Agent
    from kern.knowledge.chunking.code import CodeChunking
    from kern.knowledge.knowledge import Knowledge
    from kern.knowledge.reader.text_reader import TextReader
    from kern.vectordb.pgvector import PgVector
    from chonkie.tokenizer import Tokenizer

    db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"


    class LineTokenizer(Tokenizer):
        """Custom tokenizer that counts lines of code."""

        def __init__(self):
            self.vocab = []
            self.token2id = {}

        def __repr__(self) -> str:
            return f"LineTokenizer(vocab_size={len(self.vocab)})"

        def tokenize(self, text: str) -> Sequence[str]:
            if not text:
                return []
            return text.split("\n")

        def encode(self, text: str) -> Sequence[int]:
            encoded = []
            for token in self.tokenize(text):
                if token not in self.token2id:
                    self.token2id[token] = len(self.vocab)
                    self.vocab.append(token)
                encoded.append(self.token2id[token])
            return encoded

        def decode(self, tokens: Sequence[int]) -> str:
            try:
                return "\n".join([self.vocab[token] for token in tokens])
            except Exception as e:
                raise ValueError(
                    f"Decoding failed. Tokens: {tokens} not found in vocab."
                ) from e

        def count_tokens(self, text: str) -> int:
            if not text:
                return 0
            return len(text.split("\n"))


    knowledge = Knowledge(
        vector_db=PgVector(table_name="code_custom_tokenizer", db_url=db_url),
    )

    knowledge.insert(
        url="https://raw.githubusercontent.com/kern-agi/kern/main/libs/kern/kern/session/workflow.py",
        reader=TextReader(
            chunking_strategy=CodeChunking(
                tokenizer=LineTokenizer(),
                chunk_size=500,
                language="python",
            ),
        ),
    )

    agent = Agent(knowledge=knowledge, search_knowledge=True)
    agent.print_response("How does the Workflow class work?", markdown=True)

Set up your virtual environment

Mac/Linux:

    uv venv --python 3.12
    source .venv/bin/activate

Windows:

    uv venv --python 3.12
    .venv\Scripts\activate

Install dependencies


    uv pip install -U kern-ai sqlalchemy psycopg pgvector "chonkie[code]" openai

Set OpenAI Key
Set your OPENAI_API_KEY as an environment variable. You can get one from OpenAI.
Mac/Linux:

    export OPENAI_API_KEY=sk-***

Windows:

    setx OPENAI_API_KEY sk-***

Run PgVector


    docker run -d \
      -e POSTGRES_DB=ai \
      -e POSTGRES_USER=ai \
      -e POSTGRES_PASSWORD=ai \
      -e PGDATA=/var/lib/postgresql/data/pgdata \
      -v pgvolume:/var/lib/postgresql/data \
      -p 5532:5432 \
      --name pgvector \
      kern/pgvector:16

Run the script


    python code_chunking.py

Code Chunking Params

| Parameter | Type | Default | Description |
|---|---|---|---|
| tokenizer | Union[str, TokenizerProtocol] | "character" | The tokenizer for measuring chunk sizes. Supports several built-in tokenizers or a custom Tokenizer instance. |
| chunk_size | int | 2048 | Maximum size of each chunk in tokens (based on the selected tokenizer). |
| language | Union[Literal["auto"], Any] | "auto" | The programming language to parse. Use "auto" for automatic detection or specify a tree-sitter language name (e.g., "python", "javascript", "go", "rust"). |
| include_nodes | bool | False | Whether to include AST nodes. Note: Chonkie's base Chunk type does not store node information. |
| chunker_params | Optional[Dict[str, Any]] | None | Additional parameters to pass directly to Chonkie's CodeChunker. |
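
Because chunk_size is measured in the units of the selected tokenizer, the same number can mean very different chunk lengths. The sketch below illustrates this with two hypothetical counting functions (characters and lines) and a simple greedy `pack` helper. These names are assumptions made for the example, and the packing logic is a simplification rather than Chonkie's actual algorithm.

```python
from typing import Callable


def count_characters(text: str) -> int:
    """Size of the text when every character is one token."""
    return len(text)


def count_lines(text: str) -> int:
    """Size of the text when every line is one token."""
    return len(text.split("\n")) if text else 0


def pack(units: list[str], chunk_size: int, count: Callable[[str], int]) -> list[str]:
    """Greedily merge consecutive units while the merged text fits in chunk_size."""
    chunks: list[str] = []
    current = ""
    for unit in units:
        candidate = unit if not current else current + "\n" + unit
        if current and count(candidate) > chunk_size:
            # Adding this unit would exceed the budget, so close the chunk.
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Three small structural units, for example three top-level functions.
units = [
    "def a():\n    return 1",
    "def b():\n    return 2",
    "def c():\n    return 3",
]

# The same chunk_size value, measured in two different units.
by_chars = pack(units, chunk_size=50, count=count_characters)
by_lines = pack(units, chunk_size=50, count=count_lines)

print(len(by_chars), "chunks when counting characters")
print(len(by_lines), "chunks when counting lines")
```

With a budget of 50, the character count splits the three functions into two chunks, while the line count keeps all six lines together in one. When switching tokenizers, it is worth revisiting chunk_size rather than reusing the old value.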