CSV Row Chunking

CSV row chunking splits a CSV file into chunks by number of rows rather than by character count. It is particularly useful for structured data that you want to process in manageable, row-based chunks while keeping each record intact.
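To make the difference from character-based splitting concrete, here is a minimal sketch using only the Python standard library. It is not kern's implementation: `chunk_by_chars`, `chunk_by_rows`, and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import io


def chunk_by_chars(text: str, size: int) -> list[str]:
    """Character-based splitting: cuts wherever the size limit falls."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_rows(text: str, rows_per_chunk: int) -> list[list[list[str]]]:
    """Row-based splitting: group parsed records, never cutting inside one."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    # csv.reader understands quoting, so commas inside quoted fields
    # stay part of a single field.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    return [rows[i:i + rows_per_chunk] for i in range(0, len(rows), rows_per_chunk)]


# Illustrative sample data (assumption, not taken from the real dataset file).
sample = (
    "title,genre\n"
    '"Guardians of the Galaxy","Action,Adventure,Sci-Fi"\n'
    'Prometheus,"Adventure,Mystery,Sci-Fi"\n'
    'Split,"Horror,Thriller"\n'
)

char_chunks = chunk_by_chars(sample, 40)   # first chunk ends mid-record
row_chunks = chunk_by_rows(sample, 2)      # every chunk holds whole records

for chunk in row_chunks:
    print(chunk)
```

With character-based splitting the first chunk stops partway through the second line, while the row-based version always returns complete records.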

Create a Python file named csv_row_chunking.py

import asyncio
from kern.agent import Agent
from kern.knowledge.chunking.row import RowChunking
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.csv_reader import CSVReader
from kern.vectordb.pgvector import PgVector

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

knowledge_base = Knowledge(
    vector_db=PgVector(table_name="imdb_movies_row_chunking", db_url=db_url),
)

asyncio.run(knowledge_base.ainsert(
    url="https://kern-public.s3.amazonaws.com/demo_data/IMDB-Movie-Data.csv",
    reader=CSVReader(
        chunking_strategy=RowChunking(),
    ),
))

# Initialize the Agent with the knowledge_base
agent = Agent(
    knowledge=knowledge_base,
    search_knowledge=True,
)

# Use the agent
agent.print_response("Tell me about the movie Guardians of the Galaxy", markdown=True)

Set up your virtual environment

macOS/Linux:

uv venv --python 3.12
source .venv/bin/activate

Windows:

uv venv --python 3.12
.venv\Scripts\activate

Install dependencies

uv pip install -U kern-ai sqlalchemy psycopg pgvector

Run PgVector

docker run -d \
  -e POSTGRES_DB=ai \
  -e POSTGRES_USER=ai \
  -e POSTGRES_PASSWORD=ai \
  -e PGDATA=/var/lib/postgresql/data/pgdata \
  -v pgvolume:/var/lib/postgresql/data \
  -p 5532:5432 \
  --name pgvector \
  kern/pgvector:16
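The container publishes Postgres on host port 5532, which is the port used in `db_url`. As an optional check before running the script, the following sketch tests whether that port accepts TCP connections. The helper name `port_is_open` is hypothetical, and an open port only shows that something is listening, not that the database has finished starting.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # 5532 is the host side of the "-p 5532:5432" mapping above.
    print("PgVector port reachable:", port_is_open("localhost", 5532))
```

If this prints `False`, `docker ps` and `docker logs pgvector` are the usual places to look next.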

Run the script

python csv_row_chunking.py

CSV Row Chunking Params

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| rows_per_chunk | int | 100 | The number of rows to include in each chunk. |
| skip_header | bool | False | Whether to skip the header row when chunking. |
| clean_rows | bool | True | Whether to clean and normalize row data. |
| include_header_in_chunks | bool | False | Whether to include the header row in each chunk. |
| max_chunk_size | int | 5000 | Maximum character size for each chunk (fallback limit). |
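The sketch below shows one plausible reading of how these parameters could interact. It is not kern's implementation: the function `chunk_csv`, the helper `_clean`, the comma-space row format, and the exact behavior given to `clean_rows` (whitespace normalization) and `max_chunk_size` (closing a chunk early) are all assumptions made for illustration.

```python
import csv
import io


def _clean(field: str) -> str:
    """Assumed meaning of clean_rows: trim and collapse whitespace."""
    return " ".join(field.split())


def chunk_csv(
    csv_text: str,
    rows_per_chunk: int = 100,
    skip_header: bool = False,
    clean_rows: bool = True,
    include_header_in_chunks: bool = False,
    max_chunk_size: int = 5000,
) -> list[str]:
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if not rows:
        return []

    header = rows[0]
    # The header leaves the data rows when it is skipped or repeated per chunk;
    # otherwise it is treated as an ordinary first row.
    data = rows[1:] if (skip_header or include_header_in_chunks) else rows
    if clean_rows:
        header = [_clean(f) for f in header]
        data = [[_clean(f) for f in row] for row in data]

    prefix = [", ".join(header)] if include_header_in_chunks else []
    chunks: list[str] = []
    current = list(prefix)
    count = 0  # data rows in the current chunk
    for row in data:
        line = ", ".join(row)
        too_long = len("\n".join(current + [line])) > max_chunk_size
        # Close the chunk when it is full, or when adding this row would
        # exceed the character limit (assumed fallback behavior).
        if count > 0 and (count >= rows_per_chunk or too_long):
            chunks.append("\n".join(current))
            current = list(prefix)
            count = 0
        current.append(line)
        count += 1
    if count > 0:
        chunks.append("\n".join(current))
    return chunks


sample = "title , rating\nA,1\nB,2\nC,3\n"
with_header = chunk_csv(sample, rows_per_chunk=2, include_header_in_chunks=True)
without_header = chunk_csv(sample, rows_per_chunk=2, skip_header=True)
size_limited = chunk_csv(sample, skip_header=True, max_chunk_size=8)
```

In this sketch a single row longer than `max_chunk_size` still becomes its own chunk rather than being cut, which keeps the record-integrity property of row chunking.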