CSV Row Chunking
CSV row chunking splits a CSV file into smaller chunks by row count rather than by character count. It is particularly useful for structured data: files are processed in manageable, row-based chunks, and the integrity of individual records is preserved.
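To make the difference concrete, here is a minimal sketch in plain Python that contrasts the two approaches. It is not the library's implementation: `chunk_by_rows` and `chunk_by_chars` are hypothetical helper names, the sample data is made up, and the header is treated as an ordinary row for simplicity.

```python
import csv
import io


def chunk_by_rows(csv_text: str, rows_per_chunk: int) -> list[list[list[str]]]:
    """Group parsed CSV records into chunks of at most rows_per_chunk rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [rows[i:i + rows_per_chunk] for i in range(0, len(rows), rows_per_chunk)]


def chunk_by_chars(text: str, chunk_size: int) -> list[str]:
    """Cut the raw text every chunk_size characters, ignoring record boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Made-up sample data, used only for illustration.
sample = "title,year\nAlien,1979\nHeat,1995\nUp,2009\n"

row_chunks = chunk_by_rows(sample, rows_per_chunk=2)
char_chunks = chunk_by_chars(sample, chunk_size=16)

print(row_chunks[0])   # [['title', 'year'], ['Alien', '1979']]
print(char_chunks[0])  # 'title,year\nAlien' (the "Alien,1979" record is cut in half)
```

Every row-based chunk contains only complete records, whereas the character-based split cuts a record across two chunks.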
Create a Python file named csv_row_chunking.py
    import asyncio
    from kern.agent import Agent
    from kern.knowledge.chunking.row import RowChunking
    from kern.knowledge.knowledge import Knowledge
    from kern.knowledge.reader.csv_reader import CSVReader
    from kern.vectordb.pgvector import PgVector

    db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

    knowledge_base = Knowledge(
        vector_db=PgVector(table_name="imdb_movies_row_chunking", db_url=db_url),
    )

    asyncio.run(knowledge_base.ainsert(
        url="https://kern-public.s3.amazonaws.com/demo_data/IMDB-Movie-Data.csv",
        reader=CSVReader(
            chunking_strategy=RowChunking(),
        ),
    ))

    # Initialize the Agent with the knowledge_base
    agent = Agent(
        knowledge=knowledge_base,
        search_knowledge=True,
    )

    # Use the agent
    agent.print_response("Tell me about the movie Guardians of the Galaxy", markdown=True)

Set up your virtual environment
macOS/Linux:

    uv venv --python 3.12
    source .venv/bin/activate

Windows:

    uv venv --python 3.12
    .venv\Scripts\activate

Install dependencies
    uv pip install -U kern-ai sqlalchemy psycopg pgvector

Run PgVector
    docker run -d \
      -e POSTGRES_DB=ai \
      -e POSTGRES_USER=ai \
      -e POSTGRES_PASSWORD=ai \
      -e PGDATA=/var/lib/postgresql/data/pgdata \
      -v pgvolume:/var/lib/postgresql/data \
      -p 5532:5432 \
      --name pgvector \
      kern/pgvector:16

Run the script
    python csv_row_chunking.py

CSV Row Chunking Params
| Parameter | Type | Default | Description |
|---|---|---|---|
| rows_per_chunk | int | 100 | The number of rows to include in each chunk. |
| skip_header | bool | False | Whether to skip the header row when chunking. |
| clean_rows | bool | True | Whether to clean and normalize row data. |
| include_header_in_chunks | bool | False | Whether to include the header row in each chunk. |
| max_chunk_size | int | 5000 | Maximum character size for each chunk (fallback limit). |
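
The sketch below illustrates one plausible reading of how these parameters could interact. It is a hypothetical, standard-library re-implementation for illustration only, not the code behind `RowChunking`. The function name `chunk_rows`, the meaning given to "clean" (trim cell whitespace, drop empty rows), and the way `max_chunk_size` closes a chunk early are all assumptions. Rows are re-joined with plain commas, so CSV quoting is not preserved.

```python
import csv
import io


def chunk_rows(
    csv_text: str,
    rows_per_chunk: int = 100,
    skip_header: bool = False,
    clean_rows: bool = True,
    include_header_in_chunks: bool = False,
    max_chunk_size: int = 5000,
) -> list[str]:
    """Hypothetical illustration of the parameters in the table above."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if clean_rows:
        # Assumed meaning of "clean": trim cell whitespace and drop empty rows.
        rows = [[cell.strip() for cell in row] for row in rows]
        rows = [row for row in rows if any(row)]
    if not rows:
        return []

    header = ",".join(rows[0])
    # Assumption: the header leaves the data rows when it is either skipped
    # or repeated at the top of every chunk.
    data = rows[1:] if (skip_header or include_header_in_chunks) else rows

    chunks: list[str] = []
    current: list[str] = []

    def flush() -> None:
        """Emit the rows collected so far as one chunk."""
        if current:
            prefix = [header] if include_header_in_chunks else []
            chunks.append("\n".join(prefix + current))
            current.clear()

    for row in data:
        line = ",".join(row)
        # Fallback limit: close the chunk early if adding this row would push
        # the row text past max_chunk_size characters.
        if current and len("\n".join(current + [line])) > max_chunk_size:
            flush()
        current.append(line)
        if len(current) == rows_per_chunk:
            flush()
    flush()
    return chunks


# Made-up sample with stray whitespace and a blank line.
sample = "title,year\n Alien ,1979\nHeat,1995\n\nUp,2009\n"

print(chunk_rows(sample, rows_per_chunk=2, include_header_in_chunks=True))
# ['title,year\nAlien,1979\nHeat,1995', 'title,year\nUp,2009']
```

Under these assumptions, repeating the header keeps each chunk self-describing, and the character limit only takes effect when a chunk of `rows_per_chunk` rows would otherwise be too large.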