Readers

Convert files, URLs, and text into searchable documents.

Readers transform raw content into Document objects that can be chunked, embedded, and stored in your knowledge base. Each reader handles a specific format (PDF, CSV, Markdown, etc.) and extracts text and metadata.

from kern.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(chunk=True, chunk_size=5000)
documents = reader.read("company_handbook.pdf")

How Readers Work

Parse: Read the raw content using format-specific logic
Extract: Pull out text and metadata (page numbers, authors, etc.)
Chunk: Split large content into smaller pieces (if enabled)
Return: Provide a list of Document objects ready for embedding

# Output structure
Document(
    content="The extracted text...",
    id="unique_id",
    name="document_name",
    meta_data={"page": 1, "source": "handbook.pdf"},
)
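The four steps can be sketched in plain Python. This is a toy illustration, not Kern's implementation: the `Document` dataclass below is a hypothetical stand-in that only mirrors the fields shown above, `read_text` is an invented helper, and chunking is assumed to be simple fixed-size character splitting.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for Kern's Document, for illustration only."""

    content: str
    id: str
    name: str
    meta_data: Dict[str, Any] = field(default_factory=dict)


def read_text(raw: str, name: str, chunk: bool = True, chunk_size: int = 5000) -> List[Document]:
    """Toy reader: parse, extract, optionally chunk, and return documents."""
    # Parse: for plain text, parsing is just normalizing line endings.
    text = raw.replace("\r\n", "\n").strip()
    if not text:
        return []  # Nothing usable was found, so return an empty list.

    # Extract: collect metadata alongside the text.
    meta = {"source": name, "characters": len(text)}

    # Chunk: split into fixed-size character windows if enabled.
    if chunk:
        pieces = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    else:
        pieces = [text]

    # Return: one Document per piece, each with its own id.
    return [
        Document(content=piece, id=f"{name}_{n}", name=name, meta_data={**meta, "chunk": n})
        for n, piece in enumerate(pieces, start=1)
    ]


docs = read_text("a" * 12, name="notes.txt", chunk_size=5)
print([d.content for d in docs])  # ['aaaaa', 'aaaaa', 'aa']
```

Real readers add format-specific parsing (PDF pages, CSV rows, and so on) in place of the first step, but the overall shape of the result is the same: a list of documents with text and metadata.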

Supported Readers

Reader	Description
`PDFReader`	Extract text from PDF files
`DoclingReader`	Process multiple formats via Docling
`TextReader`	Plain text files
`MarkdownReader`	Markdown files
`CSVReader`	CSV files (rows become documents)
`FieldLabeledCSVReader`	CSV rows as field-labeled text
`JSONReader`	JSON files
`PPTXReader`	PowerPoint presentations
`ArxivReader`	Academic papers from arXiv
`WikipediaReader`	Wikipedia articles
`YouTubeReader`	YouTube transcripts
`WebsiteReader`	Crawl websites recursively
`WebSearchReader`	Web search results
`FirecrawlReader`	Web scraping via Firecrawl API
`LLMsTxtReader`	Read `llms.txt` files

Using Readers with Knowledge

Pass a reader to knowledge.insert() to override automatic format detection:

from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.pdf_reader import PDFReader

# vector_db is assumed to be an already configured vector database instance
knowledge = Knowledge(vector_db=vector_db)

# Use custom reader configuration
reader = PDFReader(chunk_size=3000, split_on_pages=True)
knowledge.insert(path="documents/", reader=reader)

Auto-Selection

Kern automatically selects the right reader based on file extension or URL:

from kern.knowledge.reader.reader_factory import ReaderFactory

# By file extension
reader = ReaderFactory.get_reader_for_extension(".pdf")  # PDFReader
reader = ReaderFactory.get_reader_for_extension(".csv")  # CSVReader

# By URL
reader = ReaderFactory.get_reader_for_url("https://youtube.com/watch?v=...")  # YouTubeReader

knowledge.insert() performs this selection for you whenever you do not pass a reader.
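To make the idea of extension- and URL-based selection concrete, here is a minimal sketch of such a dispatch. It is not the ReaderFactory implementation: the lookup tables, the `pick_reader` function, and the fallback to `WebsiteReader` for unrecognized URLs are all assumptions made for illustration, and the function returns reader names as strings rather than reader instances.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup tables; the real factory's mappings may differ.
EXTENSION_TO_READER = {
    ".pdf": "PDFReader",
    ".csv": "CSVReader",
    ".md": "MarkdownReader",
    ".txt": "TextReader",
    ".json": "JSONReader",
    ".pptx": "PPTXReader",
}
HOST_TO_READER = {
    "youtube.com": "YouTubeReader",
    "www.youtube.com": "YouTubeReader",
    "arxiv.org": "ArxivReader",
}


def pick_reader(source: str) -> Optional[str]:
    """Return the name of a reader for a path or URL, or None if unknown."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # URLs: try the hostname first, then the extension of the URL path.
        host = (parsed.hostname or "").lower()
        if host in HOST_TO_READER:
            return HOST_TO_READER[host]
        extension = Path(parsed.path).suffix.lower()
        # Assumed fallback: treat any other URL as a generic web page.
        return EXTENSION_TO_READER.get(extension, "WebsiteReader")
    # Local paths: dispatch on the (case-insensitive) file extension.
    return EXTENSION_TO_READER.get(Path(source).suffix.lower())


print(pick_reader("handbook.PDF"))
print(pick_reader("https://youtube.com/watch?v=abc"))
```

Returning None for unknown extensions leaves the decision to the caller, who can then pass an explicit reader instead.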

Configuration

Chunking

from kern.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(
    chunk=True,           # Enable chunking (default: True)
    chunk_size=5000,      # Characters per chunk
)
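As a rough guide to choosing chunk_size, the number of chunks can be estimated from the document length. The helper below is hypothetical and assumes plain fixed-size character splitting with no overlap; other chunking strategies may produce different counts.

```python
import math


def estimated_chunks(num_characters: int, chunk_size: int = 5000) -> int:
    """Estimate the chunk count for fixed-size, non-overlapping splitting."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(num_characters / chunk_size)


print(estimated_chunks(12_000))        # 3
print(estimated_chunks(12_000, 3000))  # 4
```

Smaller chunks mean more documents to embed and store, so the estimate is a quick way to gauge the cost of a given setting.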

Format-Specific Options

# PDF with encryption and OCR
reader = PDFReader(
    password="secret",
    read_images=True,     # OCR for images
    split_on_pages=True,  # One document per page
)

# CSV with custom encoding
reader = CSVReader(
    encoding="latin-1",
)

# Text with encoding override
reader = TextReader(
    encoding="utf-8",
)

Runtime Options

Override settings when calling read():

documents = reader.read(
    "file.pdf",
    name="custom_document_name",  # Override default naming
    password="runtime_password",  # Password at read time
)

Async Processing

All readers provide an async_read() method, so I/O-bound reads can run concurrently instead of one after another:

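import asyncio

# reader and files are assumed to be defined elsewhere
async def main():
    # Single file
    documents = await reader.async_read("file.pdf")

    # Batch processing: read all files concurrently
    tasks = [reader.async_read(file) for file in files]
    all_documents = await asyncio.gather(*tasks)

asyncio.run(main())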
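For large batches, it can help to cap how many reads run at once. The sketch below shows one way to do that with asyncio.Semaphore. `FakeReader` is a hypothetical stand-in that simulates I/O with a short sleep, and `read_all` with its `limit` parameter is an invented helper; neither is part of Kern.

```python
import asyncio
from typing import Dict, List


class FakeReader:
    """Hypothetical stand-in for a reader; simulates I/O with a short sleep."""

    async def async_read(self, path: str) -> List[str]:
        await asyncio.sleep(0.01)
        # Mimic the "empty list on failure" convention for unreadable files.
        return [] if path.endswith(".bad") else [f"contents of {path}"]


async def read_all(reader, files: List[str], limit: int = 4) -> Dict[str, List[str]]:
    """Read files concurrently, with at most `limit` reads in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def read_one(path: str) -> List[str]:
        async with semaphore:
            return await reader.async_read(path)

    # gather() returns results in the same order as the inputs.
    results = await asyncio.gather(*(read_one(path) for path in files))
    return dict(zip(files, results))


results = asyncio.run(read_all(FakeReader(), ["a.pdf", "b.pdf", "c.bad"], limit=2))
print(results)
```

Keying the results by file name makes it easy to see which inputs produced no documents.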

Custom Chunking Strategy

Override the default chunking behavior:

from kern.knowledge.chunking.semantic_chunking import SemanticChunking

reader = PDFReader(
    chunk=True,
    chunking_strategy=SemanticChunking(),
)

See Chunking for available strategies.

Restricting URL Fetches

By default, a URL-fetching reader will fetch any URL passed to it. Use allowed_hosts to restrict the reader to a fixed hostname allowlist. URLs outside the list are skipped and return no documents. Matching is case-insensitive and applies to the whole hostname, so list every subdomain you want to permit.

reader = WebsiteReader(allowed_hosts=["kern.ndx.rocks"])

WebsiteReader, WebSearchReader, and LLMsTxtReader also re-check the allowlist on each redirect, so an allowed host can't redirect to a blocked one. FirecrawlReader and DoclingReader validate the initial URL only.
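The matching rule described above (case-insensitive, whole hostname, no implicit subdomains) can be sketched with the standard library. This illustrates the rule only and is not Kern's code; `is_allowed` and `redirect_chain_allowed` are hypothetical helpers.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def is_allowed(url: str, allowed_hosts: Iterable[str]) -> bool:
    """Return True only if the URL's full hostname is in the allowlist."""
    host = (urlparse(url).hostname or "").lower()
    allowed = {h.lower() for h in allowed_hosts}
    return host in allowed


def redirect_chain_allowed(urls: List[str], allowed_hosts: Iterable[str]) -> bool:
    """Check every hop of a redirect chain, not just the initial URL."""
    allowed = list(allowed_hosts)
    return all(is_allowed(url, allowed) for url in urls)


hosts = ["kern.ndx.rocks"]
print(is_allowed("https://KERN.ndx.rocks/docs", hosts))   # True
print(is_allowed("https://docs.kern.ndx.rocks/", hosts))  # False
```

Because the comparison is on the whole hostname, a subdomain such as docs.kern.ndx.rocks must be listed explicitly, and a path that merely contains an allowed name does not match.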

Error Handling

Readers return an empty list when processing fails. Check logs for debugging information:

documents = reader.read("corrupted.pdf")
if not documents:
    print("Failed to read file, check logs for details")
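When reading many files, the same empty-list check can be used to separate successes from failures. The sketch below assumes only that a read function returns an empty list on failure, as described above; `read_many` and `fake_read` are hypothetical helpers, and in practice you would pass `reader.read` in place of `fake_read`.

```python
from typing import Callable, Dict, List, Tuple


def read_many(read: Callable[[str], list], paths: List[str]) -> Tuple[Dict[str, list], List[str]]:
    """Read each path and split the results into loaded and failed."""
    loaded: Dict[str, list] = {}
    failed: List[str] = []
    for path in paths:
        documents = read(path)
        if documents:
            loaded[path] = documents
        else:
            # An empty list signals a failed read; details are in the logs.
            failed.append(path)
    return loaded, failed


def fake_read(path: str) -> list:
    """Stand-in for reader.read that fails on paths containing 'corrupted'."""
    return [] if "corrupted" in path else [f"document from {path}"]


loaded, failed = read_many(fake_read, ["ok.pdf", "corrupted.pdf"])
print("Loaded:", list(loaded))
print("Failed:", failed)
```

Collecting the failed paths in one place makes it straightforward to retry them or report them after the batch finishes.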

Next Steps

PDF Reader: Extract text from PDFs
Website Reader: Crawl and index websites
Chunking: Control how content is split
Vector DB: Store processed documents