Readers

Convert files, URLs, and text into searchable documents.

Readers transform raw content into Document objects that can be chunked, embedded, and stored in your knowledge base. Each reader handles a specific format (PDF, CSV, Markdown, etc.) and extracts text and metadata.

from kern.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(chunk=True, chunk_size=5000)
documents = reader.read("company_handbook.pdf")

How Readers Work

  1. Parse: Read the raw content using format-specific logic
  2. Extract: Pull out text and metadata (page numbers, authors, etc.)
  3. Chunk: Split large content into smaller pieces (if enabled)
  4. Return: Provide a list of Document objects ready for embedding

# Output structure
Document(
    content="The extracted text...",
    id="unique_id",
    name="document_name",
    meta_data={"page": 1, "source": "handbook.pdf"},
)
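
The four steps above can be sketched with a toy reader. This is a minimal illustration, not Kern's implementation: the `Document` dataclass only mirrors the fields shown in the output structure, and `ToyTextReader`, its form-feed page convention, and its id scheme are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Stand-in mirroring the fields shown above (not Kern's class)."""
    content: str
    id: str
    name: str
    meta_data: Dict[str, object] = field(default_factory=dict)


class ToyTextReader:
    """Hypothetical reader that treats form feed characters as page breaks."""

    def __init__(self, chunk: bool = True, chunk_size: int = 5000) -> None:
        self.chunk = chunk
        self.chunk_size = chunk_size

    def read(self, raw: str, name: str) -> List[Document]:
        documents: List[Document] = []
        # 1. Parse: split the raw content into pages.
        pages = raw.split("\f")
        for page_number, page in enumerate(pages, start=1):
            # 2. Extract: keep the text; the page number becomes metadata.
            text = page.strip()
            if not text:
                continue
            # 3. Chunk: cut the page into fixed-size character windows.
            if self.chunk:
                pieces = [
                    text[start:start + self.chunk_size]
                    for start in range(0, len(text), self.chunk_size)
                ]
            else:
                pieces = [text]
            for chunk_number, piece in enumerate(pieces, start=1):
                documents.append(
                    Document(
                        content=piece,
                        id=f"{name}_{page_number}_{chunk_number}",
                        name=name,
                        meta_data={"page": page_number, "chunk": chunk_number},
                    )
                )
        # 4. Return: a flat list of documents ready for embedding.
        return documents


reader = ToyTextReader(chunk=True, chunk_size=10)
docs = reader.read("first page text\fsecond", name="handbook")
for doc in docs:
    print(doc.id, repr(doc.content), doc.meta_data)
```

With `chunk_size=10`, the 15-character first page is split into two documents and the second page stays whole, so each returned document carries the page it came from.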

Supported Readers

| Reader | Description |
| --- | --- |
| PDFReader | Extract text from PDF files |
| DoclingReader | Process multiple formats via Docling |
| TextReader | Plain text files |
| MarkdownReader | Markdown files |
| CSVReader | CSV files (rows become documents) |
| FieldLabeledCSVReader | CSV rows as field-labeled text |
| JSONReader | JSON files |
| PPTXReader | PowerPoint presentations |
| ArxivReader | Academic papers from arXiv |
| WikipediaReader | Wikipedia articles |
| YouTubeReader | YouTube transcripts |
| WebsiteReader | Crawl websites recursively |
| WebSearchReader | Web search results |
| FirecrawlReader | Web scraping via Firecrawl API |
| LLMsTxtReader | Read llms.txt files |
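
To make the difference between the two CSV readers more concrete, here is a sketch that uses only the standard library. It assumes that "rows become documents" means one text per row, and that "field-labeled" means each value is prefixed with its column header. The exact output of CSVReader and FieldLabeledCSVReader may differ, and both function names below are hypothetical.

```python
import csv
import io
from typing import List


def rows_as_plain_text(csv_text: str) -> List[str]:
    """One text per data row, with values joined by commas (assumed format)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return [", ".join(row) for row in reader]


def rows_as_labeled_text(csv_text: str) -> List[str]:
    """One text per data row, each value labeled with its column name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


sample = "name,role\nAda,Engineer\nGrace,Admiral\n"
print(rows_as_plain_text(sample))
print(rows_as_labeled_text(sample))
```

Labeled rows keep the column names next to the values, which can help retrieval when a bare value such as "Admiral" would be ambiguous on its own.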

Using Readers with Knowledge

Pass a reader to knowledge.insert() to override automatic format detection:

from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.pdf_reader import PDFReader

# vector_db is assumed to be an already configured vector database instance
knowledge = Knowledge(vector_db=vector_db)

# Use custom reader configuration
reader = PDFReader(chunk_size=3000, split_on_pages=True)
knowledge.insert(path="documents/", reader=reader)

Auto-Selection

Kern automatically selects the right reader based on file extension or URL:

from kern.knowledge.reader.reader_factory import ReaderFactory

# By file extension
reader = ReaderFactory.get_reader_for_extension(".pdf")  # PDFReader
reader = ReaderFactory.get_reader_for_extension(".csv")  # CSVReader

# By URL
reader = ReaderFactory.get_reader_for_url("https://youtube.com/watch?v=...")  # YouTubeReader

When using knowledge.insert(), this happens automatically.
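
The selection logic can be pictured as a lookup on the URL host or the file suffix. The sketch below is an assumption about the general approach, not ReaderFactory's actual code: `pick_reader_name` and both lookup tables are hypothetical, and they contain only the three mappings shown in the example above.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup tables, limited to the mappings shown above.
EXTENSION_TO_READER = {".pdf": "PDFReader", ".csv": "CSVReader"}
HOST_TO_READER = {"youtube.com": "YouTubeReader"}


def pick_reader_name(source: str) -> Optional[str]:
    """Return a reader name for a path or URL, or None if nothing matches."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # URLs: a known host wins, otherwise fall back to the path suffix.
        host = (parsed.hostname or "").lower()
        if host in HOST_TO_READER:
            return HOST_TO_READER[host]
        suffix = PurePosixPath(parsed.path).suffix.lower()
    else:
        # Local paths: match on the file extension, ignoring case.
        suffix = PurePosixPath(source).suffix.lower()
    return EXTENSION_TO_READER.get(suffix)


print(pick_reader_name("report.PDF"))
print(pick_reader_name("https://youtube.com/watch?v=abc"))
print(pick_reader_name("notes.xyz"))
```

Returning `None` for unknown sources leaves the caller free to pick a fallback or to require an explicit reader.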

Configuration

Chunking

reader = PDFReader(
    chunk=True,       # Enable chunking (default: True)
    chunk_size=5000,  # Characters per chunk
)

Format-Specific Options

# PDF with encryption and OCR
reader = PDFReader(
    password="secret",
    read_images=True,     # OCR for images
    split_on_pages=True,  # One document per page
)

# CSV with custom encoding
reader = CSVReader(
    encoding="latin-1",
)

# Text with encoding override
reader = TextReader(
    encoding="utf-8",
)

Runtime Options

Override settings when calling read():

documents = reader.read(
    "file.pdf",
    name="custom_document_name",  # Override default naming
    password="runtime_password",  # Password at read time
)

Async Processing

All readers provide an async_read() method, so I/O-bound reads can run concurrently instead of one after another:

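import asyncio

async def main():
    # Single file
    documents = await reader.async_read("file.pdf")

    # Batch processing: files is a list of paths; reads run concurrently
    tasks = [reader.async_read(file) for file in files]
    # gather returns one result per file, in the same order as files
    all_documents = await asyncio.gather(*tasks)

# await is only valid inside a coroutine, so run main() on an event loop
asyncio.run(main())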
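
The batch pattern above can be tried without any real files. The sketch below uses a hypothetical `StubReader` that sleeps instead of doing I/O, caps concurrency with a semaphore (an addition that the original snippet does not have), and flattens the per-file results into one list.

```python
import asyncio
from typing import List


class StubReader:
    """Stand-in for a reader; sleeps briefly instead of doing real I/O."""

    async def async_read(self, path: str) -> List[str]:
        await asyncio.sleep(0.01)
        return [f"{path}#chunk1", f"{path}#chunk2"]


async def read_many(reader: StubReader, paths: List[str], limit: int = 4) -> List[str]:
    """Read all paths concurrently, with at most `limit` reads in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def read_one(path: str) -> List[str]:
        async with semaphore:
            return await reader.async_read(path)

    # gather preserves input order and yields one list per path.
    per_file = await asyncio.gather(*(read_one(path) for path in paths))
    # Flatten the list of lists into a single list of documents.
    return [doc for docs in per_file for doc in docs]


flat = asyncio.run(read_many(StubReader(), ["a.pdf", "b.pdf"]))
print(flat)
```

A concurrency cap is worth considering for large batches, since launching every read at once can exhaust file handles or trip rate limits on remote sources.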

Custom Chunking Strategy

Override the default chunking behavior:

from kern.knowledge.chunking.semantic_chunking import SemanticChunking

reader = PDFReader(
    chunk=True,
    chunking_strategy=SemanticChunking(),
)

See Chunking for available strategies.
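
The idea of a pluggable strategy can be illustrated independently of Kern. This page does not show Kern's chunking interface, so the `ChunkingStrategy` protocol, its `chunk` method, and both classes below are hypothetical. The sketch only shows how swapping the strategy changes where the text is cut.

```python
from typing import List, Protocol


class ChunkingStrategy(Protocol):
    """Hypothetical interface: anything with a chunk(text) method."""

    def chunk(self, text: str) -> List[str]: ...


class FixedSizeChunking:
    """Cuts every chunk_size characters, even in the middle of a sentence."""

    def __init__(self, chunk_size: int) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> List[str]:
        return [text[i:i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]


class ParagraphChunking:
    """Packs whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole.
    """

    def __init__(self, max_chars: int) -> None:
        self.max_chars = max_chars

    def chunk(self, text: str) -> List[str]:
        chunks: List[str] = []
        current = ""
        for paragraph in (part.strip() for part in text.split("\n\n")):
            if not paragraph:
                continue
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if len(candidate) <= self.max_chars or not current:
                current = candidate
            else:
                chunks.append(current)
                current = paragraph
        if current:
            chunks.append(current)
        return chunks


def split(text: str, strategy: ChunkingStrategy) -> List[str]:
    """Apply whichever strategy the caller passes in."""
    return strategy.chunk(text)


text = "aaaa\n\nbbbb\n\ncccccccccccc"
print(split(text, FixedSizeChunking(10)))
print(split(text, ParagraphChunking(10)))
```

The fixed-size strategy cuts through paragraph boundaries, while the paragraph strategy keeps each paragraph intact at the cost of uneven chunk lengths.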

Restricting URL Fetches

By default, a URL-fetching reader will fetch any URL passed to it. Use allowed_hosts to restrict the reader to a fixed hostname allowlist. URLs outside the list are skipped and return no documents. Matching is case-insensitive and applies to the whole hostname, so list every subdomain you want to permit.

reader = WebsiteReader(allowed_hosts=["kern.ndx.rocks"])

WebsiteReader, WebSearchReader, and LLMsTxtReader also re-check the allowlist on each redirect, so an allowed host can't redirect to a blocked one. FirecrawlReader and DoclingReader validate the initial URL only.
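
The matching rules described above (case-insensitive, whole hostname, re-checked per redirect) can be sketched with the standard library. This is an illustration of those rules, not the readers' actual code; `is_allowed` and `redirect_chain_allowed` are hypothetical names.

```python
from typing import Iterable
from urllib.parse import urlsplit


def is_allowed(url: str, allowed_hosts: Iterable[str]) -> bool:
    """True only if the URL's full hostname is in the allowlist."""
    # urlsplit lowercases the hostname and returns None when there is none.
    host = urlsplit(url).hostname
    if host is None:
        return False
    allowed = {entry.lower() for entry in allowed_hosts}
    # Exact comparison: subdomains and lookalike suffixes do not match.
    return host in allowed


def redirect_chain_allowed(urls: Iterable[str], allowed_hosts: Iterable[str]) -> bool:
    """True only if every hop in a redirect chain stays on an allowed host."""
    allowed = list(allowed_hosts)
    return all(is_allowed(url, allowed) for url in urls)


hosts = ["kern.ndx.rocks"]
print(is_allowed("https://KERN.ndx.rocks/docs", hosts))
print(is_allowed("https://docs.kern.ndx.rocks/", hosts))
print(redirect_chain_allowed(["https://kern.ndx.rocks/a", "https://example.com/b"], hosts))
```

Comparing whole hostnames rather than suffixes is what makes listing each subdomain necessary, and it also rejects lookalikes such as `kern.ndx.rocks.example.com`.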

Error Handling

Readers return an empty list when processing fails. Check logs for debugging information:

documents = reader.read("corrupted.pdf")
if not documents:
    print("Failed to read file, check logs for details")
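
When reading many files, the same empty-list check can be used to collect failures in one pass. The helper below is a sketch: `read_all` and `fake_read` are hypothetical, and `fake_read` merely stands in for a call such as `reader.read`.

```python
from typing import Callable, Dict, List, Tuple


def read_all(
    read: Callable[[str], list], paths: List[str]
) -> Tuple[Dict[str, list], List[str]]:
    """Read every path; return documents by path plus the paths that came back empty."""
    succeeded: Dict[str, list] = {}
    failed: List[str] = []
    for path in paths:
        documents = read(path)
        if documents:
            succeeded[path] = documents
        else:
            # An empty list signals a failed read; the logs hold the reason.
            failed.append(path)
    return succeeded, failed


def fake_read(path: str) -> list:
    """Stand-in reader: pretends any path containing 'corrupted' fails."""
    return [] if "corrupted" in path else [f"text from {path}"]


succeeded, failed = read_all(fake_read, ["ok.pdf", "corrupted.pdf"])
print("read:", list(succeeded))
print("failed:", failed)
```

An empty result may also mean the file had no extractable text, so treat the `failed` list as a set of candidates to check against the logs.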

Next Steps