Readers
Convert files, URLs, and text into searchable documents.
Readers transform raw content into Document objects that can be chunked, embedded, and stored in your knowledge base. Each reader handles a specific format (PDF, CSV, Markdown, etc.) and extracts text and metadata.
```python
from kern.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(chunk=True, chunk_size=5000)
documents = reader.read("company_handbook.pdf")
```
How Readers Work
- Parse: Read the raw content using format-specific logic
- Extract: Pull out text and metadata (page numbers, authors, etc.)
- Chunk: Split large content into smaller pieces (if enabled)
- Return: Provide a list of Document objects ready for embedding
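The four steps above can be sketched with a toy plain-text reader. This is a minimal illustration, not Kern's implementation: `ToyDocument`, `ToyTextReader`, the `read_text` method, and the fixed-size character chunking are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document; not Kern's class."""
    content: str
    name: str
    meta_data: dict = field(default_factory=dict)


class ToyTextReader:
    """Illustrative reader that walks through parse, extract, chunk, return."""

    def __init__(self, chunk: bool = True, chunk_size: int = 5000):
        self.chunk = chunk
        self.chunk_size = chunk_size

    def read_text(self, raw: bytes, name: str) -> list[ToyDocument]:
        # Parse: decode the raw bytes using format-specific logic.
        text = raw.decode("utf-8")
        # Extract: keep the text and attach simple metadata.
        meta = {"source": name, "length": len(text)}
        if not self.chunk or len(text) <= self.chunk_size:
            return [ToyDocument(content=text, name=name, meta_data=meta)]
        # Chunk: split into fixed-size character windows.
        pieces = [
            text[start:start + self.chunk_size]
            for start in range(0, len(text), self.chunk_size)
        ]
        # Return: one document per chunk, tagged with its position.
        return [
            ToyDocument(content=piece, name=name, meta_data={**meta, "chunk": index})
            for index, piece in enumerate(pieces, start=1)
        ]


docs = ToyTextReader(chunk_size=10).read_text(b"abcdefghijklmnopqrstuvwxy", "notes.txt")
```

With a 25-character input and `chunk_size=10`, the sketch yields three documents, the last one holding the five leftover characters.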
```python
# Output structure
Document(
    content="The extracted text...",
    id="unique_id",
    name="document_name",
    meta_data={"page": 1, "source": "handbook.pdf"},
)
```
Supported Readers
| Reader | Description |
|---|---|
| PDFReader | Extract text from PDF files |
| DoclingReader | Process multiple formats via Docling |
| TextReader | Plain text files |
| MarkdownReader | Markdown files |
| CSVReader | CSV files (rows become documents) |
| FieldLabeledCSVReader | CSV rows as field-labeled text |
| JSONReader | JSON files |
| PPTXReader | PowerPoint presentations |
| ArxivReader | Academic papers from arXiv |
| WikipediaReader | Wikipedia articles |
| YouTubeReader | YouTube transcripts |
| WebsiteReader | Crawl websites recursively |
| WebSearchReader | Web search results |
| FirecrawlReader | Web scraping via Firecrawl API |
| LLMsTxtReader | Read llms.txt files |
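To make the difference between the two CSV readers more concrete, here is a rough sketch of what "field-labeled text" could look like for a single row. The exact output format of FieldLabeledCSVReader is not specified on this page, so the `column: value` layout and the `row_to_labeled_text` helper are assumptions for illustration only.

```python
import csv
import io


def row_to_labeled_text(row: dict[str, str]) -> str:
    """Hypothetical helper: render one CSV row as 'column: value' lines."""
    return "\n".join(f"{column}: {value}" for column, value in row.items())


sample = io.StringIO("name,role\nAda,Engineer\nGrace,Admiral\n")
# Each data row becomes its own piece of text, mirroring "rows become documents".
labeled_rows = [row_to_labeled_text(row) for row in csv.DictReader(sample)]
```

Labeling each value with its column name keeps the meaning of a cell attached to it once the row is embedded on its own.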
Using Readers with Knowledge
Pass a reader to knowledge.insert() to override automatic format detection:
```python
from kern.knowledge.knowledge import Knowledge
from kern.knowledge.reader.pdf_reader import PDFReader

# Assumes vector_db is an already-configured vector database instance
knowledge = Knowledge(vector_db=vector_db)

# Use custom reader configuration
reader = PDFReader(chunk_size=3000, split_on_pages=True)
knowledge.insert(path="documents/", reader=reader)
```
Auto-Selection
Kern automatically selects the right reader based on file extension or URL:
```python
from kern.knowledge.reader.reader_factory import ReaderFactory

# By file extension
reader = ReaderFactory.get_reader_for_extension(".pdf")  # PDFReader
reader = ReaderFactory.get_reader_for_extension(".csv")  # CSVReader

# By URL
reader = ReaderFactory.get_reader_for_url("https://youtube.com/watch?v=...")  # YouTubeReader
```
When using knowledge.insert(), this happens automatically.
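Conceptually, extension-based selection is a lookup from file suffix to reader. The sketch below illustrates that idea with plain strings. The mapping, the `None` result for unknown extensions, and the `pick_reader_name` helper are hypothetical and do not describe how ReaderFactory is actually implemented.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping for illustration; ReaderFactory's real table may differ.
READER_BY_EXTENSION = {
    ".pdf": "PDFReader",
    ".csv": "CSVReader",
    ".md": "MarkdownReader",
    ".json": "JSONReader",
    ".pptx": "PPTXReader",
    ".txt": "TextReader",
}


def pick_reader_name(path: str) -> Optional[str]:
    """Return the reader name for a path's extension, or None if unknown."""
    suffix = Path(path).suffix.lower()
    return READER_BY_EXTENSION.get(suffix)


chosen = pick_reader_name("reports/Q3.PDF")
unknown = pick_reader_name("archive.xyz")
```

Lowercasing the suffix before the lookup means `Q3.PDF` and `q3.pdf` resolve to the same reader in this sketch.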
Configuration
Chunking
```python
reader = PDFReader(
    chunk=True,       # Enable chunking (default: True)
    chunk_size=5000,  # Characters per chunk
)
```
Format-Specific Options
```python
# PDF with encryption and OCR
reader = PDFReader(
    password="secret",
    read_images=True,     # OCR for images
    split_on_pages=True,  # One document per page
)

# CSV with custom encoding
reader = CSVReader(
    encoding="latin-1",
)

# Text with encoding override
reader = TextReader(
    encoding="utf-8",
)
```
Runtime Options
Override settings when calling read():
```python
documents = reader.read(
    "file.pdf",
    name="custom_document_name",  # Override default naming
    password="runtime_password",  # Password at read time
)
```
Async Processing
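All readers support async reads, which improves throughput for I/O-bound work such as reading many files at once:
```python
import asyncio


async def main():
    # reader: any reader instance; files: a list of file paths
    # Single file
    documents = await reader.async_read("file.pdf")

    # Batch processing: one list of documents per file
    tasks = [reader.async_read(file) for file in files]
    all_documents = await asyncio.gather(*tasks)


asyncio.run(main())
```
Custom Chunking Strategy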
Override the default chunking behavior:
```python
from kern.knowledge.chunking.semantic_chunking import SemanticChunking

reader = PDFReader(
    chunk=True,
    chunking_strategy=SemanticChunking(),
)
```
See Chunking for available strategies.
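A chunking strategy can be thought of as an object that turns one long text into a list of pieces. The sketch below shows that shape with a paragraph-based splitter. The `ParagraphChunking` class and its `chunk_text` method are assumptions for illustration, not Kern's chunking interface.

```python
class ParagraphChunking:
    """Hypothetical strategy: pack whole paragraphs into chunks up to max_chars."""

    def __init__(self, max_chars: int = 5000):
        self.max_chars = max_chars

    def chunk_text(self, text: str) -> list[str]:
        chunks: list[str] = []
        current = ""
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if len(candidate) <= self.max_chars or not current:
                # Still fits, or a single oversized paragraph kept whole.
                current = candidate
            else:
                # Adding this paragraph would overflow: close the current chunk.
                chunks.append(current)
                current = paragraph
        if current:
            chunks.append(current)
        return chunks


pieces = ParagraphChunking(max_chars=20).chunk_text(
    "alpha beta\n\ngamma\n\ndelta epsilon zeta"
)
```

Unlike fixed-size character windows, this variant never cuts a paragraph in half, at the cost of uneven chunk lengths.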
Restricting URL Fetches
By default, a URL-fetching reader will fetch any URL passed to it. Use allowed_hosts to restrict the reader to a fixed hostname allowlist. URLs outside the list are skipped and return no documents. Matching is case-insensitive and applies to the whole hostname, so list every subdomain you want to permit.
```python
reader = WebsiteReader(allowed_hosts=["kern.ndx.rocks"])
```
WebsiteReader, WebSearchReader, and LLMsTxtReader also re-check the allowlist on each redirect, so an allowed host can't redirect to a blocked one. FirecrawlReader and DoclingReader validate the initial URL only.
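The matching rule described above (case-insensitive, whole hostname, no implicit subdomains) can be sketched with the standard library. `is_allowed` is a hypothetical helper written for this example and is not part of Kern.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_hosts: list[str]) -> bool:
    """Hypothetical check: exact, case-insensitive match on the URL's hostname."""
    hostname = urlparse(url).hostname  # urlparse returns the hostname lowercased
    if hostname is None:
        return False
    return hostname in {host.lower() for host in allowed_hosts}


allowed = ["kern.ndx.rocks"]
same_host = is_allowed("https://KERN.ndx.rocks/docs", allowed)
subdomain = is_allowed("https://api.kern.ndx.rocks/v1", allowed)
lookalike = is_allowed("https://kern.ndx.rocks.example.com/", allowed)
```

Only the first URL passes: the subdomain and the lookalike host are different hostnames, which is why every permitted subdomain has to be listed. A reader that re-checks redirects would apply the same test to each redirect target before following it.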
Error Handling
Readers return an empty list when processing fails. Check logs for debugging information:
```python
documents = reader.read("corrupted.pdf")
if not documents:
    print("Failed to read file, check logs for details")
```
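Because a failed read yields an empty list rather than an exception, batch jobs need to track failures explicitly. Here is a minimal sketch, assuming only that a reader exposes `read(path)` and returns a list. `read_all` and `StubReader` are hypothetical names used to keep the example self-contained.

```python
class StubReader:
    """Stand-in reader for the example: returns [] for '.bad' files."""

    def read(self, path: str) -> list[str]:
        return [] if path.endswith(".bad") else [f"contents of {path}"]


def read_all(reader, paths: list[str]) -> tuple[list[str], list[str]]:
    """Read every path; return (documents, paths that produced nothing)."""
    documents: list[str] = []
    failed: list[str] = []
    for path in paths:
        result = reader.read(path)
        if result:
            documents.extend(result)
        else:
            # An empty list signals a failure (or an empty file); record it.
            failed.append(path)
    return documents, failed


documents, failed = read_all(StubReader(), ["a.pdf", "b.bad", "c.pdf"])
```

Collecting the failed paths in one place makes it easy to cross-reference them against the logs afterwards.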