Web Extraction Agent

Build an AI agent that transforms unstructured web content into organized, structured data by combining Firecrawl's web scraping with Pydantic's structured output validation.

What You'll Learn

By building this agent, you'll understand:

  • How to integrate Firecrawl for reliable web scraping and content extraction
  • How to define structured output schemas using Pydantic models
  • How to create nested data structures for complex web content
  • How to handle optional fields and varied page structures

Use Cases

Build competitive intelligence tools, content aggregation systems, knowledge base constructors, or automated documentation generators.

How It Works

The agent extracts structured data from a web page in four steps:

  1. Fetch: Uses Firecrawl to retrieve and parse the target webpage
  2. Analyze: Identifies key sections, elements, and hierarchical structure
  3. Extract: Pulls information according to the Pydantic output schema
  4. Structure: Organizes content into nested models (sections, metadata, links, contact info)

The Pydantic schema ensures consistent output format regardless of the source website's structure, with optional fields handling varied page layouts gracefully.
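
As a minimal sketch of how optional fields absorb varied layouts, the trimmed-down model below (a hypothetical MiniPage, not the full schema the agent uses) shows optional fields falling back to None when a page lacks them, while a missing required field raises a validation error. It needs only Pydantic and makes no network calls.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class MiniPage(BaseModel):
    """Hypothetical, trimmed-down stand-in for the full page schema."""

    url: str = Field(..., description="URL of the page")
    title: str = Field(..., description="Title of the page")
    description: Optional[str] = Field(None, description="Summary, if any")
    features: Optional[List[str]] = Field(None, description="Feature list, if any")


# A rich page fills every field.
rich_page = MiniPage(
    url="https://example.com/product",
    title="Example Product",
    description="A sample landing page",
    features=["Fast", "Simple"],
)

# A sparse page omits the optional fields; they fall back to None.
sparse_page = MiniPage(url="https://example.com/about", title="About")

# A payload missing a required field is rejected instead of passing silently.
try:
    MiniPage(url="https://example.com/broken")
    rejected = False
except ValidationError:
    rejected = True
```

Both pages produce the same shape of object, so downstream code can rely on the attribute names existing even when their values are None.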

Code

from textwrap import dedent
from typing import Dict, List, Optional

from kern.agent import Agent
from kern.models.openai import OpenAIResponses
from kern.tools.firecrawl import FirecrawlTools
from pydantic import BaseModel, Field
from rich.pretty import pprint


class ContentSection(BaseModel):
    """Represents a section of content from the webpage."""

    heading: Optional[str] = Field(None, description="Section heading")
    content: str = Field(..., description="Section content text")


class PageInformation(BaseModel):
    """Structured representation of a webpage."""

    url: str = Field(..., description="URL of the page")
    title: str = Field(..., description="Title of the page")
    description: Optional[str] = Field(
        None, description="Meta description or summary of the page"
    )
    features: Optional[List[str]] = Field(None, description="Key feature list")
    content_sections: Optional[List[ContentSection]] = Field(
        None, description="Main content sections of the page"
    )
    links: Optional[Dict[str, str]] = Field(
        None, description="Important links found on the page with description"
    )
    contact_info: Optional[Dict[str, str]] = Field(
        None, description="Contact information if available"
    )
    metadata: Optional[Dict[str, str]] = Field(
        None, description="Important metadata from the page"
    )


agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[FirecrawlTools(enable_scrape=True, enable_crawl=True)],
    instructions=dedent("""
        You are an expert web researcher and content extractor. Extract comprehensive, structured information
        from the provided webpage. Focus on:

        1. Accurately capturing the page title, description, and key features
        2. Identifying and extracting main content sections with their headings
        3. Finding important links to related pages or resources
        4. Locating contact information if available
        5. Extracting relevant metadata that provides context about the site

        Be thorough but concise. If the page has extensive content, prioritize the most important information.
    """).strip(),
    output_schema=PageInformation,
)

result = agent.run("Extract all information from https://www.kern.ndx.rocks")
pprint(result.content)

What to Expect

The agent scrapes the target URL with Firecrawl and returns the result as a structured PageInformation object. The url and title fields are always present; the description, features, content sections with headings, important links, contact information, and additional metadata are filled in when the page provides them.

The structured output ensures consistency and makes the extracted data easy to process, store, or display programmatically. Optional fields handle pages with varying structures gracefully.
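
Because most fields are optional, code that consumes the result should guard against None before iterating. The helper below is a sketch, not part of the agent: render_report is a hypothetical name, and a SimpleNamespace with made-up values stands in for the PageInformation object that result.content would hold, so the snippet runs without API keys.

```python
from types import SimpleNamespace


def render_report(page) -> str:
    """Render a page object with optional fields as a plain-text report."""
    lines = [f"{page.title} ({page.url})"]
    if page.description:
        lines.append(page.description)
    # Optional lists and dicts may be None, so fall back to empty containers.
    for feature in page.features or []:
        lines.append(f"- {feature}")
    for section in page.content_sections or []:
        heading = section.heading or "(untitled section)"
        lines.append(f"{heading}: {section.content}")
    # Which side of each pair holds the URL depends on what the model returns.
    for key, value in (page.links or {}).items():
        lines.append(f"link: {key} -> {value}")
    return "\n".join(lines)


# Stand-in for result.content; the values here are made up for illustration.
page = SimpleNamespace(
    url="https://example.com",
    title="Example",
    description=None,
    features=["Fast"],
    content_sections=[SimpleNamespace(heading=None, content="Hello")],
    links=None,
)
report = render_report(page)
```

In the real script, the same function could be called as render_report(result.content), since it only reads attributes that the schema defines.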

Usage

Set up your virtual environment

macOS/Linux:

uv venv --python 3.12
source .venv/bin/activate

Windows:

uv venv --python 3.12
.venv\Scripts\activate

Set your API keys

export OPENAI_API_KEY=xxx
export FIRECRAWL_API_KEY=xxx
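
Before running the agent, a quick preflight check can confirm that both variables are visible to Python. This is a sketch only: missing_keys and REQUIRED_KEYS are hypothetical names, and it assumes the two variables set above are the only ones the script needs.

```python
import os

# Assumption: these are the only variables the example script requires.
REQUIRED_KEYS = ("OPENAI_API_KEY", "FIRECRAWL_API_KEY")


def missing_keys(env=os.environ):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```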

Install dependencies

uv pip install -U kern-ai openai firecrawl-py

Run Agent

python web_extraction_agent.py

Next Steps

  • Change the target URL to extract data from different websites
  • Modify the PageInformation Pydantic model to capture additional fields
  • Adjust the agent's instructions to focus on specific content types
  • Explore Firecrawl Tools for advanced scraping options
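
To capture additional fields, one option is to subclass the model instead of editing it in place. The sketch below uses a trimmed-down copy of PageInformation and a hypothetical ProductPage; the price and tags fields are invented for illustration and are not part of the original schema.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class PageInformation(BaseModel):
    """Trimmed-down copy of the schema, kept short for illustration."""

    url: str = Field(..., description="URL of the page")
    title: str = Field(..., description="Title of the page")


class ProductPage(PageInformation):
    """Hypothetical extension that adds product-specific fields."""

    price: Optional[str] = Field(None, description="Listed price, if shown")
    tags: Optional[List[str]] = Field(None, description="Product tags, if any")


# The subclass inherits url and title and adds the new optional fields.
page = ProductPage(url="https://example.com/item", title="Item", price="$10")
```

Passing the subclass as output_schema in place of PageInformation should then ask the agent for the extra fields, while code written against the base model keeps working.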