ScrapeGraph

ScrapeGraphTools enable an Agent to extract structured data from webpages, convert content to markdown, and retrieve raw HTML content using the ScrapeGraphAI API.

The toolkit provides 5 core capabilities:

smartscraper: Extract structured data using natural language prompts
markdownify: Convert web pages to markdown format
searchscraper: Search the web and extract information
crawl: Crawl websites with structured data extraction
scrape: Get raw HTML content from websites (NEW!)

The scrape method is particularly useful when you need:

Complete HTML source code
Raw content for further processing
HTML structure analysis
Content that needs to be parsed differently
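As one illustration of "further processing", the raw HTML returned by scrape could be fed to a parser of your choice. The sketch below uses only Python's standard-library html.parser to collect link targets; LinkCollector, extract_links, and the sample HTML are hypothetical names and data for illustration, not part of the toolkit.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return every non-empty href found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for HTML that a scrape call might return (illustrative only)
sample_html = (
    '<html><body><a href="/a">A</a><p>text</p>'
    '<a href="https://example.com/b">B</a></body></html>'
)
print(extract_links(sample_html))  # ['/a', 'https://example.com/b']
```

The same pattern applies to any custom analysis: keep the scraping step generic and put the page-specific logic in your own parser.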

All methods support heavy JavaScript rendering when needed; enable it with the render_heavy_js parameter.

Prerequisites

The following examples require the scrapegraph-py library.

uv pip install -U scrapegraph-py

If your setup requires an API key, set the SGAI_API_KEY environment variable. The toolkit reads it when no api_key argument is passed:

export SGAI_API_KEY="YOUR_SGAI_API_KEY"
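According to the Toolkit Params table, the api_key parameter falls back to the SGAI_API_KEY environment variable when it is not provided. A minimal sketch of that kind of fallback, useful for checking your environment before building an agent, might look like the following. resolve_api_key is a hypothetical helper written for this illustration; it is not the toolkit's actual implementation.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> Optional[str]:
    """Return the explicit key if given, otherwise fall back to SGAI_API_KEY."""
    # Assumption: an empty string is treated the same as "not provided"
    return api_key or os.getenv("SGAI_API_KEY")


if resolve_api_key() is None:
    print("SGAI_API_KEY is not set; export it or pass api_key explicitly.")
```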

Example

The following agent will extract structured data from a website using the smartscraper tool:

from kern.agent import Agent
from kern.models.openai import OpenAIResponses
from kern.tools.scrapegraph import ScrapeGraphTools

agent_model = OpenAIResponses(id="gpt-5.2")
scrapegraph_smartscraper = ScrapeGraphTools(enable_smartscraper=True)

agent = Agent(
    tools=[scrapegraph_smartscraper], model=agent_model, markdown=True, stream=True
)

agent.print_response("""
Use smartscraper to extract the following from https://www.wired.com/category/science/:
- News articles
- Headlines
- Images
- Links
- Author
""")

Raw HTML Scraping

Get complete HTML content from websites for custom processing:

# Reuses Agent, ScrapeGraphTools, and agent_model from the example above
# Enable scrape method for raw HTML content
scrapegraph_scrape = ScrapeGraphTools(enable_scrape=True, enable_smartscraper=False)

scrape_agent = Agent(
    tools=[scrapegraph_scrape],
    model=agent_model,
    markdown=True,
    stream=True,
)

scrape_agent.print_response(
    "Use the scrape tool to get the complete raw HTML content from https://en.wikipedia.org/wiki/2025_FIFA_Club_World_Cup"
)

All Functions with JavaScript Rendering

Enable all ScrapeGraph functions with heavy JavaScript support:

# Enable all ScrapeGraph functions (reuses agent_model from the first example)
scrapegraph_all = Agent(
    tools=[
        # render_heavy_js=True turns on heavy JavaScript rendering for every function
        ScrapeGraphTools(all=True, render_heavy_js=True)
    ],
    model=agent_model,
    markdown=True,
    stream=True,
)

scrapegraph_all.print_response("""
Use any appropriate scraping method to extract comprehensive information from https://www.wired.com/category/science/:
- News articles and headlines
- Convert to markdown if needed
- Search for specific information
""")

Note: View the Startup Analyst example

Toolkit Params

Parameter	Type	Default	Description
`api_key`	`Optional[str]`	`None`	ScrapeGraph API key. If not provided, uses SGAI_API_KEY environment variable.
`enable_smartscraper`	`bool`	`True`	Enable the smartscraper function for LLM-powered data extraction.
`enable_markdownify`	`bool`	`False`	Enable the markdownify function for webpage to markdown conversion.
`enable_crawl`	`bool`	`False`	Enable the crawl function for website crawling and data extraction.
`enable_searchscraper`	`bool`	`False`	Enable the searchscraper function for web search and information extraction.
`enable_agentic_crawler`	`bool`	`False`	Enable the agentic_crawler function for automated browser actions and AI extraction.
`enable_scrape`	`bool`	`False`	Enable the scrape function for retrieving raw HTML content from websites.
`render_heavy_js`	`bool`	`False`	Enable heavy JavaScript rendering for all scraping functions. Useful for SPAs and dynamic content.
`all`	`bool`	`False`	Enable all available functions. When True, the individual enable_* flags are ignored.
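To make the interaction between the flags concrete, here is a rough model of the behavior the table describes: each function has its own enable flag, and all=True overrides them. enabled_functions is a hypothetical stand-in written for this illustration; the real toolkit's internals may differ.

```python
from typing import List

# Function names in the order they appear in the parameter table
ALL_FUNCTIONS = [
    "smartscraper",
    "markdownify",
    "crawl",
    "searchscraper",
    "agentic_crawler",
    "scrape",
]


def enabled_functions(
    enable_smartscraper: bool = True,
    enable_markdownify: bool = False,
    enable_crawl: bool = False,
    enable_searchscraper: bool = False,
    enable_agentic_crawler: bool = False,
    enable_scrape: bool = False,
    all: bool = False,  # mirrors the documented parameter name
) -> List[str]:
    """Return the function names that would be active for the given flags."""
    if all:
        # Per the table, individual flags are ignored when all=True
        return list(ALL_FUNCTIONS)
    flags = {
        "smartscraper": enable_smartscraper,
        "markdownify": enable_markdownify,
        "crawl": enable_crawl,
        "searchscraper": enable_searchscraper,
        "agentic_crawler": enable_agentic_crawler,
        "scrape": enable_scrape,
    }
    return [name for name, is_on in flags.items() if is_on]


print(enabled_functions())  # ['smartscraper']
print(enabled_functions(enable_scrape=True, enable_smartscraper=False))  # ['scrape']
```

Note that smartscraper is on by default, so a toolkit meant to expose only one other function needs enable_smartscraper=False as well.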

Toolkit Functions

Function	Description
`smartscraper`	Extract structured data from a webpage using LLM and natural language prompt. Parameters: url (str), prompt (str).
`markdownify`	Convert a webpage to markdown format. Parameters: url (str).
`crawl`	Crawl a website and extract structured data. Parameters: url (str), prompt (str), data_schema (dict), cache_website (bool), depth (int), max_pages (int), same_domain_only (bool), batch_size (int).
`searchscraper`	Search the web and extract information. Parameters: user_prompt (str).
`agentic_crawler`	Perform automated browser actions with optional AI extraction. Parameters: url (str), steps (List[str]), use_session (bool), user_prompt (Optional[str]), output_schema (Optional[dict]), ai_extraction (bool).
`scrape`	Get raw HTML content from a website. Useful for complete source code retrieval and custom processing. Parameters: website_url (str), headers (Optional[dict]).
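The crawl function takes the most parameters, so it may help to see them side by side. The sketch below assembles an illustrative set of arguments using the parameter names from the table. All values are made up for the example, and the JSON-Schema-style shape of data_schema is an assumption: the table only states that it is a dict. In normal use the agent supplies these arguments itself when it calls the tool.

```python
import json

# Illustrative arguments for a crawl call; names come from the table above,
# values are hypothetical.
crawl_args = {
    "url": "https://example.com",
    "prompt": "Extract the title and a one-sentence summary of each page",
    # Assumption: data_schema follows a JSON-Schema-like layout
    "data_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "summary": {"type": "string"},
        },
        "required": ["title"],
    },
    "cache_website": True,
    "depth": 2,
    "max_pages": 5,
    "same_domain_only": True,
    "batch_size": 1,
}

print(json.dumps(crawl_args, indent=2))
```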

Developer Resources

View Tools
View Tests