ScrapeGraph

ScrapeGraphTools enable an Agent to extract structured data from webpages, convert content to markdown, and retrieve raw HTML content using the ScrapeGraphAI API.

The toolkit provides 5 core capabilities:

  1. smartscraper: Extract structured data using natural language prompts
  2. markdownify: Convert web pages to markdown format
  3. searchscraper: Search the web and extract information
  4. crawl: Crawl websites with structured data extraction
  5. scrape: Get raw HTML content from websites (NEW!)

The scrape method is particularly useful when you need:

  • Complete HTML source code
  • Raw content for further processing
  • HTML structure analysis
  • Content that needs to be parsed differently

All methods support heavy JavaScript rendering when needed.

Prerequisites

The following examples require the scrapegraph-py library.

uv pip install -U scrapegraph-py

If you do not pass api_key to the toolkit directly, set the SGAI_API_KEY environment variable, which the toolkit reads as a fallback:

export SGAI_API_KEY="YOUR_SGAI_API_KEY"
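The lookup order described under Toolkit Params (an explicit api_key first, then the SGAI_API_KEY environment variable) can be sketched with a small helper. This is only an illustration: resolve_api_key is a hypothetical name, not part of scrapegraph-py or the toolkit, and the error raised on a missing key is an assumption about how you might want to fail.

```python
import os


def resolve_api_key(explicit_key=None, env_var="SGAI_API_KEY"):
    """Return the explicit key if given, otherwise the environment value.

    Hypothetical helper for illustration; it mirrors the documented
    fallback order but is not the toolkit's own code.
    """
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} or pass api_key explicitly")
    return key
```

The returned value could then be passed as ScrapeGraphTools(api_key=...), so that a missing key surfaces before the agent runs.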

Example

The following agent will extract structured data from a website using the smartscraper tool:

from kern.agent import Agent
from kern.models.openai import OpenAIResponses
from kern.tools.scrapegraph import ScrapeGraphTools

agent_model = OpenAIResponses(id="gpt-5.2")
scrapegraph_smartscraper = ScrapeGraphTools(enable_smartscraper=True)

agent = Agent(
    tools=[scrapegraph_smartscraper], model=agent_model, markdown=True, stream=True
)

agent.print_response("""
Use smartscraper to extract the following from https://www.wired.com/category/science/:
- News articles
- Headlines
- Images
- Links
- Author
""")

Raw HTML Scraping

Get complete HTML content from websites for custom processing:

# Reuses Agent, ScrapeGraphTools, and agent_model from the example above.
# Enable the scrape method for raw HTML content; smartscraper is on by
# default, so it is switched off explicitly here.
scrapegraph_scrape = ScrapeGraphTools(enable_scrape=True, enable_smartscraper=False)

scrape_agent = Agent(
    tools=[scrapegraph_scrape],
    model=agent_model,
    markdown=True,
    stream=True,
)

scrape_agent.print_response(
    "Use the scrape tool to get the complete raw HTML content from https://en.wikipedia.org/wiki/2025_FIFA_Club_World_Cup"
)
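As one possible form of the custom processing mentioned above, the raw HTML can be summarized with Python's standard html.parser module. The sketch below counts tags (a rough view of page structure) and collects link targets. PageOutline, outline, and sample_html are hypothetical names for illustration, and the sample string stands in for whatever HTML the scrape tool returns.

```python
from collections import Counter
from html.parser import HTMLParser


class PageOutline(HTMLParser):
    """Collect a tag histogram and all href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; attrs is a list of (name, value) pairs.
        self.tag_counts[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def outline(html):
    """Parse an HTML string and return the populated PageOutline."""
    parser = PageOutline()
    parser.feed(html)
    parser.close()
    return parser


# Stand-in for HTML returned by the scrape tool (assumed, not real output).
sample_html = (
    "<html><body><h1>Title</h1>"
    '<p>See <a href="/a">A</a> and <a href="https://example.com/b">B</a>.</p>'
    "<p>Done</p></body></html>"
)

result = outline(sample_html)
print(result.tag_counts.most_common(3))
print(result.links)
```

The standard-library parser keeps the sketch dependency-free; a dedicated HTML library may be a better fit for malformed or very large pages.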

All Functions with JavaScript Rendering

Enable all ScrapeGraph functions with heavy JavaScript support:

# Reuses Agent, ScrapeGraphTools, and agent_model from the first example.
# Enable all ScrapeGraph functions
scrapegraph_all = Agent(
    tools=[
        # render_heavy_js=True turns on heavy JavaScript rendering for every function
        ScrapeGraphTools(all=True, render_heavy_js=True)
    ],
    model=agent_model,
    markdown=True,
    stream=True,
)

scrapegraph_all.print_response("""
Use any appropriate scraping method to extract comprehensive information from https://www.wired.com/category/science/:
- News articles and headlines
- Convert to markdown if needed
- Search for specific information
""")

Toolkit Params

Parameter | Type | Default | Description
--- | --- | --- | ---
api_key | Optional[str] | None | ScrapeGraph API key. If not provided, uses the SGAI_API_KEY environment variable.
enable_smartscraper | bool | True | Enable the smartscraper function for LLM-powered data extraction.
enable_markdownify | bool | False | Enable the markdownify function for webpage-to-markdown conversion.
enable_crawl | bool | False | Enable the crawl function for website crawling and data extraction.
enable_searchscraper | bool | False | Enable the searchscraper function for web search and information extraction.
enable_agentic_crawler | bool | False | Enable the agentic_crawler function for automated browser actions and AI extraction.
enable_scrape | bool | False | Enable the scrape function for retrieving raw HTML content from websites.
render_heavy_js | bool | False | Enable heavy JavaScript rendering for all scraping functions. Useful for SPAs and dynamic content.
all | bool | False | Enable all available functions. When True, the individual enable flags are ignored.
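The way the enable flags and all combine can be modeled in a few lines. The sketch below is a reading of the table above, not the toolkit's actual implementation: enabled_functions and DEFAULT_FLAGS are hypothetical names, and the defaults are copied from the Default column.

```python
# Defaults copied from the Toolkit Params table.
DEFAULT_FLAGS = {
    "enable_smartscraper": True,
    "enable_markdownify": False,
    "enable_crawl": False,
    "enable_searchscraper": False,
    "enable_agentic_crawler": False,
    "enable_scrape": False,
}


def enabled_functions(all=False, **enable_flags):
    """Return the function names that a given flag combination would enable.

    Illustrative model only; the parameter name `all` mirrors the toolkit's
    parameter and shadows the builtin inside this function.
    """
    flags = dict(DEFAULT_FLAGS)
    for name, value in enable_flags.items():
        if name not in flags:
            raise TypeError(f"unexpected flag: {name}")
        flags[name] = bool(value)

    prefix_len = len("enable_")
    if all:
        # all=True wins: individual enable flags are ignored.
        return [name[prefix_len:] for name in flags]
    return [name[prefix_len:] for name, on in flags.items() if on]


print(enabled_functions())
print(enabled_functions(enable_scrape=True, enable_smartscraper=False))
print(enabled_functions(all=True, enable_smartscraper=False))
```

Because enable_smartscraper defaults to True, a toolkit meant to expose only one other function needs enable_smartscraper=False as well, as in the raw HTML example.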

Toolkit Functions

Function | Description
--- | ---
smartscraper | Extract structured data from a webpage using an LLM and a natural language prompt. Parameters: url (str), prompt (str).
markdownify | Convert a webpage to markdown format. Parameters: url (str).
crawl | Crawl a website and extract structured data. Parameters: url (str), prompt (str), data_schema (dict), cache_website (bool), depth (int), max_pages (int), same_domain_only (bool), batch_size (int).
searchscraper | Search the web and extract information. Parameters: user_prompt (str).
agentic_crawler | Perform automated browser actions with optional AI extraction. Parameters: url (str), steps (List[str]), use_session (bool), user_prompt (Optional[str]), output_schema (Optional[dict]), ai_extraction (bool).
scrape | Get raw HTML content from a website. Useful for complete source code retrieval and custom processing. Parameters: website_url (str), headers (Optional[dict]).
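Parameter names are not uniform across functions: scrape takes website_url where most others take url, and searchscraper takes user_prompt rather than prompt. A small lookup built from the Toolkit Functions table can catch a mismatched name before a call is made. This is a sketch only; ACCEPTED_PARAMS and unknown_params are hypothetical names, and the table does not say which parameters are required, so the check covers unknown names only.

```python
# Parameter names copied from the Toolkit Functions table.
ACCEPTED_PARAMS = {
    "smartscraper": {"url", "prompt"},
    "markdownify": {"url"},
    "crawl": {
        "url", "prompt", "data_schema", "cache_website",
        "depth", "max_pages", "same_domain_only", "batch_size",
    },
    "searchscraper": {"user_prompt"},
    "agentic_crawler": {
        "url", "steps", "use_session",
        "user_prompt", "output_schema", "ai_extraction",
    },
    "scrape": {"website_url", "headers"},
}


def unknown_params(function, kwargs):
    """Return the sorted argument names the table does not list for `function`."""
    if function not in ACCEPTED_PARAMS:
        raise KeyError(f"unknown function: {function}")
    return sorted(set(kwargs) - ACCEPTED_PARAMS[function])


# scrape expects website_url, so "url" is reported as unknown.
print(unknown_params("scrape", {"url": "https://example.com"}))
print(unknown_params("smartscraper", {"url": "https://example.com", "prompt": "List headlines"}))
```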

Developer Resources