ScrapeGraph
ScrapeGraphTools enable an Agent to extract structured data from webpages, convert content to markdown, and retrieve raw HTML content using the ScrapeGraphAI API.
The toolkit provides 5 core capabilities:
- smartscraper: Extract structured data using natural language prompts
- markdownify: Convert web pages to markdown format
- searchscraper: Search the web and extract information
- crawl: Crawl websites with structured data extraction
- scrape: Get raw HTML content from websites
The scrape method is particularly useful when you need:
- Complete HTML source code
- Raw content for further processing
- HTML structure analysis
- Content that needs to be parsed differently
All methods support heavy JavaScript rendering when needed.
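As a rough illustration of the "HTML structure analysis" and "further processing" use cases, the sketch below parses an HTML string with Python's standard-library `html.parser`. The `StructureParser` class, the `analyze_html` helper, and the sample markup are hypothetical stand-ins; in practice the input would be whatever HTML the scrape tool returns.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collect tag counts and link targets from an HTML string."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Count every opening tag and remember href values of anchors.
        self.tag_counts[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def analyze_html(raw_html: str) -> dict:
    """Return tag counts and links found in raw_html."""
    parser = StructureParser()
    parser.feed(raw_html)
    parser.close()
    return {"tag_counts": dict(parser.tag_counts), "links": parser.links}


# Hypothetical sample standing in for HTML returned by the scrape tool.
sample_html = """
<html><body>
  <h1>Science</h1>
  <a href="/story/one">One</a>
  <a href="/story/two">Two</a>
  <a>No target</a>
</body></html>
"""

summary = analyze_html(sample_html)
print(summary["tag_counts"]["a"])  # 3
print(summary["links"])            # ['/story/one', '/story/two']
```

Working on the raw HTML this way keeps the parsing logic under your control, which is the main reason to prefer scrape over smartscraper when the page needs to be parsed differently.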
Prerequisites
The following examples require the scrapegraph-py library.

    uv pip install -U scrapegraph-py

Optionally, if your ScrapeGraph configuration or specific models require an API key, set the SGAI_API_KEY environment variable:

    export SGAI_API_KEY="YOUR_SGAI_API_KEY"

Example
The following agent will extract structured data from a website using the smartscraper tool:

    from kern.agent import Agent
    from kern.models.openai import OpenAIResponses
    from kern.tools.scrapegraph import ScrapeGraphTools

    agent_model = OpenAIResponses(id="gpt-5.2")
    scrapegraph_smartscraper = ScrapeGraphTools(enable_smartscraper=True)

    agent = Agent(
        tools=[scrapegraph_smartscraper], model=agent_model, markdown=True, stream=True
    )

    agent.print_response("""
    Use smartscraper to extract the following from https://www.wired.com/category/science/:
    - News articles
    - Headlines
    - Images
    - Links
    - Author
    """)

Raw HTML Scraping
Get complete HTML content from websites for custom processing:

    # Enable scrape method for raw HTML content
    scrapegraph_scrape = ScrapeGraphTools(enable_scrape=True, enable_smartscraper=False)

    scrape_agent = Agent(
        tools=[scrapegraph_scrape],
        model=agent_model,
        markdown=True,
        stream=True,
    )

    scrape_agent.print_response(
        "Use the scrape tool to get the complete raw HTML content from https://en.wikipedia.org/wiki/2025_FIFA_Club_World_Cup"
    )

All Functions with JavaScript Rendering
Enable all ScrapeGraph functions with heavy JavaScript support:

    # Enable all ScrapeGraph functions
    scrapegraph_all = Agent(
        tools=[
            # render_heavy_js=True turns on heavy JavaScript rendering for every function
            ScrapeGraphTools(all=True, render_heavy_js=True)
        ],
        model=agent_model,
        markdown=True,
        stream=True,
    )

    scrapegraph_all.print_response("""
    Use any appropriate scraping method to extract comprehensive information from https://www.wired.com/category/science/:
    - News articles and headlines
    - Convert to markdown if needed
    - Search for specific information
    """)

Toolkit Params
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | Optional[str] | None | ScrapeGraph API key. If not provided, uses SGAI_API_KEY environment variable. |
| enable_smartscraper | bool | True | Enable the smartscraper function for LLM-powered data extraction. |
| enable_markdownify | bool | False | Enable the markdownify function for webpage to markdown conversion. |
| enable_crawl | bool | False | Enable the crawl function for website crawling and data extraction. |
| enable_searchscraper | bool | False | Enable the searchscraper function for web search and information extraction. |
| enable_agentic_crawler | bool | False | Enable the agentic_crawler function for automated browser actions and AI extraction. |
| enable_scrape | bool | False | Enable the scrape function for retrieving raw HTML content from websites. |
| render_heavy_js | bool | False | Enable heavy JavaScript rendering for all scraping functions. Useful for SPAs and dynamic content. |
| all | bool | False | Enable all available functions. When True, all enable flags are ignored. |
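The way these parameters combine can be modelled in a few lines of plain Python. This is only a sketch of the behaviour described in the table, not the toolkit's actual implementation: `DEFAULT_FLAGS`, `resolve_api_key`, and `enabled_functions` are hypothetical names, and `enable_all` stands in for the `all` parameter to avoid shadowing the Python builtin.

```python
import os
from typing import List, Optional

# Flag defaults copied from the parameter table (hypothetical model only).
DEFAULT_FLAGS = {
    "smartscraper": True,
    "markdownify": False,
    "crawl": False,
    "searchscraper": False,
    "agentic_crawler": False,
    "scrape": False,
}


def resolve_api_key(api_key: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit key, otherwise fall back to SGAI_API_KEY."""
    return api_key or os.environ.get("SGAI_API_KEY")


def enabled_functions(enable_all: bool = False, **enable_flags: bool) -> List[str]:
    """List the functions that would be exposed for a given set of flags."""
    if enable_all:
        # Per the table, individual enable flags are ignored in this case.
        return list(DEFAULT_FLAGS)
    return [
        name
        for name, default in DEFAULT_FLAGS.items()
        if enable_flags.get(f"enable_{name}", default)
    ]


print(enabled_functions())
# ['smartscraper']
print(enabled_functions(enable_scrape=True, enable_smartscraper=False))
# ['scrape']
```

Note that smartscraper is on by default, which is why the raw HTML example passes `enable_smartscraper=False` alongside `enable_scrape=True`.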
Toolkit Functions
| Function | Description |
|---|---|
| smartscraper | Extract structured data from a webpage using LLM and natural language prompt. Parameters: url (str), prompt (str). |
| markdownify | Convert a webpage to markdown format. Parameters: url (str). |
| crawl | Crawl a website and extract structured data. Parameters: url (str), prompt (str), data_schema (dict), cache_website (bool), depth (int), max_pages (int), same_domain_only (bool), batch_size (int). |
| searchscraper | Search the web and extract information. Parameters: user_prompt (str). |
| agentic_crawler | Perform automated browser actions with optional AI extraction. Parameters: url (str), steps (List[str]), use_session (bool), user_prompt (Optional[str]), output_schema (Optional[dict]), ai_extraction (bool). |
| scrape | Get raw HTML content from a website. Useful for complete source code retrieval and custom processing. Parameters: website_url (str), headers (Optional[dict]). |
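Since crawl takes the most parameters, it can help to see one possible argument set written out. The sketch below is illustrative only: the values are made up, the shape of `data_schema` assumes JSON Schema conventions (the table only says it is a dict), and `check_crawl_args` is a hypothetical helper that compares values against the types listed in the crawl row.

```python
# Parameter names and types copied from the crawl row of the table above.
CRAWL_PARAM_TYPES = {
    "url": str,
    "prompt": str,
    "data_schema": dict,
    "cache_website": bool,
    "depth": int,
    "max_pages": int,
    "same_domain_only": bool,
    "batch_size": int,
}


def check_crawl_args(args: dict) -> list:
    """Return a list of problems; an empty list means the types line up."""
    problems = []
    for name, value in args.items():
        expected = CRAWL_PARAM_TYPES.get(name)
        if expected is None:
            problems.append(f"unknown parameter: {name}")
            continue
        # bool is a subclass of int in Python, so reject it for int fields.
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            problems.append(
                f"{name} should be {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Illustrative values; data_schema is assumed to follow JSON Schema conventions.
crawl_args = {
    "url": "https://www.wired.com/category/science/",
    "prompt": "Extract the headline and author of each article",
    "data_schema": {
        "type": "object",
        "properties": {
            "headline": {"type": "string"},
            "author": {"type": "string"},
        },
        "required": ["headline"],
    },
    "depth": 2,
    "max_pages": 5,
    "same_domain_only": True,
}

print(check_crawl_args(crawl_args))  # []
```

When the toolkit is used through an Agent, the model fills in these arguments itself; a check like this is mainly useful for reasoning about what a well-formed call looks like.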