Commit fad0e1c

feat!: migrate MCP server to ScrapeGraph API v2
Align with scrapegraph-py PR #82: base URL /api/v2, Bearer + SGAI-APIKEY headers, X-SDK-Version, and v2 endpoints for scrape, extract, search, crawl, credits, history, and monitor.

BREAKING CHANGE: Removes sitemap, agentic_scrapper, markdownify_status, and smartscraper_status. Crawl supports markdown/html only (no AI crawl mode). Adds crawl_stop, crawl_resume, credits, sgai_history, and monitor_* tools. Optional SCRAPEGRAPH_API_BASE_URL for custom API hosts.

Made-with: Cursor
1 parent f8161ff commit fad0e1c
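The header scheme named in the commit message (Bearer plus SGAI-APIKEY plus X-SDK-Version) can be sketched as follows. This is an illustrative assumption of the request-header shape, not the server's actual code; the helper name and the Content-Type header are inventions for the sketch, while the SDK-version string mirrors the one documented below.

```python
# Hypothetical sketch of the v2 auth headers described in the commit message.
# Helper name and Content-Type are assumptions; header names come from the commit text.
def build_v2_headers(api_key: str, sdk_version: str = "scrapegraph-mcp@2.0.0") -> dict:
    """Both Bearer and SGAI-APIKEY carry the same key on API v2."""
    return {
        "Authorization": f"Bearer {api_key}",
        "SGAI-APIKEY": api_key,
        "X-SDK-Version": sdk_version,
        "Content-Type": "application/json",
    }

headers = build_v2_headers("sgai-demo-key")
```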

6 files changed

Lines changed: 665 additions & 1391 deletions


.agent/README.md

Lines changed: 15 additions & 10 deletions
@@ -12,7 +12,7 @@ Complete system architecture documentation including:
 - **Technology Stack** - Python 3.10+, FastMCP, httpx dependencies
 - **Project Structure** - File organization and key files
 - **Core Architecture** - MCP design, server architecture, patterns
-- **MCP Tools** - All 5 tools (markdownify, smartscraper, searchscraper, smartcrawler_initiate, smartcrawler_fetch_results)
+- **MCP Tools** - API v2 tools (markdownify, scrape, smartscraper, searchscraper, crawl, credits, history, monitor, …)
 - **API Integration** - ScrapeGraphAI API endpoints and credit system
 - **Deployment** - Smithery, Claude Desktop, Cursor, Docker setup
 - **Recent Updates** - SmartCrawler integration and latest features
@@ -95,7 +95,7 @@ Complete Model Context Protocol integration documentation:

 **...available tools and their parameters:**
 - Read: [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools)
-- Quick reference: 5 tools (markdownify, smartscraper, searchscraper, smartcrawler_initiate, smartcrawler_fetch_results)
+- Quick reference: see README “Available Tools” table (v2: + scrape, crawl_stop/resume, credits, sgai_history, monitor_*; removed sitemap, agentic_scrapper, *\_status tools)

 **...error handling:**
 - Read: [MCP Protocol - Error Handling](./system/mcp_protocol.md#error-handling)
@@ -134,6 +134,7 @@ npx @modelcontextprotocol/inspector scrapegraph-mcp
 **Manual Testing (stdio):**
 ```bash
 echo '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"markdownify","arguments":{"website_url":"https://scrapegraphai.com"}},"id":1}' | scrapegraph-mcp
+# (v2: same tool name; backend calls POST /scrape)
 ```

 **Integration Testing (Claude Desktop):**
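The JSON-RPC line used in the manual stdio test above can be built programmatically before piping it to the server. A minimal sketch, assuming only the standard tools/call message shape shown in the echo command (the helper name is an invention for illustration):

```python
import json

# Build the same JSON-RPC 2.0 tools/call message used in the stdio test above.
def tools_call(name: str, arguments: dict, req_id: int = 1) -> str:
    msg = {
        "jsonrpc": "2.0",
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
        "id": req_id,
    }
    return json.dumps(msg)

# One line of stdin for the MCP server, e.g. `print(line)` piped to scrapegraph-mcp.
line = tools_call("markdownify", {"website_url": "https://scrapegraphai.com"})
```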
@@ -174,13 +175,14 @@ echo '{"jsonrpc":"2.0","method":"tools/list","id":1}' | docker run -i -e SGAI_AP

 Quick reference to all MCP tools:

-| Tool | Parameters | Purpose | Credits | Async |
-|------|------------|---------|---------|-------|
-| `markdownify` | `website_url` | Convert webpage to markdown | 2 | No |
-| `smartscraper` | `user_prompt`, `website_url`, `number_of_scrolls?`, `markdown_only?` | AI-powered data extraction | 10+ | No |
-| `searchscraper` | `user_prompt`, `num_results?`, `number_of_scrolls?`, `time_range?` | AI-powered web search | Variable | No |
-| `smartcrawler_initiate` | `url`, `prompt?`, `extraction_mode`, `depth?`, `max_pages?`, `same_domain_only?` | Start multi-page crawl | 100+ | Yes (returns request_id) |
-| `smartcrawler_fetch_results` | `request_id` | Get crawl results | N/A | No (polls status) |
+| Tool | Notes |
+|------|-------|
+| `markdownify` / `scrape` | POST /scrape (v2) |
+| `smartscraper` | POST /extract; URL only |
+| `searchscraper` | POST /search; num_results 3–20 |
+| `smartcrawler_*`, `crawl_stop`, `crawl_resume` | POST/GET /crawl |
+| `credits`, `sgai_history` | GET /credits, /history |
+| `monitor_*` | /monitor namespace |

 For detailed tool documentation, see [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools).

@@ -376,8 +378,11 @@ npx @modelcontextprotocol/inspector scrapegraph-mcp

 ## 📅 Changelog

+### April 2026
+- ✅ Migrated MCP client and tools to **API v2** ([scrapegraph-py#82](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/82)): base `https://api.scrapegraphai.com/api/v2`, Bearer + SGAI-APIKEY, new crawl/monitor/credits/history tools; removed sitemap, agentic_scrapper, status polling tools.
+
 ### January 2026
-- ✅ Added `time_range` parameter to SearchScraper for filtering results by recency
+- ✅ Added `time_range` parameter to SearchScraper for filtering results by recency (v1-era; **ignored on API v2**)
 - ✅ Supported time ranges: `past_hour`, `past_24_hours`, `past_week`, `past_month`, `past_year`
 - ✅ Documentation updated to reflect SDK changes (scrapegraph-py#77, scrapegraph-js#2)

.agent/system/project_architecture.md

Lines changed: 35 additions & 24 deletions
@@ -1,7 +1,7 @@
 # ScrapeGraph MCP Server - Project Architecture

-**Last Updated:** January 2026
-**Version:** 1.0.0
+**Last Updated:** April 2026
+**Version:** 2.0.0

 ## Table of Contents
 - [System Overview](#system-overview)
@@ -19,11 +19,12 @@

 The ScrapeGraph MCP Server is a production-ready [Model Context Protocol](https://modelcontextprotocol.io/introduction) (MCP) server that provides seamless integration between AI assistants (like Claude, Cursor, etc.) and the [ScrapeGraphAI API](https://scrapegraphai.com). This server enables language models to leverage advanced AI-powered web scraping capabilities with enterprise-grade reliability.

-**Key Capabilities:**
-- **Markdownify** - Convert webpages to clean, structured markdown
-- **SmartScraper** - AI-powered structured data extraction from webpages
-- **SearchScraper** - AI-powered web searches with structured results
-- **SmartCrawler** - Intelligent multi-page web crawling with AI extraction or markdown conversion
+**Key Capabilities (API v2):**
+- **Scrape** (`markdownify`, `scrape`) — POST `/api/v2/scrape`
+- **Extract** (`smartscraper`) — POST `/api/v2/extract` (URL-only)
+- **Search** (`searchscraper`) — POST `/api/v2/search`
+- **Crawl** — POST/GET `/api/v2/crawl` (+ stop/resume); markdown/html crawl only
+- **Monitor, credits, history** — `/api/v2/monitor`, `/credits`, `/history`

 **Purpose:**
 - Bridge AI assistants (Claude, Cursor, etc.) with web scraping capabilities
@@ -129,7 +130,7 @@ AI Assistant (Claude/Cursor)
 ↓ (stdio via MCP)
 FastMCP Server (this project)
 ↓ (HTTPS API calls)
-ScrapeGraphAI API (https://api.scrapegraphai.com/v1)
+ScrapeGraphAI API (default https://api.scrapegraphai.com/api/v2)
 ↓ (web scraping)
 Target Websites
 ```
@@ -139,10 +140,10 @@ Target Websites
 The server follows a simple, single-file architecture:

 **`ScapeGraphClient` Class:**
-- HTTP client wrapper for ScrapeGraphAI API
-- Base URL: `https://api.scrapegraphai.com/v1`
-- API key authentication via `SGAI-APIKEY` header
-- Methods: `markdownify()`, `smartscraper()`, `searchscraper()`, `smartcrawler_initiate()`, `smartcrawler_fetch_results()`
+- HTTP client wrapper for ScrapeGraphAI API v2 ([scrapegraph-py#82](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/82))
+- Base URL: `https://api.scrapegraphai.com/api/v2` (override with env `SCRAPEGRAPH_API_BASE_URL`)
+- Auth: `Authorization: Bearer`, `SGAI-APIKEY`, `X-SDK-Version: scrapegraph-mcp@2.0.0`
+- v2 methods include `scrape_v2`, `extract`, `search_api`, `crawl_*`, `monitor_*`, `credits`, `history`, plus compatibility wrappers used by MCP tools

 **FastMCP Server:**
 - Created with `FastMCP("ScapeGraph API MCP Server")`
@@ -185,7 +186,9 @@ The server follows a simple, single-file architecture:

 ## MCP Tools

-The server exposes 5 tools to AI assistants:
+The server exposes many `@mcp.tool()` handlers (see repository `README.md` for the full table). The detailed subsections below still use **v1-style endpoint names** in several places; treat them as illustrative and prefer the v2 mapping in **API Integration**.
+
+**v2 tool names:** `markdownify`, `scrape`, `smartscraper`, `searchscraper`, `smartcrawler_initiate`, `smartcrawler_fetch_results`, `crawl_stop`, `crawl_resume`, `credits`, `sgai_history`, `monitor_create`, `monitor_list`, `monitor_get`, `monitor_pause`, `monitor_resume`, `monitor_delete`.

 ### 1. `markdownify(website_url: str)`

@@ -388,21 +391,29 @@ If status is "completed":

 ### ScrapeGraphAI API

-**Base URL:** `https://api.scrapegraphai.com/v1`
+**Base URL:** `https://api.scrapegraphai.com/api/v2` (configurable via `SCRAPEGRAPH_API_BASE_URL`)

 **Authentication:**
-- Header: `SGAI-APIKEY: your-api-key`
+- Headers: `Authorization: Bearer <key>`, `SGAI-APIKEY: <key>`
 - Obtain API key from: [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)

-**Endpoints Used:**
-
-| Endpoint | Method | Tool |
-|----------|--------|------|
-| `/v1/markdownify` | POST | `markdownify()` |
-| `/v1/smartscraper` | POST | `smartscraper()` |
-| `/v1/searchscraper` | POST | `searchscraper()` |
-| `/v1/crawl` | POST | `smartcrawler_initiate()` |
-| `/v1/crawl/{request_id}` | GET | `smartcrawler_fetch_results()` |
+**Endpoints used (v2):**
+
+| Endpoint | Method | MCP tools (typical) |
+|----------|--------|---------------------|
+| `/scrape` | POST | `markdownify`, `scrape` |
+| `/extract` | POST | `smartscraper` |
+| `/search` | POST | `searchscraper` |
+| `/crawl` | POST | `smartcrawler_initiate` |
+| `/crawl/{id}` | GET | `smartcrawler_fetch_results` |
+| `/crawl/{id}/stop` | POST | `crawl_stop` |
+| `/crawl/{id}/resume` | POST | `crawl_resume` |
+| `/credits` | GET | `credits` |
+| `/history` | GET | `sgai_history` |
+| `/monitor` | POST, GET | `monitor_create`, `monitor_list` |
+| `/monitor/{id}` | GET, DELETE | `monitor_get`, `monitor_delete` |
+| `/monitor/{id}/pause` | POST | `monitor_pause` |
+| `/monitor/{id}/resume` | POST | `monitor_resume` |

 **Request Format:**
 ```json
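The endpoint table in this hunk can be mirrored as a plain lookup structure, which is handy when reviewing the diff. A minimal sketch, assuming only what the table itself states; the dict and helper names are inventions, not the `ScapeGraphClient` implementation:

```python
# Tool → (HTTP method, v2 path) mapping, transcribed from the table above.
V2_ENDPOINTS = {
    "markdownify": ("POST", "/scrape"),
    "smartscraper": ("POST", "/extract"),
    "searchscraper": ("POST", "/search"),
    "smartcrawler_initiate": ("POST", "/crawl"),
    "smartcrawler_fetch_results": ("GET", "/crawl/{id}"),
    "crawl_stop": ("POST", "/crawl/{id}/stop"),
    "crawl_resume": ("POST", "/crawl/{id}/resume"),
    "credits": ("GET", "/credits"),
    "sgai_history": ("GET", "/history"),
}

def request_for(tool: str,
                base: str = "https://api.scrapegraphai.com/api/v2",
                **path_params) -> tuple:
    """Resolve a tool name to (method, full URL), filling in path params like {id}."""
    method, path = V2_ENDPOINTS[tool]
    return method, base + path.format(**path_params)
```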

README.md

Lines changed: 37 additions & 128 deletions
@@ -26,18 +26,21 @@ A production-ready [Model Context Protocol](https://modelcontextprotocol.io/intr
 - [Technology Stack](#technology-stack)
 - [License](#license)

+## API v2
+
+This MCP server targets **ScrapeGraph API v2** (`https://api.scrapegraphai.com/api/v2`), aligned with
+[scrapegraph-py PR #82](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/82). Auth sends both
+`Authorization: Bearer` and `SGAI-APIKEY`. Override the base URL with **`SCRAPEGRAPH_API_BASE_URL`** if needed.
+
 ## Key Features

-- **8 Powerful Tools**: From simple markdown conversion to complex multi-page crawling and agentic workflows
-- **AI-Powered Extraction**: Intelligently extract structured data using natural language prompts
-- **Multi-Page Crawling**: SmartCrawler supports asynchronous crawling with configurable depth and page limits
-- **Infinite Scroll Support**: Handle dynamic content loading with configurable scroll counts
-- **JavaScript Rendering**: Full support for JavaScript-heavy websites
-- **Flexible Output Formats**: Get results as markdown, structured JSON, or custom schemas
-- **Easy Integration**: Works seamlessly with Claude Desktop, Cursor, and any MCP-compatible client
-- **Enterprise-Ready**: Robust error handling, timeout management, and production-tested reliability
-- **Simple Deployment**: One-command installation via Smithery or manual setup
-- **Comprehensive Documentation**: Detailed developer docs in `.agent/` folder
+- **Scrape & extract**: `markdownify` / `scrape` (POST /scrape), `smartscraper` (POST /extract, URL only)
+- **Search**: `searchscraper` (POST /search; `num_results` clamped 3–20)
+- **Crawl**: Async multi-page crawl in **markdown** or **html** only; `crawl_stop` / `crawl_resume`
+- **Monitors**: Scheduled jobs via `monitor_create`, `monitor_list`, `monitor_get`, pause/resume/delete
+- **Account**: `credits`, `sgai_history`
+- **Easy integration**: Claude Desktop, Cursor, Smithery, HTTP transport
+- **Developer docs**: `.agent/` folder

 ## Quick Start

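The `SCRAPEGRAPH_API_BASE_URL` override introduced in this hunk can be sketched as a small resolver. The function name and the trailing-slash normalization are assumptions for illustration, not the server's confirmed behavior; the env-var name and default come from the diff:

```python
import os

# Default v2 base URL, overridable via SCRAPEGRAPH_API_BASE_URL (per the README hunk above).
DEFAULT_BASE = "https://api.scrapegraphai.com/api/v2"

def resolve_base_url(env=None) -> str:
    """Return the API base URL, preferring the env override when set."""
    env = os.environ if env is None else env
    # Trailing-slash stripping is an assumption to keep path joining predictable.
    return env.get("SCRAPEGRAPH_API_BASE_URL", DEFAULT_BASE).rstrip("/")
```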
@@ -62,112 +65,20 @@ That's it! The server is now available to your AI assistant.

 ## Available Tools

-The server provides **8 enterprise-ready tools** for AI-powered web scraping:
-
-### Core Scraping Tools
-
-#### 1. `markdownify`
-Transform any webpage into clean, structured markdown format.
-
-```python
-markdownify(website_url: str)
-```
-- **Credits**: 2 per request
-- **Use case**: Quick webpage content extraction in markdown
-
-#### 2. `smartscraper`
-Leverage AI to extract structured data from any webpage with support for infinite scrolling.
-
-```python
-smartscraper(
-    user_prompt: str,
-    website_url: str,
-    number_of_scrolls: int = None,
-    markdown_only: bool = None
-)
-```
-- **Credits**: 10+ (base) + variable based on scrolling
-- **Use case**: AI-powered data extraction with custom prompts
-
-#### 3. `searchscraper`
-Execute AI-powered web searches with structured, actionable results.
-
-```python
-searchscraper(
-    user_prompt: str,
-    num_results: int = None,
-    number_of_scrolls: int = None,
-    time_range: str = None  # Filter by: past_hour, past_24_hours, past_week, past_month, past_year
-)
-```
-- **Credits**: Variable (3-20 websites × 10 credits)
-- **Use case**: Multi-source research and data aggregation
-- **Time filtering**: Use `time_range` to filter results by recency (e.g., `"past_week"` for recent results)
-
-### Advanced Scraping Tools
-
-#### 4. `scrape`
-Basic scraping endpoint to fetch page content with optional heavy JavaScript rendering.
-
-```python
-scrape(website_url: str, render_heavy_js: bool = None)
-```
-- **Use case**: Simple page content fetching with JS rendering support
-
-#### 5. `sitemap`
-Extract sitemap URLs and structure for any website.
-
-```python
-sitemap(website_url: str)
-```
-- **Use case**: Website structure analysis and URL discovery
-
-### Multi-Page Crawling
-
-#### 6. `smartcrawler_initiate`
-Initiate intelligent multi-page web crawling (asynchronous operation).
-
-```python
-smartcrawler_initiate(
-    url: str,
-    prompt: str = None,
-    extraction_mode: str = "ai",
-    depth: int = None,
-    max_pages: int = None,
-    same_domain_only: bool = None
-)
-```
-- **AI Extraction Mode**: 10 credits per page - extracts structured data
-- **Markdown Mode**: 2 credits per page - converts to markdown
-- **Returns**: `request_id` for polling
-- **Use case**: Large-scale website crawling and data extraction
-
-#### 7. `smartcrawler_fetch_results`
-Retrieve results from asynchronous crawling operations.
-
-```python
-smartcrawler_fetch_results(request_id: str)
-```
-- **Returns**: Status and results when crawling is complete
-- **Use case**: Poll for crawl completion and retrieve results
-
-### Intelligent Agent-Based Scraping
-
-#### 8. `agentic_scrapper`
-Run advanced agentic scraping workflows with customizable steps and structured output schemas.
-
-```python
-agentic_scrapper(
-    url: str,
-    user_prompt: str = None,
-    output_schema: dict = None,
-    steps: list = None,
-    ai_extraction: bool = None,
-    persistent_session: bool = None,
-    timeout_seconds: float = None
-)
-```
-- **Use case**: Complex multi-step workflows with custom schemas and persistent sessions
+| Tool | Role |
+|------|------|
+| `markdownify` | POST /scrape (markdown) |
+| `scrape` | POST /scrape (`output_format`: markdown, html, screenshot, branding) |
+| `smartscraper` | POST /extract (requires `website_url`; no inline HTML/markdown body on v2) |
+| `searchscraper` | POST /search (`num_results` 3–20; `time_range` / `number_of_scrolls` ignored on v2) |
+| `smartcrawler_initiate` | POST /crawl — `extraction_mode` **`markdown`** or **`html`** (default markdown). No AI crawl across pages. |
+| `smartcrawler_fetch_results` | GET /crawl/:id |
+| `crawl_stop`, `crawl_resume` | POST /crawl/:id/stop \| resume |
+| `credits` | GET /credits |
+| `sgai_history` | GET /history |
+| `monitor_create`, `monitor_list`, `monitor_get`, `monitor_pause`, `monitor_resume`, `monitor_delete` | /monitor API |
+
+**Removed vs older MCP releases:** `sitemap`, `agentic_scrapper`, `markdownify_status`, `smartscraper_status` (no v2 endpoints).

 ## Setup Instructions

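The asynchronous initiate-then-poll flow kept by `smartcrawler_initiate` / `smartcrawler_fetch_results` can be sketched as below. The fetcher here is a stub, and the status values ("processing", "completed") are assumptions for illustration; a real client would sleep between polls and call the MCP tool instead of the stub:

```python
import itertools

def poll_crawl(fetch, request_id: str, max_polls: int = 10) -> dict:
    """Poll a crawl job until its status is 'completed' (a real client would
    time.sleep() between iterations)."""
    for _ in range(max_polls):
        result = fetch(request_id)
        if result.get("status") == "completed":
            return result
    raise TimeoutError(f"crawl {request_id} not finished after {max_polls} polls")

# Stub standing in for smartcrawler_fetch_results: completes on the third poll.
_states = itertools.chain(["processing", "processing"], itertools.repeat("completed"))

def fake_fetch(request_id: str) -> dict:
    return {"status": next(_states), "request_id": request_id}
```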
@@ -482,7 +393,7 @@ root_agent = LlmAgent(
 - Adjust based on your use case (crawling operations may need even longer timeouts)

 **Tool Filtering:**
-- By default, all 8 tools are exposed to the agent
+- By default, all registered MCP tools are exposed to the agent (see [Available Tools](#available-tools))
 - Use `tool_filter` to limit which tools are available:
 ```python
 tool_filter=['markdownify', 'smartscraper', 'searchscraper']
@@ -520,20 +431,18 @@ The server enables sophisticated queries across various scraping scenarios:
 - **SearchScraper**: "Research and summarize recent developments in AI-powered web scraping"
 - **SearchScraper**: "Search for the top 5 articles about machine learning frameworks and extract key insights"
 - **SearchScraper**: "Find recent news about GPT-4 and provide a structured summary"
-- **SearchScraper with time_range**: "Search for AI news from the past week only" (uses `time_range="past_week"`)
+- **SearchScraper**: v2 does not apply `time_range`; phrase queries to bias recency in natural language instead

-### Website Analysis
-- **Sitemap**: "Extract the complete sitemap structure from the ScrapeGraph website"
-- **Sitemap**: "Discover all URLs on this blog site"
+### Website analysis
+- Use **`smartcrawler_initiate`** (markdown/html) plus **`smartcrawler_fetch_results`** to map and capture multi-page content; there is no separate **sitemap** tool on v2.

-### Multi-Page Crawling
-- **SmartCrawler (AI mode)**: "Crawl the entire documentation site and extract all API endpoints with descriptions"
-- **SmartCrawler (Markdown mode)**: "Convert all pages in the blog to markdown up to 2 levels deep"
-- **SmartCrawler**: "Extract all product information from an e-commerce site, maximum 100 pages, same domain only"
+### Multi-page crawling
+- **SmartCrawler (markdown/html)**: "Crawl the blog in markdown mode and poll until complete"
+- For structured fields per page, run **`smartscraper`** on individual URLs (or **`monitor_create`** on a schedule)

-### Advanced Agentic Scraping
-- **Agentic Scraper**: "Navigate through a multi-step authentication form and extract user dashboard data"
-- **Agentic Scraper with schema**: "Follow pagination links and compile a dataset with schema: {title, author, date, content}"
+### Monitors and account
+- **Monitor**: "Run this extract prompt on https://example.com every day at 9am" (`monitor_create` with cron)
+- **Credits / history**: `credits`, `sgai_history`
 - **Agentic Scraper**: "Execute a complex workflow: login, navigate to reports, download data, and extract summary statistics"

 ## Error Handling

pyproject.toml

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "scrapegraph-mcp"
-version = "1.0.1"
+version = "2.0.0"
 description = "MCP server for ScapeGraph API integration"
 license = {text = "MIT"}
 readme = "README.md"
@@ -52,6 +52,10 @@ line-length = 100
 target-version = "py312"
 select = ["E", "F", "I", "B", "W"]

+[tool.ruff.lint.per-file-ignores]
+# MCP tool docstrings and embedded guides exceed 100 cols by design.
+"src/scrapegraph_mcp/server.py" = ["E501"]
+
 [tool.mypy]
 python_version = "3.12"
 warn_return_any = true
