scrape_url - Web Crawling with Search
Core Concept
mcp__plugin_kg_kodegen__scrape_url crawls websites, saves content as Markdown, and builds a Tantivy full-text search index. Uses an action-based interface with connection isolation and background execution support.
Actions
| Action | Description | Required Parameters |
|---|---|---|
SEARCH | Search with auto-crawl (RECOMMENDED) | url, query |
CRAWL | Explicit crawl | url |
READ | Check crawl progress | None |
LIST | Show all active crawls | None |
KILL | Cancel crawl | None |
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
action | string | "CRAWL" | Action to perform |
url | string | null | Target URL (required for CRAWL/SEARCH) |
crawl_id | number | 0 | Crawl instance (0, 1, 2...) |
query | string | null | Search query (SEARCH action) |
max_depth | number | 3 | Maximum crawl depth |
limit | number | null | Max pages to crawl |
await_completion_ms | number | 600000 | Timeout (10 min default) |
crawl_rate_rps | number | 2 | Requests per second |
search_limit | number | 10 | Max search results |
search_offset | number | 0 | Search pagination offset |
search_highlight | boolean | true | Highlight matches |
Usage Examples
One-Step Search (Recommended)
Auto-crawls if index doesn't exist:
{
"action": "SEARCH",
"url": "https://ratatui.rs",
"crawl_id": 0,
"query": "layout widgets"
}
Explicit Crawl
{
"action": "CRAWL",
"crawl_id": 0,
"url": "https://docs.rs/tokio"
}
Crawl with Limits
{
"action": "CRAWL",
"url": "https://example.com/docs",
"max_depth": 2,
"limit": 50,
"crawl_rate_rps": 1
}
Check Progress
{
"action": "READ",
"crawl_id": 0
}
List Active Crawls
{ "action": "LIST" }
Cancel Crawl
{
"action": "KILL",
"crawl_id": 0
}
Search Query Syntax
Tantivy supports advanced queries:
| Query Type | Example | Description |
|---|---|---|
| Text | layout components | Search all fields |
| Phrase | "exact phrase" | Exact match |
| Boolean | layout AND widgets | Logical operators |
| Field | title:layout | Search specific field |
| Fuzzy | layot~2 | Allow 2 character differences |
Output Directory Structure
Content saved to .kodegen/citescrape/{domain}/:
.kodegen/citescrape/ratatui.rs/
├── manifest.json # Crawl metadata
├── .search_index/ # Tantivy search index
├── index.md # Homepage
├── tutorials/
│ └── hello-world.md
└── api/
└── widgets.md
Workflows
Research Documentation
SEARCHwith url and query (auto-crawls if needed)- Review results
- Follow up with more specific queries
Full Site Crawl
CRAWLwith url, max_depth, limit- Monitor with
READ - Search with
SEARCHaction
Remember
- SEARCH with url auto-crawls if index missing - simplest approach
- Crawls are isolated by
crawl_id- use different numbers for parallel crawls - Rate limiting default is 2 req/sec - be respectful of servers
- Content saved as Markdown for easy reading
- Search index enables fast full-text queries
- Use
READto check on background crawls - Timeout returns partial results - crawl continues in background
