web-scraper

Web scraper with SPA/JavaScript rendering, page interaction, and JS execution. It uses a two-tier engine (HTTP → Playwright browser) and supports smart discovery, batch fetch, interactive content extraction, and OpenAPI parsing. Use it when read_url_content fails, when SPA rendering is needed, or when page interaction is required.

Updated 3/14/2026

Web Scraper

Usage: ./scrape <command> [options]. Run ./scrape help for the full option reference.

Content Precision Levels

The fetch command outputs filtered content by default. Three precision levels:

| Level | Flag | Behavior | When to use |
| --- | --- | --- | --- |
| Default | (none) | PruningContentFilter removes boilerplate | Most scenarios; good enough |
| Precision | --selector ".css" | Extract only the matched container | Production-quality docs, zero noise |
| Raw | --raw | Full page, no filtering | Debugging, page structure analysis |

Precision workflow: fetch a sample page → inspect remaining noise → identify content container via CSS selector → re-fetch all pages with --selector.
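
The precision workflow can be scripted. The following is a minimal sketch, not part of the tool: build_fetch_args is a hypothetical helper that only assembles the documented flags (--selector, --raw) into an argument list, and it assumes the two flags are not meant to be combined.

```python
from typing import List, Optional


def build_fetch_args(url: str, selector: Optional[str] = None,
                     raw: bool = False) -> List[str]:
    """Build the argv for one `./scrape fetch` call (hypothetical helper)."""
    if selector and raw:
        # Assumption: Precision and Raw are treated as mutually exclusive here.
        raise ValueError("choose either --selector or --raw, not both")
    args = ["./scrape", "fetch", url]
    if selector:
        args += ["--selector", selector]  # Precision level
    elif raw:
        args.append("--raw")              # Raw level
    return args                           # no extra flag: Default level


# Step 1: fetch one sample page at the Default level and inspect the noise.
sample_cmd = build_fetch_args("https://docs.example.com/api/create")

# Step 2: once a container selector is known, re-fetch every page with it.
pages = ["https://docs.example.com/api/create",
         "https://docs.example.com/api/delete"]
precise_cmds = [build_fetch_args(u, selector=".ep-doc-area") for u in pages]
```

Each list can be passed to subprocess.run(cmd, check=True) from the directory that contains the scrape script.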

Engine Architecture

L0: Pure HTTP (httpx + selectolax + markdownify)
    fetch → strip boilerplate → markdownify → clean content

L1: Browser (crawl4ai + Playwright, networkidle wait)
    fetch → [JS interaction] → PruningContentFilter → fit_markdown → clean content
    Handles: SPAs, anti-bot, JS-rendered content, folded/lazy content

Auto mode: L0 probe → detect SPA/empty → fallback to L1

Smart Discovery (3-tier):
    1. sitemap.xml → 2. Nav DOM extraction → 3. Browser deep crawl
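
The auto-mode fallback above can be illustrated with a short sketch. This is not the tool's actual detection logic, which the document does not describe: the visible-text heuristic, the MIN_TEXT_CHARS threshold, and the function names are all assumptions, and the two fetchers are passed in as plain callables so the sketch stays independent of httpx and Playwright.

```python
import re
from typing import Callable, Tuple

MIN_TEXT_CHARS = 200  # hypothetical threshold for "page looks empty"


def visible_text_length(html: str) -> int:
    """Rough count of visible characters: drop script/style blocks, then tags."""
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    return len(" ".join(no_tags.split()))


def needs_browser(html: str) -> bool:
    """Assumed heuristic: very little visible text suggests a JS-rendered shell."""
    return visible_text_length(html) < MIN_TEXT_CHARS


def fetch_auto(url: str,
               fetch_http: Callable[[str], str],
               fetch_browser: Callable[[str], str]) -> Tuple[str, str]:
    """Probe with the cheap L0 fetcher; fall back to L1 if the page looks empty."""
    html = fetch_http(url)
    if needs_browser(html):
        return "L1", fetch_browser(url)
    return "L0", html
```

Keeping the probe on the HTTP tier means static pages never pay the cost of starting a browser; only pages that fail the heuristic are fetched a second time.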

Workflow Patterns

Single page fetch

# Auto-detects if browser needed
./scrape fetch https://docs.example.com/api/create

# Force browser for known SPA sites
./scrape fetch https://open.feishu.cn/document/... --engine cdp

# Precision extraction with CSS selector
./scrape fetch https://developer.work.weixin.qq.com/document/path/90196 --engine cdp --selector ".ep-doc-area"

Interactive content extraction (SPA/folded content)

# Auto-expand all collapsed/folded sections (universal preset)
./scrape fetch https://open.feishu.cn/document/... --expand-all --wait 5000

# Scroll through entire page to trigger lazy loading
./scrape fetch https://example.com/infinite-scroll --scroll-full

# Custom JavaScript before extraction
./scrape fetch URL --js "document.querySelectorAll('.expand-btn').forEach(e => e.click())"

# JS from file (for complex interactions)
./scrape fetch URL --js-file /path/to/interact.js --wait 5000

# Wait for specific element before extraction
./scrape fetch URL --wait-for ".api-response-table"

# Combine: expand + custom JS + wait
./scrape fetch URL --expand-all --js "extraAction()" --wait-for ".loaded" -o /tmp/docs/

Execute JavaScript on a page

# Execute JS and return metadata (no markdown conversion)
./scrape exec URL --js "return document.title"

# Extract data from page's JavaScript state
./scrape exec URL --js "return JSON.stringify(window.__NEXT_DATA__)"

# Execute JS then fetch page as markdown
./scrape exec URL --js "expandAll()" --then-fetch -o /tmp/result.md

Batch-download documentation

./scrape fetch https://docs.example.com --auto -o /tmp/docs/
./scrape fetch --from-file urls.txt -o /tmp/docs/
./scrape fetch --from-file urls.txt --merge -o /tmp/docs/all.md

Files are automatically named by page title (e.g., 读取成员.md, Getting_Started.md).
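
The exact naming rules are not documented here. As a rough sketch of how a title could map to a file name consistent with the two examples above, the hypothetical title_to_filename below removes characters that are unsafe in file names and replaces whitespace with underscores; the real tool may behave differently.

```python
import re


def title_to_filename(title: str, ext: str = ".md") -> str:
    """Turn a page title into a file name (hypothetical rules)."""
    # Drop characters that are invalid in file names on common filesystems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "", title).strip()
    # Collapse each run of whitespace into a single underscore.
    cleaned = re.sub(r"\s+", "_", cleaned)
    # Fall back to a fixed name when nothing usable is left.
    return (cleaned or "untitled") + ext


print(title_to_filename("Getting Started"))
print(title_to_filename("读取成员"))
```

Non-ASCII titles are kept as-is, which matches the 读取成员.md example.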

Build organized local documentation

  1. Discover site structure: ./scrape discover <url> --json → get URLs + titles
  2. Filter relevant URLs (Agent selects subset based on user's needs)
  3. Write filtered URLs to a file
  4. Fetch to output directory: ./scrape fetch --from-file urls.txt -o /path/to/docs/
  5. Organize (Agent moves/renames files into logical folder structure)
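
Steps 2 and 3 above can be sketched in a few lines. The JSON shape is an assumption: the document only says that discover --json yields URLs and titles, so this sketch assumes a list of objects with "url" and "title" keys. select_urls and write_url_file are hypothetical helper names.

```python
import json
from typing import List


def select_urls(discover_json: str, keyword: str) -> List[str]:
    """Keep pages whose title or URL contains the keyword (case-insensitive).

    Assumes the JSON is a list of {"url": ..., "title": ...} objects.
    """
    pages = json.loads(discover_json)
    needle = keyword.lower()
    return [
        page["url"]
        for page in pages
        if needle in page.get("title", "").lower() or needle in page["url"].lower()
    ]


def write_url_file(urls: List[str], path: str) -> None:
    """Write one URL per line, the format expected by --from-file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(urls) + "\n")


sample = json.dumps([
    {"url": "https://docs.example.com/auth/login", "title": "Login"},
    {"url": "https://docs.example.com/billing", "title": "Billing"},
    {"url": "https://docs.example.com/tokens", "title": "Auth tokens"},
])
selected = select_urls(sample, "auth")
```

Calling write_url_file(selected, "urls.txt") then produces the file used in step 4.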

Analyze site structure

./scrape discover https://docs.example.com
./scrape discover https://docs.example.com --engine cdp --deep --max-pages 100
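
As an illustration of the three-tier discovery order listed under Engine Architecture, the sketch below parses a sitemap (tier 1) with the standard library and tries tiers in sequence until one returns URLs. The function names and the "first non-empty tier wins" rule are assumptions; the document does not state how the tool decides to move to the next tier.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Sequence, Tuple

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text: str) -> List[str]:
    """Tier 1: collect every <loc> entry from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


Tier = Tuple[str, Callable[[str], List[str]]]


def first_nonempty_tier(url: str, tiers: Sequence[Tier]) -> Tuple[str, List[str]]:
    """Try each tier in order; return the first one that finds any URLs."""
    for name, discover in tiers:
        found = discover(url)
        if found:
            return name, found
    return "none", []
```

In a full pipeline the tiers would be ordered as sitemap, nav DOM extraction, then browser deep crawl, so the cheapest source is always consulted first.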

Extract OpenAPI specs

./scrape openapi https://api.example.com/v3/api-docs -o /tmp/api.md
./scrape openapi https://api.example.com/swagger-ui/ -o /tmp/api.md
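
The output format of the openapi command is not shown in this document. To give a sense of what turning a spec into markdown involves, here is a minimal sketch that lists each operation in an OpenAPI JSON document. openapi_to_markdown is a hypothetical function and its output layout is an assumption, not the tool's format.

```python
import json

# Operation keys allowed on an OpenAPI path item.
HTTP_METHODS = ("get", "put", "post", "delete", "options", "head", "patch", "trace")


def openapi_to_markdown(spec_json: str) -> str:
    """Render a one-line-per-operation markdown summary of an OpenAPI spec."""
    spec = json.loads(spec_json)
    title = spec.get("info", {}).get("title", "API")
    lines = [f"# {title}", ""]
    for path, item in sorted(spec.get("paths", {}).items()):
        for method in HTTP_METHODS:
            operation = item.get(method)
            if operation is None:
                continue  # this path does not define that method
            line = f"- {method.upper()} {path}"
            summary = operation.get("summary", "")
            if summary:
                line += f": {summary}"
            lines.append(line)
    return "\n".join(lines)
```

A real converter would also need to resolve $ref pointers and render parameters and schemas, which this sketch leaves out.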

Key Options

| Option | Commands | Description |
| --- | --- | --- |
| --engine MODE | discover, fetch | auto (default), http (L0 only), cdp (L1 only) |
| --selector CSS | fetch, exec | CSS selector for content area (precision mode) |
| --raw | fetch, exec | Output full page (skip content filtering) |
| --js CODE | fetch, exec | JavaScript to execute before extraction |
| --js-file PATH | fetch, exec | JavaScript file to execute before extraction |
| --wait-for CSS | fetch | Wait for CSS selector to appear before extraction |
| --expand-all | fetch | Auto-expand collapsed/folded sections (universal preset) |
| --scroll-full | fetch | Scroll entire page to trigger lazy loading |
| --then-fetch | exec | After JS execution, also fetch page as markdown |
| --session ID | exec | Reuse browser session for multi-step interactions |
| --js-only | exec | Skip navigation, execute JS on existing session |
| --auto | fetch | Auto-discover pages then fetch all |
| --from-file FILE | fetch | Read URLs from file (one per line) |
| -o PATH | fetch, exec, openapi | Output directory or file path |
| --merge | fetch | Merge all pages into single markdown |
| --json | discover, fetch | Machine-readable JSON output |
| --summary-only | fetch | Only show URL, title, char count |
| --max-pages N | discover, fetch | Limit pages (discover: 200, fetch: 50) |
| --deep | discover | Follow links for browser-based BFS crawl |

Requirements

  • uv: curl -LsSf https://astral.sh/uv/install.sh | sh
  • Dependencies and Playwright Chromium are installed automatically on first run

Install

Requires askill CLI v1.0+

Metadata

License: unknown
Version: -
Publisher: northseadl

Tags

api, ci-cd, github-actions