askill
web-scraper

Safety: 85

Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.

227 stars
4.5k downloads
Updated 3/20/2026

Package Files

SKILL.md

Web Scraper

Fetch, search, and extract content from websites.

When to use this skill

  • User asks to fetch or read a webpage / URL
  • User wants to search the internet for information
  • User needs to extract links, tables, or structured data from a website
  • User asks to crawl a JavaScript-rendered (dynamic) page
  • User wants web content converted to clean Markdown for analysis

Scripts overview

| Script | Purpose | Dependencies |
|---|---|---|
| fetch_page.py | Fetch a URL and extract readable content as Markdown | requests, beautifulsoup4, readability-lxml, html2text |
| search_web.py | Search the web via DuckDuckGo | ddgs |
| crawl_dynamic.py | Crawl JS-rendered pages with a headless browser | crawl4ai |
| extract_links.py | Extract and categorize all links from a page | requests, beautifulsoup4 |

Steps

1. Install dependencies (first time only)

For lightweight scraping (static pages, search, link extraction):

pip install requests beautifulsoup4 readability-lxml html2text ddgs

For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):

pip install crawl4ai
crawl4ai-setup

Note: crawl4ai-setup downloads a Chromium browser (~150 MB). Only install if you actually need dynamic page support.

CRITICAL — Dependency Error Recovery: If ANY script below fails with an ImportError or "module not found" error, install the missing dependencies using the command above, then re-run the EXACT SAME script command that failed. Do NOT write inline Python code (python -c "...") or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.
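The recovery rule above can be sketched as a small wrapper. This is only an illustration under stated assumptions: the function name `run_with_dependency_recovery`, the marker strings used to spot a missing module, and the injectable `runner` parameter are hypothetical, and `sys.executable -m pip` stands in for the plain `pip install` command from this step.

```python
import subprocess
import sys

# Lightweight dependency set from step 1. Using "sys.executable -m pip" instead
# of bare "pip" is an assumption that keeps the install in the same interpreter.
INSTALL_CMD = [
    sys.executable, "-m", "pip", "install",
    "requests", "beautifulsoup4", "readability-lxml", "html2text", "ddgs",
]

# Hypothetical markers for a missing-module failure in the script's stderr.
MISSING_MODULE_MARKERS = ("ImportError", "ModuleNotFoundError", "No module named")


def run_with_dependency_recovery(cmd, install_cmd=INSTALL_CMD, runner=subprocess.run):
    """Run cmd; on a missing-module failure, install once and re-run the same cmd."""
    result = runner(cmd, capture_output=True, text=True)
    missing = result.returncode != 0 and any(
        marker in result.stderr for marker in MISSING_MODULE_MARKERS
    )
    if missing:
        runner(install_cmd, capture_output=True, text=True)
        # Re-run the exact same command rather than substituting inline code.
        result = runner(cmd, capture_output=True, text=True)
    return result
```

The command is retried unchanged, so the script's own encoding, error handling, and output formatting still apply on the second attempt.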

2. Fetch a web page (static — recommended first choice)

Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.

python scripts/fetch_page.py "URL"

Options:

  • --raw — Output full page Markdown instead of extracted article content
  • --selector "CSS_SELECTOR" — Extract only elements matching the CSS selector (e.g. ".article-body", "table", "#content")
  • --save OUTPUT_PATH — Also save output to a file
  • --max-length N — Truncate output to N characters (default: no limit)

Examples:

# Fetch an article
python scripts/fetch_page.py "https://example.com/article"

# Extract only tables
python scripts/fetch_page.py "https://example.com/data" --selector "table"

# Fetch raw full-page Markdown, limit to 5000 chars
python scripts/fetch_page.py "https://example.com" --raw --max-length 5000
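When the script is called from Python rather than typed into a shell, the options above map onto an argument list. The helper below is a minimal sketch; `build_fetch_command` is a hypothetical name, and it uses only the flags documented in this step.

```python
import sys


def build_fetch_command(url, raw=False, selector=None, save=None, max_length=None):
    """Build the argument list for scripts/fetch_page.py from the documented options."""
    cmd = [sys.executable, "scripts/fetch_page.py", url]
    if raw:
        cmd.append("--raw")
    if selector is not None:
        cmd += ["--selector", selector]
    if save is not None:
        cmd += ["--save", str(save)]
    if max_length is not None:
        cmd += ["--max-length", str(max_length)]
    return cmd
```

Passing this list to `subprocess.run` without `shell=True` avoids quoting problems with URLs that contain `&` or `?`.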

3. Search the web

Search using DuckDuckGo (no API key required).

python scripts/search_web.py "search query"

Options:

  • --max-results N — Number of results to return (default: 10)
  • --region REGION — Region code, e.g. cn-zh, us-en, jp-jp (default: wt-wt for worldwide)
  • --news — Search news instead of general web

Examples:

# General search
python scripts/search_web.py "Python web scraping best practices 2025"

# News search, Chinese region, 5 results (the query means "latest AI progress")
python scripts/search_web.py "AI 最新进展" --news --region cn-zh --max-results 5

4. Crawl a dynamic / JavaScript-rendered page

Use this only when fetch_page.py returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).

python scripts/crawl_dynamic.py "URL"

Options:

  • --wait N — Wait N seconds after page load for JS to finish (default: 3)
  • --selector "CSS_SELECTOR" — Wait for a specific element to appear before extracting
  • --scroll — Scroll to bottom of page to trigger lazy loading
  • --save OUTPUT_PATH — Also save output to a file
  • --max-length N — Truncate output to N characters

5. Extract links from a page

Extract all links with their text labels, categorized by type (internal, external, resource).

python scripts/extract_links.py "URL"

Options:

  • --filter PATTERN — Only show links matching a regex pattern (applied to URL)
  • --external-only — Only show external links
  • --json — Output as JSON instead of Markdown
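To make the three categories and the `--filter` option concrete, here is one way such a classification could work. It is a sketch, not the logic of extract_links.py, which is not shown here: `categorize_link`, `filter_links`, and the `RESOURCE_EXTENSIONS` list are all assumptions chosen for illustration.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical list of file extensions treated as "resource" links.
RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


def categorize_link(page_url, href):
    """Classify href, resolved against page_url, as internal, external, or resource."""
    absolute = urljoin(page_url, href)
    parsed = urlparse(absolute)
    if parsed.path.lower().endswith(RESOURCE_EXTENSIONS):
        return "resource"
    if parsed.netloc.lower() == urlparse(page_url).netloc.lower():
        return "internal"
    return "external"


def filter_links(urls, pattern):
    """Keep only URLs matching the regex pattern, as --filter is described to do."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]
```

Resolving each `href` with `urljoin` first means relative links such as `../about` are compared by host in the same way as absolute ones.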

Decision guide: which script to use

  1. Start with fetch_page.py — handles 90% of websites (articles, docs, blogs, wikis).
  2. If fetch_page.py returns empty/garbled content → try crawl_dynamic.py (the page likely needs JavaScript).
  3. Need to find URLs first? → Use search_web.py to discover relevant pages.
  4. Need to navigate a site structure? → Use extract_links.py to map out links, then fetch individual pages.
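The first two rules of the guide can be sketched as a static-first fetch with a dynamic fallback. The function name `fetch_with_fallback`, the injectable `runner`, and the 200-character threshold for "empty" output are assumptions. A length check cannot detect garbled text, so that case still needs a human or model to judge.

```python
import subprocess
import sys

# Hypothetical threshold: shorter output is treated as "empty or incomplete".
MIN_USEFUL_CHARS = 200


def fetch_with_fallback(url, runner=subprocess.run, min_chars=MIN_USEFUL_CHARS):
    """Try fetch_page.py first; fall back to crawl_dynamic.py if output looks empty."""
    static = runner(
        [sys.executable, "scripts/fetch_page.py", url],
        capture_output=True, text=True,
    )
    if static.returncode == 0 and len(static.stdout.strip()) >= min_chars:
        return "fetch_page.py", static.stdout
    # Static fetch failed or returned too little: the page likely needs JavaScript.
    dynamic = runner(
        [sys.executable, "scripts/crawl_dynamic.py", url],
        capture_output=True, text=True,
    )
    return "crawl_dynamic.py", dynamic.stdout
```

The function returns the name of the script that produced the content, so the caller can report which path was taken.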

Common workflows

Research a topic

  1. search_web.py "topic" → get relevant URLs
  2. fetch_page.py "best_url" → read the content
  3. Repeat for multiple sources, then synthesize

Scrape structured data from a page

  1. fetch_page.py "url" --selector "table" → extract tables
  2. Or fetch_page.py "url" --selector ".product-card" → extract specific elements

Crawl a modern web app (SPA)

  1. crawl_dynamic.py "url" --wait 5 --scroll → full JS-rendered content

Edge cases

  • Paywalled sites: May return partial content or login pages. Inform the user.
  • Rate limiting / CAPTCHAs: If requests fail with 403/429, wait and retry or inform the user.
  • Very large pages: Use --max-length to truncate output and avoid overwhelming the context window.
  • Encoding issues: Scripts handle UTF-8 by default. Exotic encodings may need manual adjustment.
  • Robots.txt: These scripts do not check robots.txt. Use responsibly and respect website terms of service.
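Because the scripts do not check robots.txt, a caller who wants to honor it could run a check before fetching. The sketch below uses Python's standard `urllib.robotparser`. The helper name `is_allowed` and the choice to pass the robots.txt text in as a string are assumptions made so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules, invented for illustration.
rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/public/page"))
print(is_allowed(rules, "https://example.com/private/data"))
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` downloads the file before `can_fetch()` is called.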

Install

Requires askill CLI v1.0+

AI Quality Score

92/100 (analyzed 3/22/2026)

Highly comprehensive web scraping skill with excellent structure. Includes clear "when to use" section, multiple well-documented scripts with options, decision guide, common workflows, and edge cases. Provides actionable step-by-step instructions with dependency management and error recovery. Tags improve discoverability. Generic enough to be highly reusable across projects. Minor deduction for lack of https icon.

Metadata

License: unknown
Version: -
Updated: 3/20/2026
Publisher: AIDotNet

Tags

api