Feapder Dev
Quick Intake
Ask the user for the minimum information needed to produce a runnable crawler:
- Target site/app constraints (login, JS rendering, anti-bot, rate limits).
- Data schema (fields), uniqueness key, and storage target (file/DB/queue).
- Entry points (seed URLs, keywords, categories) and pagination strategy.
- Scale constraints (single machine vs distributed; incremental vs full; schedule).
- Environment constraints (OS, Python version, network/proxy availability).
Workflow Decision Tree
Decide the crawler architecture before writing code:
- Choose crawler type:
  - Choose AirSpider for small jobs with no distributed scheduling and simple persistence.
  - Choose Spider for high-volume, distributed crawling with persistent task queues.
  - Choose TaskSpider for explicit task tables/queues with retry/continuation semantics.
  - Choose BatchSpider for periodic batches with batch metadata and separation.
- Choose rendering:
  - Use HTTP-only requests when pages are static or the API is reachable.
  - Use browser rendering only when content is JS-rendered or requires complex interaction.
- Choose persistence and flow:
  - Use Item + pipeline for normalized storage.
  - Use explicit DB access helpers only when the pipeline is insufficient.
- Choose anti-bot strategy:
  - Add rate limits/backoff/retries first.
  - Add proxy rotation/user pool only when needed and with observability.
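The crawler-type branch of the tree can be written down as a small helper. This is only a sketch: the function name `choose_spider_type` and its boolean parameters are hypothetical and not part of Feapder, and the precedence order (batches, then task tables, then distribution) is an assumption. Only the four returned class names come from the list above.

```python
def choose_spider_type(distributed: bool = False,
                       task_table: bool = False,
                       periodic_batches: bool = False) -> str:
    """Map workload traits to a Feapder spider class name (illustrative only)."""
    if periodic_batches:
        return "BatchSpider"  # periodic batches with batch metadata
    if task_table:
        return "TaskSpider"   # explicit task tables/queues
    if distributed:
        return "Spider"       # high-volume, distributed crawling
    return "AirSpider"        # small jobs, simple persistence
```

Calling `choose_spider_type()` with no arguments returns `"AirSpider"`, which matches the advice to start with the smallest option and move up only when a requirement forces it.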
Build Steps
Implement the smallest runnable crawler first, then iterate:
1) Set up environment (prefer uv)
- Create a project venv with uv and pin a compatible Python version.
- Install feapder and any runtime dependencies (DB drivers, playwright/selenium if used).
- Verify import and basic CLI availability before writing spider code.
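The verification step can be done without importing anything heavy. The sketch below uses only the standard library. The helper name `check_environment` is hypothetical, and it assumes the installed console command is called `feapder`, so adjust the `cli` argument if your installation differs.

```python
import importlib.util
import shutil
import sys


def check_environment(module: str = "feapder", cli: str = "feapder") -> dict:
    """Report whether a module is importable and a CLI command is on PATH."""
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        # find_spec returns None when a top-level module is not installed
        "module_found": importlib.util.find_spec(module) is not None,
        # which returns None when the command is not on PATH
        "cli_found": shutil.which(cli) is not None,
    }
```

Run `print(check_environment())` inside the project venv. If either flag is `False`, fix the environment before writing spider code.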
2) Scaffold a minimal project
- Create a clean module layout: spiders/, parsers/, items/, pipelines/, settings.py.
- Start with one spider, one parser, and one item type.
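As a rough illustration, the layout above can be created with a few lines of standard-library Python. The function name `scaffold` is hypothetical. It only creates empty packages and an empty settings.py, and it is safe to run more than once.

```python
from pathlib import Path

# Directory names taken from the layout described above.
PACKAGES = ("spiders", "parsers", "items", "pipelines")


def scaffold(root: str) -> list:
    """Create empty packages plus settings.py under root; return relative paths."""
    base = Path(root)
    created = []
    for name in PACKAGES:
        pkg = base / name
        pkg.mkdir(parents=True, exist_ok=True)
        init = pkg / "__init__.py"
        init.touch(exist_ok=True)  # makes the directory an importable package
        created.append(init.relative_to(base).as_posix())
    settings = base / "settings.py"
    settings.touch(exist_ok=True)
    created.append(settings.relative_to(base).as_posix())
    return created
```

For example, `scaffold("my_crawler")` (a placeholder project name) creates the four packages and the settings file under `my_crawler/`.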
3) Implement a minimal spider + parser
Keep the first version minimal and end-to-end runnable:
    from feapder import AirSpider
    from feapder.network.request import Request

    class DemoSpider(AirSpider):
        def start_requests(self):
            # Seed request; replace with the real entry point.
            yield Request("https://example.com")

        def parse(self, request, response):
            # extract fields -> yield Item or dict
            yield {"url": response.url}

    if __name__ == "__main__":
        DemoSpider().start()
4) Add Item + pipeline (only after extraction works)
- Define a stable unique key for de-duplication.
- Add a pipeline that validates fields, normalizes types, and writes to the chosen sink.
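The validation and de-duplication logic can be prototyped as plain functions before it is wired into an Item and a pipeline. The sketch below is independent of Feapder. The schema (`url`, `title`, `price`), the `item_id` field, and the choice of a SHA-256 hash of the URL as the unique key are all assumptions made for illustration.

```python
import hashlib

REQUIRED_FIELDS = ("url", "title")  # hypothetical schema


def normalize(record: dict) -> dict:
    """Validate required fields, normalize types, and attach a stable unique key."""
    missing = [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    price = record.get("price")
    clean = {
        "url": record["url"].strip(),
        "title": " ".join(record["title"].split()),  # collapse whitespace runs
        "price": float(price) if price not in (None, "") else None,
    }
    # Assumption: the URL alone identifies a record.
    clean["item_id"] = hashlib.sha256(clean["url"].encode("utf-8")).hexdigest()
    return clean


def dedupe(records):
    """Yield only the first record seen for each item_id."""
    seen = set()
    for record in records:
        if record["item_id"] not in seen:
            seen.add(record["item_id"])
            yield record
```

Keeping these as pure functions makes them easy to unit-test against saved samples; the pipeline then only needs to call them and write the result to the chosen sink.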
5) Configure settings deliberately
- Keep settings minimal at first.
- Turn on logging suitable for debugging.
- Add retries/timeouts/backoff before adding heavier defenses.
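To make "backoff" concrete, here is a minimal sketch of a capped exponential backoff schedule with optional jitter. It is independent of Feapder, which has its own retry settings; the function name `backoff_delays` and all default values are assumptions chosen for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0,
                   jitter: float = 0.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one delay in seconds per retry attempt.

    The delay doubles per attempt, is capped at `cap`, and is then increased
    by a random fraction of itself between 0 and `jitter`.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays
```

With the defaults and no jitter the schedule is 1, 2, 4, 8, and 16 seconds. A small jitter value such as 0.1 spreads retries out so that many workers do not hit the target at the same instant.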
6) Run and debug iteratively
- Run one seed URL first.
- Add pagination/expansion only after single-page extraction is correct.
- Use logs and counters (success/fail/empty) to validate progress.
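The success/fail/empty counters mentioned above can be kept in a `collections.Counter`. This is a sketch, not a Feapder feature: the helper name `record_page` is hypothetical, and treating every non-2xx status as a failure is an assumption you may want to relax.

```python
from collections import Counter


def record_page(stats: Counter, status_code: int, items: list) -> str:
    """Classify one fetched page and update the running counters."""
    if not 200 <= status_code < 300:
        outcome = "fail"     # request-level problem
    elif not items:
        outcome = "empty"    # page fetched, but nothing extracted
    else:
        outcome = "success"
    stats[outcome] += 1
    stats["items"] += len(items)
    return outcome
```

A rising "empty" count with a low "fail" count usually points at selectors or rendering rather than the network, which tells you where to look next.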
7) Scale up (when required)
- Switch to Spider/TaskSpider/BatchSpider only when the workload or scheduling requires it.
- Add Redis/DB infrastructure as a deliberate dependency and document required services.
Debugging Checklist
Use a checklist-driven approach before changing architecture:
- Confirm the request layer works: DNS/SSL/proxy, timeouts, status codes, encoding.
- Confirm selectors/JSON parsing against real responses (save a sample response).
- Confirm that parse() yields items/requests as expected (no silent drops).
- Confirm pipeline and settings are loaded (wrong module path is a common cause).
- Add observability: log key decisions, count produced items, count retries and errors.
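Saving a sample response, as suggested in the checklist, lets you test selectors offline without hitting the site again. The sketch below writes the raw body and a small metadata file. The helper name `save_sample` and the file naming scheme are assumptions, and the caller is expected to pass in values taken from whatever response object is in use.

```python
import json
from pathlib import Path
from typing import Optional


def save_sample(directory: str, name: str, url: str, status_code: int,
                body: bytes, encoding: Optional[str] = None) -> Path:
    """Write the raw response body and a JSON metadata file; return the metadata path."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    # Keep the body as raw bytes so encoding problems stay reproducible.
    (out / f"{name}.body").write_bytes(body)
    meta = {
        "url": url,
        "status_code": status_code,
        "encoding": encoding,
        "bytes": len(body),
    }
    meta_path = out / f"{name}.meta.json"
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta_path
```

Storing the bytes unmodified is a deliberate choice: decoding before saving would hide the encoding issues that the first checklist item asks you to confirm.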
References
Read these only when needed to keep context small:
- references/checklist.md: Common tasks and quick checks for Feapder projects.
- references/patterns.md: Lightweight patterns for spider types, parsing, and pipelines.
- references/official-docs/: Local copy of Feapder official docs (Markdown).
- references/official-docs.md: Short note on using the local docs copy.
