Show HN: ScrapAI – We scrape 500 sites. AI runs once per site, not per page

github.com

3 points by iranu a month ago · 3 comments


istinetz 23 days ago

Very neat!

How do you combat silent failures?

For example: I am scraping website A and getting 500+ PDFs. Then they change their layout, the ETL breaks, we auto-regenerate it with Claude, but now we get only 450 PDFs. The orchestrator still marks it as a successful run, yet we receive only part of the data.

Or: the ETL for website B breaks. We use our agentic solution, successfully repair it, and it completes without errors, but we start missing a few fields that were moved to another sub-page.

Did you encounter any such issues?

  • iranuOP 18 days ago

    Thanks for the comment, great question.

    Quick clarification: the AI agent writes the config once and is out of the loop after that. You run crawls yourself or via cron. So the "auto-regenerate and silently get wrong data" scenario doesn't quite apply since there's no agent in the runtime loop.

    But configs going stale is a real problem. Two things help:

    1. The agent tests on 5 real pages before saving any config. Empty fields = rewrite before it hits production.

    2. `./scrapai health --project <n>` tests all your spiders and flags extraction failures. We run it monthly via cron. Broken spider? Point the agent at it, it re-analyzes and fixes.
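    As an illustration, a monthly cron entry for that health check might look like the following (the install path, project name, and log path are placeholders, not taken from the README):

```
# Run the health check at 03:00 on the 1st of every month
0 3 1 * * cd /opt/scrapai && ./scrapai health --project news >> logs/health.log 2>&1
```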

    The gap: result count drops (your 500 to 450 example). Health checks catch broken extraction, not "fewer pages matched." We list structural change detection as an open contribution area in the README.
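    One way to close that gap (a hypothetical sketch, not part of ScrapAI; the names and threshold are illustrative) is to compare each run's item count against a rolling baseline of previous successful runs:

```python
def count_regression(history, latest, tolerance=0.8):
    """Flag a run whose item count fell below `tolerance` times the
    average of recent successful runs (the 500 -> 450 case above)."""
    if not history:
        return False  # no baseline yet; nothing to compare against
    baseline = sum(history) / len(history)
    return latest < tolerance * baseline
```

    A scheduler could run this after each crawl and raise an alert instead of silently marking the run green.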

iranuOP a month ago

  Hi HN, I built this. It's been in production across 500+ websites.

  We're a research group that studies online communications. We needed to scrape hundreds of sites regularly — news, blogs, forums, policy orgs — and maintain all those scrapers. At 10 sites, individual scrapers were fine. At 200+ we were spending more time fixing broken scrapers than doing actual work. Every redesign broke something, every new site meant another scraper from scratch.

  ScrapAI flips the cost model. You tell an AI agent "add bbc.co.uk to my news project." It analyzes the site, writes URL patterns and extraction rules, tests on 5 pages, and saves a JSON config to a database. After that it's just Scrapy — no AI in the loop, no per-page inference calls. ~$1-3 in tokens per website with Sonnet 4.5, not per page.
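  For illustration, a config in this style might look like the following (the field names and Scrapy-style CSS selectors are my assumption; the real schema lives in the repo):

```json
{
  "site": "bbc.co.uk",
  "project": "news",
  "url_patterns": ["https://www.bbc.co.uk/news/articles/*"],
  "fields": {
    "title": "h1::text",
    "published": "time::attr(datetime)",
    "body": "article p::text"
  }
}
```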

  Cloudflare was the hardest part. Most tools keep a browser open for every request (~5-10s per page). We use CloakBrowser (open source, C++ stealth patches, 0.9 reCAPTCHA v3 score) to solve the challenge once, cache the cookies, kill the browser, and hit the site with normal HTTP. Re-solves every ~10 minutes. 1,000 pages in ~8 minutes vs 2+ hours.

  The agent writes JSON configs, not Python. An agent that writes and runs code can do anything an unsupervised developer can — one prompt injection from a malicious page and you have a real problem. JSON goes through Pydantic validation before it touches the database. Worst case is a bad config that extracts wrong fields. This also makes it safe to use as a tool for Claws — structured web data without arbitrary code execution.
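  As a rough stdlib stand-in for that validation step (the project uses Pydantic; this dataclass sketch with invented field names just shows the reject-before-write idea):

```python
from dataclasses import dataclass

@dataclass
class SpiderConfig:
    site: str
    url_patterns: list[str]
    fields: dict[str, str]  # field name -> selector

    def __post_init__(self):
        # Reject malformed agent output before it reaches the database
        if not isinstance(self.site, str) or not self.site:
            raise ValueError("site must be a non-empty string")
        if not all(isinstance(p, str) for p in self.url_patterns):
            raise ValueError("url_patterns must be strings")
        if not all(isinstance(k, str) and isinstance(v, str)
                   for k, v in self.fields.items()):
            raise ValueError("fields must map names to selector strings")
```

  The worst case stays a bad value in a validated shape, never executable code.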

  ~4,000 lines of Python. Scrapy, SQLAlchemy, Alembic. Apache 2.0. We recommend Claude Code with Sonnet 4.5, but it works with any agent that can read instructions and run shell commands. We tried GLM 4.7 and it performed similarly, just slower.
