# Scrappy

Scrappy is a Go HTTP service that uses a warm Chrome pool to provide fast rendering and extraction APIs.
## Endpoints

| Method | Path | Purpose |
|---|---|---|
| POST | `/html` | Return page HTML |
| POST | `/markdown` | Return extracted markdown-like content |
| POST | `/screenshot` | Capture screenshot and upload to R2 |
| POST | `/pool/scale` | Resize browser pool (admin token required when configured) |
| GET | `/stats` | Inspect pool health/utilization (admin token required when configured) |
| GET | `/health` | Liveness endpoint (bypasses IP allowlist) |
## Requirements

- Go 1.26+
- Chrome/Chromium available (or let the Rod launcher manage it)
- Optional: Cloudflare R2 credentials for `/screenshot`
## Quick Start

- Install dependencies.
- Configure environment, then edit `.env` and set required values.
- Start the service.

The default bind address is `:3000` (`SCRAPPY_ADDR`). If calling from a non-local IP in development, set `SCRAPPY_ALLOWED_IPS` accordingly.
## API Usage

### Render HTML

```bash
curl -X POST http://localhost:3000/html \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","viewport":{"width":1280,"height":800}}'
```

### Extract Markdown

```bash
curl -X POST http://localhost:3000/markdown \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'
```

### Capture Screenshot

```bash
curl -X POST http://localhost:3000/screenshot \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","viewport":{"width":1440,"height":756},"format":"jpeg","quality":90,"device_scale_factor":2}'
```

### Scale Pool

```bash
curl -X POST http://localhost:3000/pool/scale \
  -H "Content-Type: application/json" \
  -d '{"size":3}'
```
## CLI

A local CLI is available at `cmd/scrappy` for script and agent workflows.

Run via Go:

```bash
go run ./cmd/scrappy --help
```

Example commands:

```bash
go run ./cmd/scrappy --base-url http://localhost:3000 html \
  --url https://example.com
go run ./cmd/scrappy --base-url http://localhost:3000 markdown \
  --url https://example.com --wait-ms 1500
go run ./cmd/scrappy --base-url http://localhost:3000 screenshot \
  --url https://example.com --format webp --quality 90 --device-scale-factor 2
go run ./cmd/scrappy --base-url http://localhost:3000 stats
go run ./cmd/scrappy --base-url http://localhost:3000 scale --size 3
```

Global CLI flags:

- `--base-url` (env: `SCRAPPY_BASE_URL`, default `http://localhost:3000`)
- `--admin-token` (env: `SCRAPPY_ADMIN_TOKEN`)
- `--http-timeout-ms` (env: `SCRAPPY_HTTP_TIMEOUT_MS`, default `80000`)
- `--pretty` (pretty-print JSON output)
## MCP Server

An MCP stdio server is available at `cmd/scrappy-mcp` and exposes these tools:

- `scrappy_html`
- `scrappy_markdown`
- `scrappy_screenshot`
- `scrappy_stats`
- `scrappy_scale`

Quick local test:

```bash
go run ./cmd/scrappy-mcp --help
```
## Run Modes

Use one of these patterns:

- Run from the repo root with Go:

  ```bash
  go run ./cmd/scrappy-mcp --base-url http://127.0.0.1:3000
  ```

- Build once and run from `PATH`:

  ```bash
  go build -o ./bin/scrappy-mcp ./cmd/scrappy-mcp
  ```

  Then make sure `scrappy-mcp` is on your `PATH`, and run:

  ```bash
  scrappy-mcp --base-url http://127.0.0.1:3000
  ```
## Codex Integration

Codex MCP servers are configured in `config.toml` under `[mcp_servers.<name>]`.

If Codex is started from this repo root, use `go run`:

```toml
[mcp_servers.scrappy]
command = "go"
args = ["run", "./cmd/scrappy-mcp", "--base-url", "http://127.0.0.1:3000"]
```

If you want it to work from any directory, use a binary on `PATH`:

```toml
[mcp_servers.scrappy]
command = "scrappy-mcp"
args = ["--base-url", "http://127.0.0.1:3000"]
```
## Generic MCP Client Integration

Most MCP clients use a JSON shape like this:

```json
{
  "mcpServers": {
    "scrappy": {
      "command": "scrappy-mcp",
      "args": ["--base-url", "http://127.0.0.1:3000"]
    }
  }
}
```

## Auth and Safety

If your `/stats` or `/pool/scale` endpoints require auth, set `SCRAPPY_ADMIN_TOKEN` in the environment used to launch your MCP client, or pass `--admin-token`.

Recommended setup for least privilege:

- `scrappy`: no admin token (safe default for content extraction tools)
- `scrappy_admin`: admin token enabled (only for pool ops when needed)
## Typical LLM Prompts

Examples that reliably trigger tool usage:

- Use `scrappy_markdown` for https://example.com and return the top 10 links.
- Call `scrappy_html` for https://example.com and extract title + canonical URL.
- Check `scrappy_stats` and report if pool saturation is high.
- If busy instances are above 2, call `scrappy_scale` with size 5.
## Request Fields

Common request fields for `/html` and `/markdown`:

- `url` (required)
- `viewport.width` / `viewport.height` (optional)
- `user_agent` (optional)
- `wait_ms` (optional)
- `timeout_ms` (optional)

Request constraints:

- URL must be an absolute `http://` or `https://` URL.
- URL credentials (`https://user:pass@...`) are rejected.
- Private/local network targets are blocked by default.
- `wait_ms` / `timeout_ms` / viewport / `device_scale_factor` are capped by server limits.
Additional fields for `/screenshot`:

- `format` (`jpeg`, `png`, `webp`)
- `quality` (ignored for `png`)
- `device_scale_factor` (optional DPR, minimum `1`, capped by `SCRAPPY_MAX_DEVICE_SCALE_FACTOR`)
## Configuration

Key environment variables:

### Server

- `SCRAPPY_ADDR` (default `:3000`)
- `SCRAPPY_ALLOWED_IPS` (comma-separated IPs/CIDRs)
- `SCRAPPY_TRUSTED_PROXY_CIDRS` (comma-separated proxy CIDRs allowed to set `X-Forwarded-For` / `X-Real-IP`)
### Security Controls

- `SCRAPPY_ALLOWED_TARGET_HOSTS` (optional comma-separated host allowlist; supports exact host, `.example.com`, `*.example.com`, and CIDR for IP targets)
- `SCRAPPY_BLOCK_PRIVATE_NETWORKS` (default `true`; blocks localhost/private/link-local/reserved targets)
- `SCRAPPY_ALLOW_LOOPBACK_TARGETS` (default `false`; when `true`, allows `localhost`/loopback targets for local development)
- `SCRAPPY_ADMIN_TOKEN` (optional; protects `/stats` and `/pool/scale` when set; use `Authorization: Bearer <token>` or `X-Admin-Token`)
- `SCRAPPY_MAX_REQUEST_BODY_BYTES` (default `1048576`)
- `SCRAPPY_MAX_WAIT_MS` (default `20000`)
- `SCRAPPY_MAX_TIMEOUT_MS` (default `60000`)
- `SCRAPPY_MAX_VIEWPORT_WIDTH` (default `2560`)
- `SCRAPPY_MAX_VIEWPORT_HEIGHT` (default `2560`)
- `SCRAPPY_MAX_DEVICE_SCALE_FACTOR` (default `3`)
### Browser Pool

- `BROWSER_POOL_ENABLED` (default `true`)
- `BROWSER_POOL_MIN_SIZE`, `BROWSER_POOL_MAX_SIZE`
- `BROWSER_POOL_LEASE_TIMEOUT`
- `BROWSER_POOL_IDLE_TTL`
- `BROWSER_POOL_MAX_REUSE`
- `BROWSER_POOL_SPAWN_TIMEOUT`
- `BROWSER_POOL_HANG_TIMEOUT`
- `BROWSER_POOL_SUPERVISOR_INTERVAL`
- `BROWSER_POOL_ALLOW_STANDALONE_FALLBACK` (default `false`)

Note: the pool timeout variables above are interpreted as seconds (legacy behavior in the config loader). Legacy aliases (`SCRAPPY_POOL_*`) are still supported for pool size/timeouts.
### Render Defaults

- `SCRAPPY_DEFAULT_VIEWPORT_WIDTH`
- `SCRAPPY_DEFAULT_VIEWPORT_HEIGHT`
- `SCRAPPY_DEFAULT_USER_AGENT`
- `SCRAPPY_DEFAULT_WAIT_MS`
- `SCRAPPY_DEFAULT_TIMEOUT_MS`
- `SCRAPPY_DEFAULT_FORMAT`
- `SCRAPPY_DEFAULT_QUALITY`
- `SCRAPPY_DEFAULT_DEVICE_SCALE_FACTOR`
### Browser Binary

- `SCRAPPY_CHROME_BIN` (optional explicit Chrome/Chromium binary)
- `SCRAPPY_CHROME_NO_SANDBOX` (default `false`; keep disabled unless strictly required)
- `SCRAPPY_CHROME_USER_DATA_DIR_ROOT` (default `/tmp/rod/user-data/scrappy`; Scrappy stores browser profiles under this app-owned root)
- `SCRAPPY_CHROME_PROFILE_CLEANUP_INTERVAL_SECONDS` (default `600`; set `0` to disable the stale profile janitor)
- `SCRAPPY_CHROME_PROFILE_CLEANUP_MAX_AGE_SECONDS` (default `3600`; directories older than this are pruned from Scrappy's browser profile root unless currently in use)

Note: older Scrappy deployments used Rod's shared default temp root (`/tmp/rod/user-data`). This change stops new growth there, but existing legacy directories under the old root may still need a one-time manual cleanup.
### R2 (required only for `/screenshot`)

- `R2_ENDPOINT`
- `R2_ACCESS_KEY_ID`
- `R2_SECRET_ACCESS_KEY`
- `R2_BUCKET`
- `R2_PUBLIC_BASE_URL`
- `R2_REGION` (default `auto`)
### Observability

- `SENTRY_DSN` (optional)
## Project Layout

Pool code is split by responsibility:

- `pool_types.go` - types/constants
- `pool_admin.go` - constructor/stats/scale/shutdown
- `pool_render.go` - public render/markdown/screenshot methods
- `pool_navigation.go` - navigation, settle, and extraction helpers
- `pool_page.go` - page lifecycle/setup
- `pool_manager.go` - pool internals (spawn/checkout/reap/logging)
- `browser_profiles.go` - Rod launcher cleanup and stale profile janitor
- `extraction_scripts.go` - browser-evaluated extraction scripts

Request parsing/defaults:

- `handlers.go`
- `options.go`
- `models.go`

For flow-level detail, see `ARCHITECTURE.md`.
## Development

Format and test before committing:

```bash
gofmt -w *.go
go test ./...
```
## Security Notes

- Do not expose this service publicly without network controls and authentication.
- Keep `SCRAPPY_ALLOWED_IPS` restricted to trusted callers.
- Configure `SCRAPPY_TRUSTED_PROXY_CIDRS` when running behind a reverse proxy.
- Keep `SCRAPPY_CHROME_NO_SANDBOX=false` in production.
- Report vulnerabilities using the process in `SECURITY.md`.
## Troubleshooting Markdown Extraction

For dynamic pages (for example, Webflow job pages), extraction can fail if links exist only in hidden or late-rendered nodes.

- The extractor ignores hidden content (`display:none`, `visibility:hidden`, `aria-hidden="true"`, `.w-condition-invisible`, `.hide`).
- Root selection prefers semantic/main containers and falls back to `document.body`.
- Link extraction converts relative URLs to absolute URLs using `document.baseURI`.
Debug sequence for extraction issues:

1. Call `/html` for the same URL and confirm the target `<a href>` exists in the rendered HTML.
2. Verify the link is not inside hidden variants or duplicated mobile/desktop nav containers.
3. Re-run `/markdown` with a larger `wait_ms` if content is injected after the initial paint.
4. Check `/stats` for pool errors, stale pages, or repeated timeouts.
## Deployment

- Dockerized via:
  - `Dockerfile` (current project image with extra custom fonts)
  - `Dockerfile.sample` (generic baseline image without the custom font bundle)
- Kamal templates are included:
  - `config/deploy.example.yml`
  - `.kamal/secrets.example`
- Create local deployment files before running Kamal:

  ```bash
  cp config/deploy.example.yml config/deploy.yml
  cp .kamal/secrets.example .kamal/secrets
  ```

- Edit the local copies with your registry, hosts, secrets source, and SSH user.
- Keep `config/deploy.yml` and `.kamal/secrets` private; both are gitignored by default.
## License

MIT. See `LICENSE`.