Async-first web scraping framework built on wreq (HTTP with browser impersonation) and scraper-rs (fast HTML parsing). Silkworm gives you a minimal Spider/Request/Response model, middlewares, and pipelines so you can script quick scrapes or build larger crawlers without boilerplate.
NEW: Use silkworm-mcp to build scrapers.
Features
- Async engine with configurable concurrency, bounded queue backpressure (defaults to `concurrency * 10`), and per-request timeouts.
- wreq-powered HTTP client: browser impersonation, redirect following with loop detection, query merging, and proxy support via `request.meta["proxy"]`.
- Typed spiders and callbacks that can return items or `Request` objects; `HTMLResponse` ships helper methods plus `Response.follow` to reuse callbacks.
- Middlewares: User-Agent rotation/default, proxy rotation, retry with exponential backoff plus optional sleep codes, flexible delays (fixed/random/custom), `SkipNonHTMLMiddleware` to drop non-HTML callbacks, and `CloudflareCrawlMiddleware` for Browser Rendering crawl jobs.
- Pipelines: JSON Lines, SQLite, XML (nested data preserved), and CSV (flattens dicts and lists) out of the box.
- Structured logging via `logly` (`SILKWORM_LOG_LEVEL=DEBUG`), plus periodic/final crawl statistics (requests/sec, queue size, memory, seen URLs).
Installation
From PyPI with pip:
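```bash
pip install silkworm-rs
```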
From PyPI with uv (recommended for faster installs):
```bash
uv pip install silkworm-rs
# or if using uv's project management:
uv add silkworm-rs
```

From source:

```bash
uv venv  # install uv from https://docs.astral.sh/uv/getting-started/ if needed
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e .
```
Targets Python 3.13+; dependencies are pinned in pyproject.toml.
Quick start
Define a spider by subclassing Spider, implementing parse, and yielding items or follow-up Request objects. This example writes quotes to data/quotes.jl and enables basic user agent, retry, and non-HTML filtering middlewares.
```python
from silkworm import HTMLResponse, Response, Spider, run_spider
from silkworm.middlewares import (
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import JsonLinesPipeline


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ("https://quotes.toscrape.com/",)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return
        html = response
        for quote in await html.select(".quote"):
            text_el = await quote.select_first(".text")
            author_el = await quote.select_first(".author")
            if text_el is None or author_el is None:
                continue
            tags = await quote.select(".tag")
            yield {
                "text": text_el.text,
                "author": author_el.text,
                "tags": [t.text for t in tags],
            }
        if next_link := await html.select_first("li.next > a"):
            yield html.follow(next_link.attr("href"), callback=self.parse)


if __name__ == "__main__":
    run_spider(
        QuotesSpider,
        request_middlewares=[UserAgentMiddleware()],
        response_middlewares=[
            SkipNonHTMLMiddleware(),
            RetryMiddleware(max_times=3, sleep_http_codes=[429, 503]),
        ],
        item_pipelines=[JsonLinesPipeline("data/quotes.jl")],
        concurrency=16,
        request_timeout=10,
        log_stats_interval=30,
    )
```
`run_spider`/`crawl` knobs (a combined sketch follows the list):

- `concurrency`: number of concurrent HTTP requests; default 16.
- `max_pending_requests`: queue bound to avoid unbounded memory use (defaults to `concurrency * 10`).
- `request_timeout`: per-request timeout (seconds).
- `keep_alive`: reuse HTTP connections when supported by the underlying client (sends `Connection: keep-alive`).
- `html_max_size_bytes`: limit HTML parsed into `AsyncDocument` to avoid huge payloads.
- `log_stats_interval`: seconds between periodic stats logs; final stats are always emitted.
- `request_middlewares`/`response_middlewares`/`item_pipelines`: plug-ins run on every request/response/item.
- Use `run_spider_rsloop(...)` instead of `run_spider(...)` to run under rsloop (requires `pip install silkworm-rs[rsloop]`).
- Use `run_spider_uvloop(...)` instead of `run_spider(...)` to run under uvloop (requires `pip install silkworm-rs[uvloop]`).
- Use `run_spider_winloop(...)` instead of `run_spider(...)` to run under winloop on Windows (requires `pip install silkworm-rs[winloop]`).
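A minimal sketch combining these knobs; the values are illustrative, not recommendations, and `QuotesSpider` is the spider from the quick start above:

```python
from silkworm import run_spider

run_spider(
    QuotesSpider,                    # spider class from the quick start
    concurrency=32,                  # parallel HTTP requests
    max_pending_requests=320,        # bounded queue (here concurrency * 10, the default ratio)
    request_timeout=15,              # seconds per request
    keep_alive=True,                 # reuse connections where the client supports it
    html_max_size_bytes=2_000_000,   # cap HTML handed to the parser
    log_stats_interval=60,           # periodic stats every 60 seconds
)
```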
Built-in middlewares and pipelines
```python
from silkworm import run_spider
from silkworm.middlewares import (
    CloudflareCrawlMiddleware,
    DelayMiddleware,
    ProxyMiddleware,
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import (
    CallbackPipeline,  # invoke a custom callback function on each item
    CSVPipeline,
    JsonLinesPipeline,
    MsgPackPipeline,  # requires: pip install silkworm-rs[msgpack]
    SQLitePipeline,
    XMLPipeline,
    TaskiqPipeline,  # requires: pip install silkworm-rs[taskiq]
    PolarsPipeline,  # requires: pip install silkworm-rs[polars]
    ExcelPipeline,  # requires: pip install silkworm-rs[excel]
    YAMLPipeline,  # requires: pip install silkworm-rs[yaml]
    AvroPipeline,  # requires: pip install silkworm-rs[avro]
    ElasticsearchPipeline,  # requires: pip install silkworm-rs[elasticsearch]
    MongoDBPipeline,  # requires: pip install silkworm-rs[mongodb]
    MySQLPipeline,  # requires: pip install silkworm-rs[mysql]
    PostgreSQLPipeline,  # requires: pip install silkworm-rs[postgresql]
    S3JsonLinesPipeline,  # requires: pip install silkworm-rs[s3]
    VortexPipeline,  # requires: pip install silkworm-rs[vortex]
    WebhookPipeline,  # sends items to webhook endpoints using wreq
    GoogleSheetsPipeline,  # requires: pip install silkworm-rs[gsheets]
    SnowflakePipeline,  # requires: pip install silkworm-rs[snowflake]
    FTPPipeline,  # requires: pip install silkworm-rs[ftp]
    SFTPPipeline,  # requires: pip install silkworm-rs[sftp]
    CassandraPipeline,  # requires: pip install silkworm-rs[cassandra]
    CouchDBPipeline,  # requires: pip install silkworm-rs[couchdb]
    DynamoDBPipeline,  # requires: pip install silkworm-rs[dynamodb]
    DuckDBPipeline,  # requires: pip install silkworm-rs[duckdb]
)

# QuotesSpider as defined in the quick start above.
run_spider(
    QuotesSpider,
    request_middlewares=[
        UserAgentMiddleware(),  # rotate/custom user agent
        DelayMiddleware(min_delay=0.3, max_delay=1.2),  # polite throttling
        # ProxyMiddleware with round-robin selection (default)
        # ProxyMiddleware(proxies=["http://user:pass@proxy1:8080", "http://proxy2:8080"]),
        # ProxyMiddleware with random selection
        # ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True),
        # ProxyMiddleware from file with random selection
        # ProxyMiddleware(proxy_file="proxies.txt", random_selection=True),
    ],
    response_middlewares=[
        RetryMiddleware(max_times=3, sleep_http_codes=[403, 429]),  # backoff + retry
        SkipNonHTMLMiddleware(),  # drop callbacks for images/APIs/etc
    ],
    item_pipelines=[
        JsonLinesPipeline("data/quotes.jl"),
        SQLitePipeline("data/quotes.db", table="quotes"),
        XMLPipeline("data/quotes.xml", root_element="quotes", item_element="quote"),
        CSVPipeline("data/quotes.csv", fieldnames=["author", "text", "tags"]),
        MsgPackPipeline("data/quotes.msgpack"),
    ],
)
```
- `DelayMiddleware` strategies: `delay=1.0` (fixed), `min_delay`/`max_delay` (random), or `delay_func` (custom).
- `ProxyMiddleware` supports three modes:
  - Round-robin (default): `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"])` cycles through proxies in order.
  - Random selection: `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True)` randomly selects a proxy for each request.
  - From file: `ProxyMiddleware(proxy_file="proxies.txt")` loads proxies from a file (one proxy per line, blank lines ignored). Combine with `random_selection=True` for random selection from the file.
- `RetryMiddleware` backs off with `asyncio.sleep`; any status in `sleep_http_codes` is retried even if not in `retry_http_codes`.
- `SkipNonHTMLMiddleware` checks `Content-Type` and optionally sniffs the body (`sniff_bytes`) to avoid running HTML callbacks on binary/API responses.
- `CloudflareCrawlMiddleware` is opt-in per request via `request.meta["cloudflare_crawl"]`; it submits a Cloudflare Browser Rendering crawl job, polls until completion, and hands your callback a synthetic JSON `Response` with the final API payload (see the sketch after this list).
- `JsonLinesPipeline` writes items to a local JSON Lines file and, when `opendal` is installed, appends asynchronously via the filesystem backend (`use_opendal=False` to stick to a regular file handle).
- `CSVPipeline` flattens nested dicts (e.g., `{"user": {"name": "Alice"}}` -> `user_name`) and joins lists with commas; `XMLPipeline` preserves nesting.
- `MsgPackPipeline` writes items in binary MessagePack format using ormsgpack for fast and compact serialization (requires `pip install silkworm-rs[msgpack]`).
- `TaskiqPipeline` sends items to a Taskiq queue for distributed processing (requires `pip install silkworm-rs[taskiq]`).
- `PolarsPipeline` writes items to a Parquet file using Polars for efficient columnar storage (requires `pip install silkworm-rs[polars]`).
- `ExcelPipeline` writes items to an Excel .xlsx file (requires `pip install silkworm-rs[excel]`).
- `YAMLPipeline` writes items to a YAML file (requires `pip install silkworm-rs[yaml]`).
- `AvroPipeline` writes items to an Avro file with optional schema (requires `pip install silkworm-rs[avro]`).
- `ElasticsearchPipeline` sends items to an Elasticsearch index (requires `pip install silkworm-rs[elasticsearch]`).
- `MongoDBPipeline` sends items to a MongoDB collection (requires `pip install silkworm-rs[mongodb]`).
- `MySQLPipeline` sends items to a MySQL database table as JSON (requires `pip install silkworm-rs[mysql]`).
- `PostgreSQLPipeline` sends items to a PostgreSQL database table as JSONB (requires `pip install silkworm-rs[postgresql]`).
- `S3JsonLinesPipeline` writes items to AWS S3 in JSON Lines format using async OpenDAL (requires `pip install silkworm-rs[s3]`).
- `VortexPipeline` writes items to a Vortex file for high-performance columnar storage with 100x faster random access and 10-20x faster scans compared to Parquet (requires `pip install silkworm-rs[vortex]`).
- `WebhookPipeline` sends items to webhook endpoints via HTTP POST/PUT using wreq (the same HTTP client as the spider) with support for batching and custom headers.
- `GoogleSheetsPipeline` appends items to Google Sheets with automatic flattening of nested data structures (requires `pip install silkworm-rs[gsheets]` and service account credentials).
- `SnowflakePipeline` sends items to Snowflake data warehouse tables as JSON (requires `pip install silkworm-rs[snowflake]`).
- `FTPPipeline` writes items to an FTP server in JSON Lines format (requires `pip install silkworm-rs[ftp]`).
- `SFTPPipeline` writes items to an SFTP server in JSON Lines format with support for password or key-based authentication (requires `pip install silkworm-rs[sftp]`).
- `CassandraPipeline` sends items to Apache Cassandra database tables (requires `pip install silkworm-rs[cassandra]`).
- `CouchDBPipeline` sends items to CouchDB databases as documents (requires `pip install silkworm-rs[couchdb]`).
- `DynamoDBPipeline` sends items to AWS DynamoDB tables with automatic table creation (requires `pip install silkworm-rs[dynamodb]`).
- `DuckDBPipeline` sends items to a DuckDB database table as JSON (requires `pip install silkworm-rs[duckdb]`).
- `CallbackPipeline` invokes a custom callback function (sync or async) on each item, enabling inline processing logic without creating a full pipeline class. See example below.
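A minimal sketch of the per-request opt-ins that middlewares and the HTTP client read from `request.meta`. The key names `proxy` and `cloudflare_crawl` come from the descriptions above; passing `meta` as a `Request` keyword argument and using a truthy flag for the Cloudflare opt-in are assumptions here, and the URLs are illustrative:

```python
from silkworm import Request, Spider


class MetaDemoSpider(Spider):
    name = "meta_demo"

    async def start_requests(self):
        # Route this request through a specific proxy; the client reads
        # request.meta["proxy"] (assumes `meta` is accepted as a Request kwarg).
        yield Request(
            url="https://example.com/behind-proxy",
            callback=self.parse,
            meta={"proxy": "http://user:pass@proxy1:8080"},
        )
        # Opt this request into the Cloudflare Browser Rendering crawl flow;
        # CloudflareCrawlMiddleware checks request.meta["cloudflare_crawl"].
        yield Request(
            url="https://example.com/js-heavy",
            callback=self.parse,
            meta={"cloudflare_crawl": True},
        )

    async def parse(self, response):
        yield {"fetched": True}
```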
Using CallbackPipeline for custom processing
Process items with custom callback functions without creating a full pipeline class:
```python
from silkworm import run_spider
from silkworm.pipelines import CallbackPipeline


# Sync callback
def print_item(item, spider):
    print(f"[{spider.name}] {item}")
    return item


# Async callback
async def validate_item(item, spider):
    # Could do async operations like database checks
    if len(item.get("text", "")) < 10:
        print("Warning: Short text in item")
    return item


# Modifying callback
def enrich_item(item, spider):
    item["spider_name"] = spider.name
    item["processed"] = True
    return item


# QuotesSpider as defined in the quick start above.
run_spider(
    QuotesSpider,
    item_pipelines=[
        CallbackPipeline(callback=print_item),
        CallbackPipeline(callback=validate_item),
        CallbackPipeline(callback=enrich_item),
    ],
)
```
Callbacks receive `(item, spider)` and should return the processed item; returning `None` keeps the original item unchanged.
Streaming items to a queue with TaskiqPipeline
Stream scraped items to a Taskiq queue for distributed processing:
```python
from taskiq import InMemoryBroker

from silkworm import run_spider
from silkworm.pipelines import TaskiqPipeline

broker = InMemoryBroker()


@broker.task
async def process_item(item):
    # Your item processing logic here
    print(f"Processing: {item}")
    # Save to database, send to another service, etc.


pipeline = TaskiqPipeline(broker, task=process_item)

# MySpider: your spider class.
run_spider(MySpider, item_pipelines=[pipeline])
```
This enables distributed processing, retries, rate limiting, and other Taskiq features. See examples/taskiq_quotes_spider.py for a complete example.
Handling non-HTML responses
Keep crawls cheap when URLs mix HTML and binaries/APIs:
```python
response_middlewares=[SkipNonHTMLMiddleware(sniff_bytes=1024)]

# Tighten HTML parsing size (bytes) to avoid loading huge bodies into scraper-rs
run_spider(MySpider, html_max_size_bytes=1_000_000)
```
Performance optimization with rsloop
For improved async performance, enable rsloop as a drop-in replacement for asyncio's event loop:
```bash
pip install silkworm-rs[rsloop]
# or with uv:
uv pip install silkworm-rs[rsloop]
```

Then call `run_spider_rsloop` (same signature as `run_spider`):
```python
from silkworm import run_spider_rsloop

run_spider_rsloop(
    QuotesSpider,
    concurrency=32,
)
```
Performance optimization with uvloop
For improved async performance, enable uvloop (a fast, drop-in replacement for asyncio's event loop):
```bash
pip install silkworm-rs[uvloop]
# or with uv:
uv pip install silkworm-rs[uvloop]
```

Then call `run_spider_uvloop` (same signature as `run_spider`):
```python
from silkworm import run_spider_uvloop

run_spider_uvloop(
    QuotesSpider,
    concurrency=32,
)
```
uvloop can provide a 2-4x performance improvement for I/O-bound workloads.
Performance optimization with winloop (Windows)
For Windows users who want improved async performance, enable winloop (a Windows-compatible alternative to uvloop):
```bash
pip install silkworm-rs[winloop]
# or with uv:
uv pip install silkworm-rs[winloop]
```

Then call `run_spider_winloop` (same signature as `run_spider`):
```python
from silkworm import run_spider_winloop

run_spider_winloop(
    QuotesSpider,
    concurrency=32,
)
```
winloop provides significant performance improvements on Windows, similar to what uvloop offers on Unix-like systems.
Running spiders with trio
If you prefer trio over asyncio, you can use run_spider_trio instead of run_spider:
```bash
pip install silkworm-rs[trio]
# or with uv:
uv pip install silkworm-rs[trio]
```

Then use `run_spider_trio`:
```python
from silkworm import run_spider_trio

run_spider_trio(
    QuotesSpider,
    concurrency=16,
    request_timeout=10,
)
```
This runs your spider with trio as the async backend via the trio-asyncio compatibility layer.
JavaScript rendering with Lightpanda (CDP)
For pages that require JavaScript execution, you can use Lightpanda (or any CDP-compatible browser) instead of the standard HTTP client. This uses the Chrome DevTools Protocol (CDP) to control a browser.
Installation
```bash
pip install silkworm-rs[cdp]
# or with uv:
uv pip install silkworm-rs[cdp]
```

Starting Lightpanda
```bash
lightpanda --remote-debugging-port=9222
```
Or use Chrome/Chromium:
```bash
chromium --remote-debugging-port=9222 --headless
```
Using CDP in your spider
There are two ways to use CDP: the convenience API or custom spider integration.
Convenience API (simple one-off fetches)
```python
import asyncio

from silkworm import fetch_html_cdp


async def main():
    # Fetch HTML with JavaScript rendering
    text, doc = await fetch_html_cdp(
        "https://example.com",
        ws_endpoint="ws://127.0.0.1:9222",
        timeout=30.0,
    )
    # Extract data from the rendered page
    title = await doc.select_first("title")
    print(title.text if title else "No title")


asyncio.run(main())
```
Full Spider Integration
```python
from silkworm import HTMLResponse, Request, Response, Spider
from silkworm.cdp import CDPClient


class LightpandaSpider(Spider):
    name = "lightpanda"
    start_urls = ("https://example.com/",)

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._cdp_client = None

    async def start_requests(self):
        # Connect to CDP endpoint
        self._cdp_client = CDPClient(
            ws_endpoint="ws://127.0.0.1:9222",
            timeout=30.0,
        )
        await self._cdp_client.connect()
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return
        # Extract links from the JavaScript-rendered page
        for link in await response.select("a"):
            href = link.attr("href")
            if href:
                yield {"url": href}

    async def close(self):
        if self._cdp_client:
            await self._cdp_client.close()
```
See examples/lightpanda_simple.py and examples/lightpanda_spider.py for complete working examples.
Note: CDP support is experimental. For production use, consider using dedicated browser automation tools or the standard HTTP client when JavaScript rendering is not required.
Logging and crawl statistics
- Structured logs via `logly`; set `SILKWORM_LOG_LEVEL=DEBUG` for verbose request/response/middleware output.
- Periodic statistics with `log_stats_interval`; final stats always include elapsed time, queue size, requests/sec, seen URLs, items scraped, errors, and memory MB (see the sketch below).
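A minimal sketch combining both knobs; it assumes the environment variable is read when the crawl starts (exporting it in the shell before launching the script works just as well), and `QuotesSpider` is the quick-start spider:

```python
import os

from silkworm import run_spider

# Verbose request/response/middleware logging (assumption: the variable is read at crawl start;
# setting SILKWORM_LOG_LEVEL=DEBUG in the shell before launching has the same effect).
os.environ["SILKWORM_LOG_LEVEL"] = "DEBUG"

# Stats are logged every 15 seconds and once more when the crawl finishes.
run_spider(QuotesSpider, log_stats_interval=15)
```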
Limitations
- By default, HTTP fetches are wreq-based without JavaScript execution; pages requiring client-side rendering can use the optional CDP integration (see "JavaScript rendering with Lightpanda" section) or external browser automation tools.
- Request deduplication keys only on `Request.url`; query params, HTTP method, and body are ignored, so same-URL requests with different params/data are dropped unless you set `dont_filter=True` or make the URL unique yourself (see the sketch below).
- HTML parsing auto-detects encoding (BOM, HTTP headers/meta, charset detection fallback) but still enforces an `html_max_size_bytes`/`doc_max_size_bytes` cap (default 5 MB) in `scraper-rs` selectors, so very large pages may need a higher limit or preprocessing.
- Several pipelines buffer all items in memory until close (PolarsPipeline, ExcelPipeline, YAMLPipeline, AvroPipeline, VortexPipeline, S3JsonLinesPipeline, FTPPipeline, SFTPPipeline), which can bloat RAM on long crawls; prefer streaming pipelines like JsonLines/CSV/SQLite for high-volume runs.
- Many destination pipelines rely on optional extras; CassandraPipeline is disabled on Windows because `cassandra-driver` depends on libev there.
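If you do need to re-fetch the same URL, a minimal sketch of the two workarounds mentioned above (assuming `dont_filter` is accepted as a `Request` keyword argument; the spider and callback names are illustrative):

```python
from silkworm import Request, Spider


class SearchSpider(Spider):
    name = "search"
    start_urls = ("https://example.com/search",)

    async def parse(self, response):
        # 1) Bypass the URL-only dedup filter for a repeat fetch of the same URL
        #    (assumes dont_filter is accepted as a Request keyword argument).
        yield Request(
            url="https://example.com/search",
            callback=self.parse_page,
            dont_filter=True,
        )
        # 2) Or make the URL itself unique, since deduplication keys only on Request.url.
        yield Request(
            url="https://example.com/search?page=2",
            callback=self.parse_page,
        )

    async def parse_page(self, response):
        yield {"fetched": True}
```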
Examples
- `python examples/quotes_spider.py` → `data/quotes.jl`
- `python examples/quotes_spider_trio.py` → `data/quotes_trio.jl` (demonstrates the trio backend)
- `python examples/quotes_spider_winloop.py` → `data/quotes_winloop.jl` (demonstrates the winloop backend for Windows)
- `python examples/hackernews_spider.py --pages 5` → `data/hackernews.jl`
- `python examples/lobsters_spider.py --pages 2` → `data/lobsters.jl`
- `python examples/url_titles_spider.py --urls-file data/url_titles.jl --output data/titles.jl` (includes `SkipNonHTMLMiddleware` and stricter HTML size limits)
- `python examples/export_formats_demo.py --pages 2` → JSONL, XML, and CSV outputs in `data/`
- `python examples/taskiq_quotes_spider.py --pages 2` → demonstrates TaskiqPipeline for queue-based processing
- `python examples/sitemap_spider.py --sitemap-url https://example.com/sitemap.xml --pages 50` → `data/sitemap_meta.jl` (extracts meta tags and Open Graph data from sitemap URLs)
- `python examples/lightpanda_simple.py` → demonstrates CDP/Lightpanda for JavaScript rendering (requires `pip install silkworm-rs[cdp]` and a running Lightpanda)
- `python examples/lightpanda_spider.py` → full spider example using CDP/Lightpanda
Convenience API
For one-off fetches without a full spider:
Standard HTTP fetch
```python
import asyncio

from silkworm import fetch_html


async def main():
    text, doc = await fetch_html("https://example.com")
    title = await doc.select_first("title")
    print(title.text if title else "No title")


asyncio.run(main())
```
CDP-based fetch (with JavaScript rendering)
```python
import asyncio

from silkworm import fetch_html_cdp


async def main():
    # Requires Lightpanda/Chrome running with CDP enabled
    text, doc = await fetch_html_cdp("https://example.com")
    title = await doc.select_first("title")
    print(title.text if title else "No title")


asyncio.run(main())
```
Contributing
Pull requests and issues are welcome. To set up a dev environment, install uv, create a Python 3.13 virtualenv, and sync dev dependencies:
```bash
uv venv --python python3.13
uv sync --group dev
```
Run the checks before opening a PR:
```bash
just fmt && just lint && just typecheck && just test
```
Acknowledgements
Silkworm is built on top of excellent open-source projects:
- wreq - HTTP client with browser impersonation capabilities
- scraper-rs - Fast HTML parsing library
- logly - Structured logging
- rxml - XML parsing and writing
We are grateful to the maintainers and contributors of these projects for their work.
License
MIT License. See LICENSE for details.