BitingSnakes/silkworm: Async web scraping framework on top of Rust. Works with free-threaded Python (`PYTHON_GIL=0`).



Async-first web scraping framework built on wreq (HTTP with browser impersonation) and scraper-rs (fast HTML parsing). Silkworm gives you a minimal Spider/Request/Response model, middlewares, and pipelines so you can script quick scrapes or build larger crawlers without boilerplate.

NEW: Use silkworm-mcp to build scrapers.

Features

  • Async engine with configurable concurrency, bounded queue backpressure (defaults to concurrency * 10), and per-request timeouts.
  • wreq-powered HTTP client: browser impersonation, redirect following with loop detection, query merging, and proxy support via request.meta["proxy"] (see the sketch after this list).
  • Typed spiders and callbacks that can return items or Request objects; HTMLResponse ships helper methods plus Response.follow to reuse callbacks.
  • Middlewares: User-Agent rotation/default, proxy rotation, retry with exponential backoff + optional sleep codes, flexible delays (fixed/random/custom), SkipNonHTMLMiddleware to drop non-HTML callbacks, and CloudflareCrawlMiddleware for Browser Rendering crawl jobs.
  • Pipelines: JSON Lines, SQLite, XML (nested data preserved), and CSV (flattens dicts and lists) out of the box.
  • Structured logging via logly (SILKWORM_LOG_LEVEL=DEBUG), plus periodic/final crawl statistics (requests/sec, queue size, memory, seen URLs).
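
For instance, per-request proxying via request.meta["proxy"] might look like the following sketch (an assumption on our part: that Request accepts a meta mapping as a constructor keyword; if not, set request.meta["proxy"] on the object before yielding):

from silkworm import Request, Spider

class ProxiedSpider(Spider):
    name = "proxied"

    async def start_requests(self):
        # Route this request through a specific proxy.
        # Assumption: meta can be passed at construction time; otherwise
        # assign to request.meta["proxy"] after creating the Request.
        yield Request(
            url="https://quotes.toscrape.com/",
            callback=self.parse,
            meta={"proxy": "http://user:pass@proxy1:8080"},
        )

    async def parse(self, response):
        yield {"fetched": True}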

Installation

From PyPI with pip:

pip install silkworm-rs

From PyPI with uv (recommended for faster installs):

uv pip install silkworm-rs
# or if using uv's project management:
uv add silkworm-rs

From source:

uv venv  # install uv from https://docs.astral.sh/uv/getting-started/ if needed
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e .

Targets Python 3.13+; dependencies are pinned in pyproject.toml.

Quick start

Define a spider by subclassing Spider, implementing parse, and yielding items or follow-up Request objects. This example writes quotes to data/quotes.jl and enables basic user agent, retry, and non-HTML filtering middlewares.

from silkworm import HTMLResponse, Response, Spider, run_spider
from silkworm.middlewares import (
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import JsonLinesPipeline


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ("https://quotes.toscrape.com/",)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return

        html = response
        for quote in await html.select(".quote"):
            text_el = await quote.select_first(".text")
            author_el = await quote.select_first(".author")
            if text_el is None or author_el is None:
                continue
            tags = await quote.select(".tag")
            yield {
                "text": text_el.text,
                "author": author_el.text,
                "tags": [t.text for t in tags],
            }

        if next_link := await html.select_first("li.next > a"):
            yield html.follow(next_link.attr("href"), callback=self.parse)


if __name__ == "__main__":
    run_spider(
        QuotesSpider,
        request_middlewares=[UserAgentMiddleware()],
        response_middlewares=[
            SkipNonHTMLMiddleware(),
            RetryMiddleware(max_times=3, sleep_http_codes=[429, 503]),
        ],
        item_pipelines=[JsonLinesPipeline("data/quotes.jl")],
        concurrency=16,
        request_timeout=10,
        log_stats_interval=30,
    )

run_spider/crawl knobs:

  • concurrency: number of concurrent HTTP requests; default 16.
  • max_pending_requests: queue bound to avoid unbounded memory use (defaults to concurrency * 10).
  • request_timeout: per-request timeout (seconds).
  • keep_alive: reuse HTTP connections when supported by the underlying client (sends Connection: keep-alive).
  • html_max_size_bytes: limit HTML parsed into AsyncDocument to avoid huge payloads.
  • log_stats_interval: seconds between periodic stats logs; final stats are always emitted.
  • request_middlewares / response_middlewares / item_pipelines: plug-ins run on every request/response/item.
  • use run_spider_rsloop(...) instead of run_spider(...) to run under rsloop (requires pip install silkworm-rs[rsloop]).
  • use run_spider_uvloop(...) instead of run_spider(...) to run under uvloop (requires pip install silkworm-rs[uvloop]).
  • use run_spider_winloop(...) instead of run_spider(...) to run under winloop on Windows (requires pip install silkworm-rs[winloop]).
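
Putting several of these knobs together (values are illustrative; the keyword names are those listed above, and QuotesSpider is the spider from the quick start):

from silkworm import run_spider

run_spider(
    QuotesSpider,
    concurrency=32,                 # parallel HTTP requests
    max_pending_requests=320,       # bounded queue (default: concurrency * 10)
    request_timeout=15,             # per-request timeout in seconds
    keep_alive=True,                # reuse connections where supported
    html_max_size_bytes=2_000_000,  # cap HTML handed to the parser
    log_stats_interval=60,          # log stats every 60 seconds
)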

Built-in middlewares and pipelines

from silkworm.middlewares import (
    CloudflareCrawlMiddleware,
    DelayMiddleware,
    ProxyMiddleware,
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import (
    CallbackPipeline,  # invoke a custom callback function on each item
    CSVPipeline,
    JsonLinesPipeline,
    MsgPackPipeline,  # requires: pip install silkworm-rs[msgpack]
    SQLitePipeline,
    XMLPipeline,
    TaskiqPipeline,  # requires: pip install silkworm-rs[taskiq]
    PolarsPipeline,  # requires: pip install silkworm-rs[polars]
    ExcelPipeline,  # requires: pip install silkworm-rs[excel]
    YAMLPipeline,  # requires: pip install silkworm-rs[yaml]
    AvroPipeline,  # requires: pip install silkworm-rs[avro]
    ElasticsearchPipeline,  # requires: pip install silkworm-rs[elasticsearch]
    MongoDBPipeline,  # requires: pip install silkworm-rs[mongodb]
    MySQLPipeline,  # requires: pip install silkworm-rs[mysql]
    PostgreSQLPipeline,  # requires: pip install silkworm-rs[postgresql]
    S3JsonLinesPipeline,  # requires: pip install silkworm-rs[s3]
    VortexPipeline,  # requires: pip install silkworm-rs[vortex]
    WebhookPipeline,  # sends items to webhook endpoints using wreq
    GoogleSheetsPipeline,  # requires: pip install silkworm-rs[gsheets]
    SnowflakePipeline,  # requires: pip install silkworm-rs[snowflake]
    FTPPipeline,  # requires: pip install silkworm-rs[ftp]
    SFTPPipeline,  # requires: pip install silkworm-rs[sftp]
    CassandraPipeline,  # requires: pip install silkworm-rs[cassandra]
    CouchDBPipeline,  # requires: pip install silkworm-rs[couchdb]
    DynamoDBPipeline,  # requires: pip install silkworm-rs[dynamodb]
    DuckDBPipeline,  # requires: pip install silkworm-rs[duckdb]
)

run_spider(
    QuotesSpider,
    request_middlewares=[
        UserAgentMiddleware(),  # rotate/custom user agent
        DelayMiddleware(min_delay=0.3, max_delay=1.2),  # polite throttling
        # ProxyMiddleware with round-robin selection (default)
        # ProxyMiddleware(proxies=["http://user:pass@proxy1:8080", "http://proxy2:8080"]),
        # ProxyMiddleware with random selection
        # ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True),
        # ProxyMiddleware from file with random selection
        # ProxyMiddleware(proxy_file="proxies.txt", random_selection=True),
    ],
    response_middlewares=[
        RetryMiddleware(max_times=3, sleep_http_codes=[403, 429]),  # backoff + retry
        SkipNonHTMLMiddleware(),  # drop callbacks for images/APIs/etc
    ],
    item_pipelines=[
        JsonLinesPipeline("data/quotes.jl"),
        SQLitePipeline("data/quotes.db", table="quotes"),
        XMLPipeline("data/quotes.xml", root_element="quotes", item_element="quote"),
        CSVPipeline("data/quotes.csv", fieldnames=["author", "text", "tags"]),
        MsgPackPipeline("data/quotes.msgpack"),
    ],
)
  • DelayMiddleware strategies: delay=1.0 (fixed), min_delay/max_delay (random), or delay_func (custom).
  • ProxyMiddleware supports three modes:
    • Round-robin (default): ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"]) cycles through proxies in order.
    • Random selection: ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True) randomly selects a proxy for each request.
    • From file: ProxyMiddleware(proxy_file="proxies.txt") loads proxies from a file (one proxy per line, blank lines ignored). Combine with random_selection=True for random selection from the file.
  • RetryMiddleware backs off with asyncio.sleep; any status in sleep_http_codes is retried even if not in retry_http_codes.
  • SkipNonHTMLMiddleware checks Content-Type and optionally sniffs the body (sniff_bytes) to avoid running HTML callbacks on binary/API responses.
  • CloudflareCrawlMiddleware is opt-in per request via request.meta["cloudflare_crawl"]; it submits a Cloudflare Browser Rendering crawl job, polls until completion, and hands your callback a synthetic JSON Response with the final API payload (see the sketch after this list).
  • JsonLinesPipeline writes items to a local JSON Lines file and, when opendal is installed, appends asynchronously via the filesystem backend (use_opendal=False to stick to a regular file handle).
  • CSVPipeline flattens nested dicts (e.g., {"user": {"name": "Alice"}} -> user_name) and joins lists with commas; XMLPipeline preserves nesting.
  • MsgPackPipeline writes items in binary MessagePack format using ormsgpack for fast and compact serialization (requires pip install silkworm-rs[msgpack]).
  • TaskiqPipeline sends items to a Taskiq queue for distributed processing (requires pip install silkworm-rs[taskiq]).
  • PolarsPipeline writes items to a Parquet file using Polars for efficient columnar storage (requires pip install silkworm-rs[polars]).
  • ExcelPipeline writes items to an Excel .xlsx file (requires pip install silkworm-rs[excel]).
  • YAMLPipeline writes items to a YAML file (requires pip install silkworm-rs[yaml]).
  • AvroPipeline writes items to an Avro file with optional schema (requires pip install silkworm-rs[avro]).
  • ElasticsearchPipeline sends items to an Elasticsearch index (requires pip install silkworm-rs[elasticsearch]).
  • MongoDBPipeline sends items to a MongoDB collection (requires pip install silkworm-rs[mongodb]).
  • MySQLPipeline sends items to a MySQL database table as JSON (requires pip install silkworm-rs[mysql]).
  • PostgreSQLPipeline sends items to a PostgreSQL database table as JSONB (requires pip install silkworm-rs[postgresql]).
  • S3JsonLinesPipeline writes items to AWS S3 in JSON Lines format using async OpenDAL (requires pip install silkworm-rs[s3]).
  • VortexPipeline writes items to a Vortex file for high-performance columnar storage with 100x faster random access and 10-20x faster scans compared to Parquet (requires pip install silkworm-rs[vortex]).
  • WebhookPipeline sends items to webhook endpoints via HTTP POST/PUT using wreq (same HTTP client as the spider) with support for batching and custom headers.
  • GoogleSheetsPipeline appends items to Google Sheets with automatic flattening of nested data structures (requires pip install silkworm-rs[gsheets] and service account credentials).
  • SnowflakePipeline sends items to Snowflake data warehouse tables as JSON (requires pip install silkworm-rs[snowflake]).
  • FTPPipeline writes items to an FTP server in JSON Lines format (requires pip install silkworm-rs[ftp]).
  • SFTPPipeline writes items to an SFTP server in JSON Lines format with support for password or key-based authentication (requires pip install silkworm-rs[sftp]).
  • CassandraPipeline sends items to Apache Cassandra database tables (requires pip install silkworm-rs[cassandra]).
  • CouchDBPipeline sends items to CouchDB databases as documents (requires pip install silkworm-rs[couchdb]).
  • DynamoDBPipeline sends items to AWS DynamoDB tables with automatic table creation (requires pip install silkworm-rs[dynamodb]).
  • DuckDBPipeline sends items to a DuckDB database table as JSON (requires pip install silkworm-rs[duckdb]).
  • CallbackPipeline invokes a custom callback function (sync or async) on each item, enabling inline processing logic without creating a full pipeline class. See example below.
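
As referenced above, opting a single request into CloudflareCrawlMiddleware might look like this sketch (assumptions: meta can be passed to the Request constructor, and a truthy value is enough to trigger the middleware; check the middleware's docstring for the exact options it accepts):

from silkworm import Request, Spider

class RenderedSpider(Spider):
    name = "rendered"

    async def start_requests(self):
        # Opt this one request into a Browser Rendering crawl job.
        # Assumption: a truthy value is sufficient; the middleware may
        # also accept a dict of crawl options here.
        yield Request(
            url="https://js-heavy.example.com/",
            callback=self.parse,
            meta={"cloudflare_crawl": True},
        )

    async def parse(self, response):
        # The middleware hands this callback a synthetic JSON Response
        # carrying the final crawl API payload.
        yield {"fetched": True}

CloudflareCrawlMiddleware() still needs to be registered with run_spider for the meta flag to have any effect.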

Using CallbackPipeline for custom processing

Process items with custom callback functions without creating a full pipeline class:

from silkworm.pipelines import CallbackPipeline

# Sync callback
def print_item(item, spider):
    print(f"[{spider.name}] {item}")
    return item

# Async callback
async def validate_item(item, spider):
    # Could do async operations like database checks
    if len(item.get("text", "")) < 10:
        print(f"Warning: Short text in item")
    return item

# Modifying callback
def enrich_item(item, spider):
    item["spider_name"] = spider.name
    item["processed"] = True
    return item

run_spider(
    QuotesSpider,
    item_pipelines=[
        CallbackPipeline(callback=print_item),
        CallbackPipeline(callback=validate_item),
        CallbackPipeline(callback=enrich_item),
    ],
)

Callbacks receive (item, spider) and should return the processed item; returning None passes the original item through unchanged.

Streaming items to a queue with TaskiqPipeline

Stream scraped items to a Taskiq queue for distributed processing:

from taskiq import InMemoryBroker

from silkworm import run_spider
from silkworm.pipelines import TaskiqPipeline

broker = InMemoryBroker()

@broker.task
async def process_item(item):
    # Your item processing logic here
    print(f"Processing: {item}")
    # Save to database, send to another service, etc.

pipeline = TaskiqPipeline(broker, task=process_item)
run_spider(MySpider, item_pipelines=[pipeline])

This enables distributed processing, retries, rate limiting, and other Taskiq features. See examples/taskiq_quotes_spider.py for a complete example.

Handling non-HTML responses

Keep crawls cheap when URLs mix HTML and binaries/APIs:

run_spider(
    MySpider,
    response_middlewares=[SkipNonHTMLMiddleware(sniff_bytes=1024)],
    # Tighten HTML parsing size (bytes) to avoid loading huge bodies into scraper-rs
    html_max_size_bytes=1_000_000,
)

Performance optimization with rsloop

For improved async performance, enable rsloop as a drop-in replacement for asyncio's event loop:

pip install silkworm-rs[rsloop]
# or with uv:
uv pip install silkworm-rs[rsloop]

Then call run_spider_rsloop (same signature as run_spider):

from silkworm import run_spider_rsloop

run_spider_rsloop(
    QuotesSpider,
    concurrency=32,
)

Performance optimization with uvloop

For improved async performance, enable uvloop (a fast, drop-in replacement for asyncio's event loop):

pip install silkworm-rs[uvloop]
# or with uv:
uv pip install silkworm-rs[uvloop]

Then call run_spider_uvloop (same signature as run_spider):

from silkworm import run_spider_uvloop

run_spider_uvloop(
    QuotesSpider,
    concurrency=32,
)

uvloop can provide 2-4x performance improvement for I/O-bound workloads.

Performance optimization with winloop (Windows)

For Windows users who want improved async performance, enable winloop (a Windows-compatible alternative to uvloop):

pip install silkworm-rs[winloop]
# or with uv:
uv pip install silkworm-rs[winloop]

Then call run_spider_winloop (same signature as run_spider):

from silkworm import run_spider_winloop

run_spider_winloop(
    QuotesSpider,
    concurrency=32,
)

winloop provides significant performance improvements on Windows, similar to what uvloop offers on Unix-like systems.

Running spiders with trio

If you prefer trio over asyncio, you can use run_spider_trio instead of run_spider:

pip install silkworm-rs[trio]
# or with uv:
uv pip install silkworm-rs[trio]

Then use run_spider_trio:

from silkworm import run_spider_trio

run_spider_trio(
    QuotesSpider,
    concurrency=16,
    request_timeout=10,
)

This runs your spider using trio as the async backend via the trio-asyncio compatibility layer.

JavaScript rendering with Lightpanda (CDP)

For pages that require JavaScript execution, you can use Lightpanda (or any CDP-compatible browser) instead of the standard HTTP client. This uses the Chrome DevTools Protocol (CDP) to control a browser.

Installation

pip install silkworm-rs[cdp]
# or with uv:
uv pip install silkworm-rs[cdp]

Starting Lightpanda

lightpanda --remote-debugging-port=9222

Or use Chrome/Chromium:

chromium --remote-debugging-port=9222 --headless

Using CDP in your spider

There are two ways to use CDP: the convenience API or custom spider integration.

Convenience API (simple one-off fetches)

import asyncio
from silkworm import fetch_html_cdp

async def main():
    # Fetch HTML with JavaScript rendering
    text, doc = await fetch_html_cdp(
        "https://example.com",
        ws_endpoint="ws://127.0.0.1:9222",
        timeout=30.0
    )
    
    # Extract data from rendered page
    title = await doc.select_first("title")
    print(title.text if title else "No title")

asyncio.run(main())

Full Spider Integration

from silkworm import HTMLResponse, Request, Response, Spider
from silkworm.cdp import CDPClient

class LightpandaSpider(Spider):
    name = "lightpanda"
    start_urls = ("https://example.com/",)

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._cdp_client = None

    async def start_requests(self):
        # Connect to CDP endpoint
        self._cdp_client = CDPClient(
            ws_endpoint="ws://127.0.0.1:9222",
            timeout=30.0
        )
        await self._cdp_client.connect()
        
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return
        
        # Extract links from JavaScript-rendered page
        for link in await response.select("a"):
            href = link.attr("href")
            if href:
                yield {"url": href}

    async def close(self):
        if self._cdp_client:
            await self._cdp_client.close()

See examples/lightpanda_simple.py and examples/lightpanda_spider.py for complete working examples.

Note: CDP support is experimental. For production use, consider using dedicated browser automation tools or the standard HTTP client when JavaScript rendering is not required.

Logging and crawl statistics

  • Structured logs via logly; set SILKWORM_LOG_LEVEL=DEBUG for verbose request/response/middleware output.
  • Periodic statistics with log_stats_interval; final stats always include elapsed time, queue size, requests/sec, seen URLs, items scraped, errors, and memory MB.
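
For example, to turn on verbose logging for a single run (a sketch; the variable can equally be exported in the shell before launching, and the in-process approach assumes the level is read when silkworm's logger initializes):

import os

# Set the level before silkworm's logger is initialized
os.environ["SILKWORM_LOG_LEVEL"] = "DEBUG"

from silkworm import run_spider  # import after setting the variable

# QuotesSpider as defined in the quick start
run_spider(QuotesSpider, log_stats_interval=30)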

Limitations

  • By default, HTTP fetches are wreq-based without JavaScript execution; pages requiring client-side rendering can use the optional CDP integration (see "JavaScript rendering with Lightpanda" section) or external browser automation tools.
  • Request deduplication keys only on Request.url; query params, HTTP method, and body are ignored, so same-URL requests with different params/data are dropped unless you set dont_filter=True or make the URL unique yourself (see the sketch after this list).
  • HTML parsing auto-detects encoding (BOM, HTTP headers/meta, charset detection fallback) but still enforces a html_max_size_bytes/doc_max_size_bytes cap (default 5 MB) in scraper-rs selectors, so very large pages may need a higher limit or preprocessing.
  • Several pipelines buffer all items in memory until close (PolarsPipeline, ExcelPipeline, YAMLPipeline, AvroPipeline, VortexPipeline, S3JsonLinesPipeline, FTPPipeline, SFTPPipeline), which can bloat RAM on long crawls; prefer streaming pipelines like JsonLines/CSV/SQLite for high-volume runs.
  • Many destination pipelines rely on optional extras; CassandraPipeline is disabled on Windows because cassandra-driver depends on libev there.
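
As noted in the deduplication bullet above, a sketch of the workaround (using only Request keywords shown elsewhere in this README):

from silkworm import Request, Spider

class SearchSpider(Spider):
    name = "search"

    async def start_requests(self):
        # The same URL would normally be fetched only once;
        # dont_filter=True tells the deduplicator to keep every occurrence.
        for _ in range(2):
            yield Request(
                url="https://example.com/search",
                callback=self.parse,
                dont_filter=True,
            )
        # Alternative: bake the varying state into the URL itself so each
        # request's URL is unique without disabling the filter.

    async def parse(self, response):
        yield {"fetched": True}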

Examples

  • python examples/quotes_spider.py → data/quotes.jl
  • python examples/quotes_spider_trio.py → data/quotes_trio.jl (demonstrates trio backend)
  • python examples/quotes_spider_winloop.py → data/quotes_winloop.jl (demonstrates winloop backend for Windows)
  • python examples/hackernews_spider.py --pages 5 → data/hackernews.jl
  • python examples/lobsters_spider.py --pages 2 → data/lobsters.jl
  • python examples/url_titles_spider.py --urls-file data/url_titles.jl --output data/titles.jl (includes SkipNonHTMLMiddleware and stricter HTML size limits)
  • python examples/export_formats_demo.py --pages 2 → JSONL, XML, and CSV outputs in data/
  • python examples/taskiq_quotes_spider.py --pages 2 → demonstrates TaskiqPipeline for queue-based processing
  • python examples/sitemap_spider.py --sitemap-url https://example.com/sitemap.xml --pages 50 → data/sitemap_meta.jl (extracts meta tags and Open Graph data from sitemap URLs)
  • python examples/lightpanda_simple.py → demonstrates CDP/Lightpanda for JavaScript rendering (requires pip install silkworm-rs[cdp] and running Lightpanda)
  • python examples/lightpanda_spider.py → full spider example using CDP/Lightpanda

Convenience API

For one-off fetches without a full spider:

Standard HTTP fetch

import asyncio
from silkworm import fetch_html

async def main():
    text, doc = await fetch_html("https://example.com")
    title = await doc.select_first("title")
    print(title.text if title else "No title")

asyncio.run(main())

CDP-based fetch (with JavaScript rendering)

import asyncio
from silkworm import fetch_html_cdp

async def main():
    # Requires Lightpanda/Chrome running with CDP enabled
    text, doc = await fetch_html_cdp("https://example.com")
    title = await doc.select_first("title")
    print(title.text if title else "No title")

asyncio.run(main())

Contributing

Pull requests and issues are welcome. To set up a dev environment, install uv, create a Python 3.13 virtualenv, and sync dev dependencies:

uv venv --python python3.13
uv sync --group dev

Run the checks before opening a PR:

just fmt && just lint && just typecheck && just test

Acknowledgements

Silkworm is built on top of excellent open-source projects:

  • wreq - HTTP client with browser impersonation capabilities
  • scraper-rs - Fast HTML parsing library
  • logly - Structured logging
  • rxml - XML parsing and writing

We are grateful to the maintainers and contributors of these projects for their work.

License

MIT License. See LICENSE for details.