promptware/readweb: read_web LLM tool that cleans up page contents intelligently, saving tokens for your LLM workflow


Smart HTML → Markdown Scraper

A specialized pipeline for extracting clean, token-efficient markdown from websites.

Problem

Naive HTML -> Markdown conversion produces a ton of garbage that wastes tokens and pollutes LLM workflows. Typical noise includes:

  • Navigation panels
  • Popups
  • Cookie consent banners
  • Table of contents
  • Headers / footers
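A toy illustration of the problem (not the repo's code; the function and sample page below are hypothetical): a converter that merely strips tags treats every piece of page chrome as content, so the noise above survives into the output.

```typescript
// Naive "conversion": strip tags, collapse whitespace. Everything that was
// visible text on the page survives, including nav, banners, and footers.
function naiveHtmlToText(html: string): string {
  return html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

// A tiny made-up page: one paragraph of real content surrounded by chrome.
const page =
  "<nav>Home | About | Pricing</nav>" +
  '<div class="cookie-banner">We use cookies. Accept?</div>' +
  "<article><p>The actual article body.</p></article>" +
  "<footer>© 2024 Example Corp</footer>";

console.log(naiveHtmlToText(page));
// "Home | About | Pricing We use cookies. Accept? The actual article body. © 2024 Example Corp"
```

Only one of those four fragments is worth sending to an LLM; the rest is pure token waste.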

Solution

This project implements three pipelines:

  1. "Page preset" generation: HTML -> Preset:

```typescript
type CSSSelector = string;

type Preset = {
    // anchors that make this preset fragile on purpose:
    // elements identifying the website engine's layout go here
    preset_match_detectors: CSSSelector[];
    // main content extractors
    main_content_selectors: CSSSelector[];
    // filter selectors that trim the main content:
    // banners, subscription forms, sponsor content
    main_content_filters: CSSSelector[];
};
```

Preset generation uses a feedback loop that repeatedly enhances and applies the preset until the resulting markdown is clean.

  2. Applying a page preset: Preset + HTML -> Markdown

  3. Programmatic mozilla/readability (a.k.a. "reader mode") as an HTML -> Markdown API, included to compare how far naive heuristics can get on the modern web.
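A minimal sketch of how a preset could gate and drive extraction (pipeline 2). This is not the repo's implementation: a real version would run selectors against a DOM (e.g. via cheerio), while here a page is modeled as a toy map from selector to matching HTML, and all function names are hypothetical.

```typescript
type CSSSelector = string;
// Toy stand-in for a parsed page: selector -> outerHTML of the match.
type ToyDom = Map<CSSSelector, string>;

type Preset = {
  preset_match_detectors: CSSSelector[]; // fragile on purpose: engine-layout anchors
  main_content_selectors: CSSSelector[]; // where the article body lives
  main_content_filters: CSSSelector[];   // banners, subscription forms, sponsors
};

// A preset applies only if every detector matches the page. Fragility is the
// point: a layout change should invalidate the preset rather than silently
// extract the wrong content.
function presetMatches(preset: Preset, dom: ToyDom): boolean {
  return preset.preset_match_detectors.every((sel) => dom.has(sel));
}

// Extract the main content, then cut out every filtered fragment.
function applyPreset(preset: Preset, dom: ToyDom): string {
  if (!presetMatches(preset, dom)) {
    throw new Error("preset does not match this page");
  }
  const junk = preset.main_content_filters
    .map((sel) => dom.get(sel))
    .filter((html): html is string => html !== undefined);
  return preset.main_content_selectors
    .map((sel) => dom.get(sel) ?? "")
    .map((html) => junk.reduce((acc, j) => acc.split(j).join(""), html))
    .join("\n");
}

// Usage with a made-up WordPress-like page.
const dom: ToyDom = new Map([
  [".wp-site-blocks", "<div>…</div>"], // engine-layout anchor
  ["article.post", "<h1>Title</h1><aside>Subscribe!</aside><p>Body</p>"],
  ["aside", "<aside>Subscribe!</aside>"],
]);
const preset: Preset = {
  preset_match_detectors: [".wp-site-blocks"],
  main_content_selectors: ["article.post"],
  main_content_filters: ["aside"],
};
console.log(applyPreset(preset, dom)); // "<h1>Title</h1><p>Body</p>"
```

The feedback loop from pipeline 1 would then convert this trimmed HTML to markdown, score it for leftover noise, and grow `main_content_filters` until the score is acceptable.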

Try it

I deployed a demo for you to try: https://readweb.osint.moe/ (temporary: it may run out of Firecrawl credits).

(screenshot: demo page)

It compares these methods side by side:

  • our preset generation flow
  • Firecrawl URL -> markdown
  • literal HTML -> markdown (similar to Firecrawl, but not exactly the same)
  • Mozilla's Readability (reader mode)

To run the demo yourself:

  1. Populate .env (see .env.example); Firecrawl is used for HTML fetching
  2. pnpm install
  3. pnpm run start:web