# Smart HTML → Markdown Scraper
A specialized pipeline for extracting clean, token-efficient markdown from websites.
## Problem
Naive HTML -> Markdown conversion produces a ton of garbage that wastes tokens and pollutes LLM workflows. Typical noise includes:
- Navigation panels
- Popups
- Cookie consent banners
- Table of contents
- Headers / footers
## Solution
This project implements three pipelines:
- "Page preset" generation: HTML -> Preset:
type Preset = { // anchors to make this preset more fragile on purpose. // Elements that identify website engine layout go here. preset_match_detectors: CSSSelector[]; // main content extractors main_content_selectors: CSSSelector[]; // filter selectors to trim the main content. // banners, subscription forms, sponsor content main_content_filters: CSSSelector[]; }; type CSSSelector = string;
  Preset generation uses a feedback loop that enhances and re-applies the preset until the markdown comes out clean (a sketch of this loop follows the list).
- Applying a page preset: Preset + HTML -> Markdown (sketched after this list).
- Programmatic mozilla/readability (a.k.a. "reader mode") as an HTML -> Markdown API, included for comparison with how far naive heuristics get on the modern web (sketch after this list).
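A minimal sketch of applying a preset, assuming `jsdom` for DOM parsing and `turndown` for the HTML -> Markdown step; `applyPreset` and its exact behavior are illustrative, not this project's actual API:

```ts
import { JSDOM } from "jsdom";
import TurndownService from "turndown";

type CSSSelector = string;

type Preset = {
  preset_match_detectors: CSSSelector[];
  main_content_selectors: CSSSelector[];
  main_content_filters: CSSSelector[];
};

// Hypothetical helper: Preset + HTML -> Markdown (null if the preset does not match).
function applyPreset(preset: Preset, html: string): string | null {
  const doc = new JSDOM(html).window.document;

  // The deliberately fragile anchors: every detector must match,
  // otherwise this preset was built for a different engine layout.
  if (!preset.preset_match_detectors.every((sel) => doc.querySelector(sel))) {
    return null;
  }

  // Extract the main content nodes.
  const container = doc.createElement("div");
  for (const sel of preset.main_content_selectors) {
    doc.querySelectorAll(sel).forEach((el) => container.appendChild(el.cloneNode(true)));
  }

  // Trim noise inside the main content (banners, subscription forms, sponsor content).
  for (const sel of preset.main_content_filters) {
    container.querySelectorAll(sel).forEach((el) => el.remove());
  }

  return new TurndownService().turndown(container.innerHTML);
}
```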
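And a hedged sketch of the preset-generation feedback loop, reusing `applyPreset` from above; `proposePreset`, `refinePreset`, `isClean`, and the iteration cap stand in for hypothetical LLM-backed steps that this README does not spell out:

```ts
// Hypothetical LLM-backed helpers; declared here only so the sketch type-checks.
declare function proposePreset(html: string): Promise<Preset>;
declare function refinePreset(preset: Preset, html: string, markdown: string): Promise<Preset>;
declare function isClean(markdown: string): Promise<boolean>;

const MAX_ITERATIONS = 5; // assumed cap, not a documented setting

// Feedback loop: enhance and re-apply the preset until the markdown is clean.
async function generatePreset(html: string): Promise<Preset> {
  let preset = await proposePreset(html);
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const markdown = applyPreset(preset, html) ?? "";
    if (await isClean(markdown)) break;
    preset = await refinePreset(preset, html, markdown);
  }
  return preset;
}
```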
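The Readability baseline can be reproduced with `@mozilla/readability` plus `turndown`; a sketch of that comparison path (not necessarily this project's exact code):

```ts
import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";
import TurndownService from "turndown";

// "Reader mode" as an HTML -> Markdown function: Readability extracts the
// article HTML, turndown converts it to Markdown.
function readerModeMarkdown(html: string, url: string): string {
  const dom = new JSDOM(html, { url }); // url lets Readability resolve relative links
  const article = new Readability(dom.window.document).parse();
  return new TurndownService().turndown(article?.content ?? "");
}
```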
## Try it
I deployed a demo for you to try: https://readweb.osint.moe/ (temporary; it may run out of Firecrawl credits).
It compares these methods side by side:
- our preset generation flow
- Firecrawl URL -> Markdown
- literal HTML -> Markdown (similar to Firecrawl, but not exactly the same)
- Mozilla's Readability (reader mode)
To run the demo yourself:

- Populate `.env` (see `.env.example`); Firecrawl is used for HTML fetching.
- `pnpm install`
- `pnpm run start:web`
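For reference, the Firecrawl JS SDK conventionally reads its key from `FIRECRAWL_API_KEY`, so the `.env` likely needs at least the line below; treat `.env.example` as the authoritative list:

```
# Assumed variable name (Firecrawl SDK convention); check .env.example
FIRECRAWL_API_KEY=fc-...
```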
