Hidden In HTML: Parsing Page Layouts. 2.9B Web Page Analysis

webparsing.io

5 points by benwills a year ago · 1 comment

benwills (OP) a year ago

This is an analysis I put together of the November 2024 Common Crawl HTML/WARC dataset. I counted HTML tag attribute values to identify the most common values per tag+attribute combination. I've done this analysis several times over the years and have found it invaluable when writing parsers.
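To make the counting idea concrete, here is a minimal sketch using only Python's standard library. It is not the pipeline used for the actual analysis: the class name `AttrValueCounter`, the helper `top_values`, and the two sample pages are hypothetical, and it assumes each attribute value is counted as one raw string (a multi-class `class` value is not split into tokens).

```python
from collections import Counter
from html.parser import HTMLParser


class AttrValueCounter(HTMLParser):
    """Counts (tag, attribute, value) triples seen in start tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if value is None:  # bare attribute such as <input disabled>
                continue
            self.counts[(tag, name, value)] += 1


def top_values(counts, tag, attr, n=3):
    """Return the n most common values for one tag+attribute pair."""
    per_pair = Counter(
        {v: c for (t, a, v), c in counts.items() if t == tag and a == attr}
    )
    return per_pair.most_common(n)


# Hypothetical stand-ins for crawled pages.
SAMPLE_PAGES = [
    '<div class="content"><a rel="nofollow" href="/a">a</a></div>',
    '<div class="content"><div class="sidebar"></div>'
    '<a rel="nofollow" href="/b">b</a></div>',
]

counter = AttrValueCounter()
for page in SAMPLE_PAGES:
    counter.feed(page)
counter.close()

print(top_values(counter.counts, "div", "class"))
```

At crawl scale the same tally would presumably need a faster parser and sharded counters, but the tag+attribute+value key is the core of it.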

The post is interactive, letting you search the 500 most common values per tag+attribute. A free SQLite database of the top 1,000 values per tag+attribute is also available for download.
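For anyone who wants to query a table like that from code, a rough sketch follows. The schema here is an assumption, not the layout of the downloadable database: the table `attr_values`, its columns, the helper `top_values_sql`, and the counts are all placeholders built in memory so the example runs on its own.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up counts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attr_values (tag TEXT, attr TEXT, value TEXT, n INTEGER)"
)
conn.executemany(
    "INSERT INTO attr_values VALUES (?, ?, ?, ?)",
    [
        ("div", "class", "row", 20),
        ("div", "class", "container", 30),
        ("div", "class", "clear", 10),
        ("a", "rel", "nofollow", 50),
    ],
)


def top_values_sql(conn, tag, attr, limit=10):
    """Return (value, count) rows for one tag+attribute, most common first."""
    rows = conn.execute(
        "SELECT value, n FROM attr_values "
        "WHERE tag = ? AND attr = ? ORDER BY n DESC LIMIT ?",
        (tag, attr, limit),
    )
    return rows.fetchall()


print(top_values_sql(conn, "div", "class", limit=2))
```

With the real file you would pass its path to `sqlite3.connect` and adjust the table and column names to whatever the database actually uses.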

This is the first post of an 8-part series that builds toward writing an article parser, the lessons from which can be transferred to writing any other kind of parser you might want.

This is my first time publishing content like this, and I'd love any feedback you might have.
