Oh Yes You Can Use Regexes to Parse HTML

38 points by luuuzeta 3 years ago · 21 comments


So he is using a full-blown parser, but some part of the tokenisation is done with regexes.

I call BS.

Also, I'm pretty sure it will miss some "<" nested somewhere (in an attribute, CDATA, JS, etc.) that is not a tag but will confuse the parser.

I have used regexes to parse HTML; it works fine for quick and dirty scripts that need a small chunk of data from a limited sample of pages. Which I believe is the message he is trying to convey.

But I'd rather keep the legend of the infamous SO post against parsing HTML because:

- it will help the people that need it the most to avoid making mistakes

- it's fun, and part of our culture.

im3w1l 3 years ago
I have a fun story about this. Once I was trying to get data out of this one API that served XML. First I wrote a solution using regexes. Because of confusion elsewhere in the thread, I want to really clarify that I didn't parse the whole thing with one big regex. But neither were they used merely for tokenization. Somewhere in between. It had stuff like this (from memory, may not actually be valid regex):
```
  <someelement attribute1=\"([^"]+)\" attribute2=\"([^"]+)\"/>
```
It worked perfectly. Then I heard that parsing with regex was a bad thing and you should use a proper parser, so I switched. The parser worked for a short time until I got an error about invalid XML. See, one of the attributes contained a heart "<3" - this is actually not allowed in XML! It has to be escaped even in attributes. I went back to the regex solution, and it kept chugging along for years on their invalid XML.
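
A minimal sketch of the situation described above, using only the Python standard library. The payload, the element name, and the helper names are made up for illustration; the regex is a cleaned-up version of the one quoted from memory.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical payload: the raw "<" inside attribute1 makes this invalid XML.
payload = '<someelement attribute1="I <3 regexes" attribute2="42"/>'

# Regex in the spirit of the comment: capture the two attribute values.
PATTERN = re.compile(r'<someelement attribute1="([^"]+)" attribute2="([^"]+)"/>')


def extract_with_regex(text):
    """Return (attribute1, attribute2), or None if the pattern does not match."""
    match = PATTERN.search(text)
    return match.groups() if match else None


def extract_with_parser(text):
    """Return (attribute1, attribute2), or None if the XML is not well-formed."""
    try:
        element = ET.fromstring(text)
    except ET.ParseError:
        return None
    return element.get("attribute1"), element.get("attribute2")


regex_result = extract_with_regex(payload)    # the regex does not care about the "<"
parser_result = extract_with_parser(payload)  # the strict parser rejects the document
```

The regex succeeds precisely because it knows nothing about XML's rules; whether that is a feature or a bug depends on how much you trust the feed.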

Name_Chawps 3 years ago

This "uses" regexes to parse HTML in the same way that Sunny D is "made with" 100% orange juice.

egberts1 3 years ago

I know of no Regex pattern that can handle all the old and new HTML as well as HTML5: believe me, as one who is looking to put an HTML parser on FPGA/ASIC for higher speed, I've actually forayed down this rabbit hole a few times in the fruitless pursuit of identifying this elusive pure Regex pattern for HTML, et al. The problem is Regex's lack of support for multiple state machines and the interactions needed between them.

The language Perl came closest to the smallest HTML parser.

Before doing simplistic regex on HTML, multiple preliminary Regex passes are probably required, probably in this order (my 20-year-old memory is failing here):

- de-CDATA

- De-pairing of quotes

- De-symbolization of HTML symbols, entities, and codes (de-escaping)

- Handling of lone unterminated tags (i.e. <p>)

All of that comes before you can even attempt pairing <XXX> with </XXX> and getting to the HTML tags and attributes.

In short, additional scripting is required to apply multiple Regex patterns before one can even begin to properly parse the HTML.

The simplest I've gotten uses both bash logic and Regex, but it fails on certain HTML codes.

Federico Tomassetti, a renowned expert on domain-specific languages and transpilers, covers nearly all the valid libraries in many modern languages just for parsing HTML.

Federico makes it easier for a first-time HTML parser coder to take that first step: selecting an HTML parser library.

https://tomassetti.me/parsing-html/
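
The multi-pass idea above can be sketched roughly as follows. This is an illustration only, not a spec-compliant tokenizer: the pass order differs from the list (entities are decoded per text token after tag matching, so a decoded "&lt;" cannot be mistaken for a tag), unterminated tags are left for a later pairing stage, and every name here is hypothetical.

```python
import html
import re

# Pass 1 pattern: CDATA sections, whose content may contain raw "<" and ">".
CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)

# Pass 2 pattern: a tag whose quoted attribute values may contain ">".
TAG = re.compile(r'''<(/?)([A-Za-z][A-Za-z0-9]*)((?:"[^"]*"|'[^']*'|[^'">])*)>''')


def strip_cdata(text):
    """Replace each CDATA section with its escaped content."""
    return CDATA.sub(lambda m: html.escape(m.group(1)), text)


def tokenize(text):
    """Split text into ("open" | "close", name) and ("text", value) tokens."""
    tokens = []
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            # Pass 3: decode entities only inside text tokens.
            tokens.append(("text", html.unescape(text[pos:m.start()])))
        kind = "close" if m.group(1) else "open"
        tokens.append((kind, m.group(2).lower()))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", html.unescape(text[pos:])))
    return tokens


sample = '<p title="a > b">x &amp; y<![CDATA[1 < 2]]></p>'
tokens = tokenize(strip_cdata(sample))
```

Even this small sketch already needs driver code around the patterns, which is the point being made: the regexes alone do not do the job.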

stevefan1999 3 years ago

Regex, if extended, can go as far as being Turing-complete.
Meanwhile a regular expression (the OG Regex) is just an NFA and should be easier to implement in a circuit. The problem is that an NFA circuit can still need exponential expansion (if converted to a DFA, whose states are the power set of the NFA states), and with Turing-complete Regex you have the halting problem -- both are hellish to solve unless P=NP
- egberts1 3 years ago
  
  Yeah. Doing this Regex stack tracking of multiple but separate data streams at the firmware level requires some CPU assist for this "halting" problem.

jerf 3 years ago

First, there's the obvious problem of failing to distinguish between "parsing" and merely "tokenizing". The latter was generally possible. In fact, IIRC, the famous Zalgo rant (linked in the post), while fun and true in a sense, is actually posted to a bad question for it, as the question asked is actually perfectly solvable by regular expressions, even conventional ones without backwards matches or any other fancy PCRE additions.

However, I'm not even sure that you can any longer even tokenize HTML with regular expressions, because one of the most important aspects of HTML5 was to formalize a strict definition of how to sloppily parse HTML. Yes, that may sound like a contradiction, but it isn't, check the sentence again. It formalized what the browsers were already doing and harmonized how to handle the broken HTML that people actually produce. As one might expect from something that is the harmonization of the decade+ accumulation of the heuristics developed by at least three major streams of browsers (more depending on how you count), it is not exactly simple.

I guess I can't guarantee you couldn't embed all this into a regular expression: https://html.spec.whatwg.org/multipage/parsing.html#parse-st... but the result would not be worth it. Use a standard HTML parser.

Now, obviously, I'm taking a strict view of the term "HTML" in this case. Regular expressions can certainly be used to extract things from documents that you choose to view as a particular approximation of HTML. I've done it before and I'll probably do it again. But when I do, I'm not actually envisioning myself as "parsing HTML", what I'm doing is parsing a byte stream that happens to be HTML, but I'm just hacking around and getting something that works for the exact format this particular document happens to be in, which is a highly, highly restricted subset of HTML, especially since I probably only care about a very small part of it. But it's also an unspecified subset of HTML and may change without warning at any time, and I need to deal with that.

If I care about a lot of it, I find myself an HTML parser and an XPath implementation. If you do this a lot, it's worth learning, as it's very, very powerful and faster to develop with than regexes once you know what you're doing. If it's anything beyond the most trivial thing, I preferentially reach for this now that I've learned it. But there is a non-trivial learning curve to it. If you're just grabbing a particular price out of a page once, by all means use regexes.
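
To make the contrast concrete, here is a small sketch of both approaches on a made-up page. A loud assumption: it uses the standard library's `xml.etree.ElementTree`, which only accepts well-formed markup and supports a limited XPath subset; real-world HTML would need an actual HTML parser with a full XPath implementation. The page content and variable names are invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment.
page = """<html><body>
<div class="item"><span class="name">Widget</span><span class="price">9.99</span></div>
<div class="item"><span class="name">Gadget</span><span class="price">24.50</span></div>
</body></html>"""

# Quick and dirty: grab the first price with a regex tied to this exact markup.
first_price = re.search(r'<span class="price">([^<]+)</span>', page).group(1)

# Tree plus path queries: pair every item name with its price.
root = ET.fromstring(page)
prices = {
    item.findtext("span[@class='name']"): float(item.findtext("span[@class='price']"))
    for item in root.findall(".//div[@class='item']")
}
```

The regex breaks as soon as the attribute order or whitespace changes; the path query only depends on the structure it names.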

gigel82 3 years ago

That's not really parsing HTML; well, I guess it is technically speaking parsing it, but most people understand building a tree (DOM) when they think of parsing HTML and that's not what those regex programs do.
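
A sketch of the distinction, under heavy assumptions: a tiny, attribute-free, well-nested tag subset that is nowhere near real HTML. The regex only produces tokens; the tree comes from an explicit stack. The function and variable names are illustrative.

```python
import re

# Tokens: an opening or closing tag, or a run of text without "<".
TOKEN = re.compile(r"<(/?)([a-z][a-z0-9]*)>|([^<]+)")


def parse(markup):
    """Build a nested (tag, children) tree; stray "<" characters are skipped silently."""
    root = ("#root", [])
    stack = [root]
    for m in TOKEN.finditer(markup):
        closing, tag, text = m.groups()
        if text is not None:
            stack[-1][1].append(text)
        elif closing:
            # A closing tag must match the innermost open element.
            if len(stack) == 1 or stack[-1][0] != tag:
                raise ValueError(f"unexpected </{tag}>")
            stack.pop()
        else:
            node = (tag, [])
            stack[-1][1].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"unclosed <{stack[-1][0]}>")
    return root


tree = parse("<ul><li>one</li><li>two <b>bold</b></li></ul>")
```

Remove the stack and all that is left is a flat token list, which is roughly what the regex-only programs produce.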

Tainnor 3 years ago

HTML is not regular, so it can't be recognised by a "theoretical" regular expression, such as the ones introduced in a theoretical CS class. Modern regex engines, however, are more powerful and can recognise non-regular languages too.

Then there's a distinction to be made between recognising a language and parsing it.

This article goes into more detail: https://www.npopov.com/2012/06/15/The-true-power-of-regular-...
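
A small illustration of that extra power, using nothing more than a backreference in Python's built-in `re` module. The language a^n b a^n (the same number of a's on both sides of a single b) is a textbook non-regular language; the names below are made up for the example.

```python
import re

# \1 must repeat exactly what group 1 captured, so both runs of "a" have equal length.
SAME_RUN = re.compile(r"(a*)b\1")


def is_a_n_b_a_n(s):
    """Recognise a^n b a^n; this answers yes or no but produces no parse tree."""
    return SAME_RUN.fullmatch(s) is not None


checks = {s: is_a_n_b_a_n(s) for s in ["b", "aba", "aaabaaa", "aabaaa", "aaab"]}
```

Note that this only recognises the language, which ties into the distinction between recognising and parsing made above.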

jove_ 3 years ago

As everyone has pointed out, this does not count. Note that the idea that regex can't parse html is specific and proven. What it means is that you can't write an expression that matches both the opening and matching closing tags. There's no way to handle nested tags within a single regex. It's only possible to write a regex that matches up to a finite nesting limit.
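
The finite nesting limit can be shown by generating the pattern mechanically. This sketch assumes a toy language with attribute-free `<div>` tags only; the helper name and the depth convention (the outermost element counts as depth 1) are choices made for this example.

```python
import re


def nested_div_pattern(max_depth):
    """Compile a regex matching <div>...</div> nested at most max_depth levels deep."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    inner = r"[^<]*"  # content with no child elements at all
    for _ in range(max_depth - 1):
        # Allow any number of child divs, each one level shallower than this one.
        inner = rf"(?:[^<]*<div>{inner}</div>)*[^<]*"
    return re.compile(rf"<div>{inner}</div>")


depth_two = nested_div_pattern(2)
fits = depth_two.fullmatch("<div>a<div>b</div>c</div>") is not None
too_deep = depth_two.fullmatch("<div><div><div></div></div></div>") is not None
```

Every extra level of nesting needs a larger pattern, so no single fixed pattern of this kind covers arbitrary depth.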

im3w1l 3 years ago
I think this is the difference between the theoretician and the practitioner. You see, your interpretation is the obvious one for the former. But as any practitioner can tell you, a regular expression can't even parse a regular language!
See, normally the whole point of parsing something is to get data out, right? And the way a regex gets data out is through capture groups. But herein lies the issue: a capture group can only capture one piece of information!
Consider a simple regular language: a non-empty sequence of comma-separated positive integers. We would like to get the integers out. An attempt:
```
  (\d+)(,(\d+))*
```
The first group captures the first number, the second group is just something we introduced for the purpose of writing the regex, we don't care about the value. The third (inner) group should ideally capture all the subsequent numbers separately. But it doesn't! If you try to run that regex on 1,2,3,4,5,6,7,8,9 you will find that group 1 matches 1. And group 3 matches 9. Where did all the other numbers go?!
So really, you have to give the regex some outside help, maybe an outside loop, maybe splitting on a regex rather than parsing with one. Even for this simple language!
And when you are already doing that, the step to giving it a bit more help, perhaps a stack, is quite small.
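
The behaviour described above is easy to reproduce with Python's `re` module, along with the two kinds of outside help mentioned (a loop over matches, or splitting on a regex). Variable names are illustrative.

```python
import re

text = "1,2,3,4,5,6,7,8,9"

# One regex, three groups: a repeated group keeps only its last repetition.
match = re.fullmatch(r"(\d+)(,(\d+))*", text)
first = match.group(1)         # the first number
last_repeat = match.group(3)   # only the final repetition survives

# Outside help, option 1: loop over every match of a simpler pattern.
found = [int(n) for n in re.findall(r"\d+", text)]

# Outside help, option 2: split on the separator instead of parsing with one regex.
split = [int(n) for n in re.split(r",", text)]
```
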
Tainnor 3 years ago

That is true of "theoretical" regexes, not of the ones actually used by modern languages.

valbaca 3 years ago

"You cannot make an alcoholic drink with water."

OP: "Oh Yes You Can Use Water to Make A Hard Drink. AH! But if I freeze water and pour in whiskey, I've used water to make an alcoholic drink."

-.-

srgpqt 3 years ago

What are hard seltzers?
Hard seltzer is a popular alcoholic drink that combines alcohol with flavored carbonated water. Compared to many other alcoholic drinks, hard seltzer is lower in alcohol content, calories, and sugar.
- valbaca 3 years ago
  
  > combines alcohol with flavored carbonated water
  COMBINES. That's exactly what I'm getting at. Water (alone) cannot make an alcoholic drink. Neither can regexes (alone) parse HTML. But of course, you can suspend alcohol within carbonated water and make a hard drink. In the same way you can utilize regexes to parse HTML.

wantguns 3 years ago

wodenokoto 3 years ago

The default engine in Beautiful Soup is/was a "regex engine". Just saying.

warrenm 3 years ago

Can?

Yep

Should?

Most likely .. no :)
