Short bit: lexing and non-lexing scanners (parsing)

For the past few months I’ve been poking around the internet reading what people write about computer language grammar parsing, and I keep running into a point of persistent confusion that has made it harder to sort through the data and opinions. Here’s what the point of confusion is and what’s actually going on.

There’s a class of parsers between the scannerless kind and the lexing kind. This class of parser scans the input (it has a scanner; it chunks the input) without performing lexical analysis (assigning identities/types to those chunks).

  • Scannerless parser: Operates directly on characters, like regexes. No tokenization at all.
  • Mysterious middle class: ?????
  • Lexing: A phase in some parsers that turns the input text into a sequence of tokens with strict lexical identities, like “IDENTIFIER”, “COMMA”, “OPENBRACKET”, etc.

What happens if you try to turn the input text into a sequence of tokens, but just… don’t give them lexical identities? This isn’t theoretical. There are a lot of parsers that do this. Including some fast ones.
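
To make this concrete, here’s a minimal sketch in Python (the chunking regex and the token names are toy stand-ins of my own, not taken from any particular parser). Both scanners chunk the input at the same character spans; only the second assigns lexical identities:

    import re

    # Toy chunking rule: words, or single non-space punctuation characters.
    CHUNK = re.compile(r"\w+|[^\w\s]")

    def scan_nonlexing(text):
        # Scannerful, non-lexing: the parser sees bare substrings.
        return CHUNK.findall(text)

    def scan_lexing(text):
        # Scannerful, lexing: the parser sees (identity, substring) pairs.
        def classify(lexeme):
            if lexeme.isdigit():
                return "NUMBER"
            if lexeme.isidentifier():
                return "IDENTIFIER"
            return "PUNCT"
        return [(classify(lexeme), lexeme) for lexeme in CHUNK.findall(text)]

    print(scan_nonlexing("f(x, 42)"))
    # ['f', '(', 'x', ',', '42', ')']
    print(scan_lexing("f(x, 42)"))
    # [('IDENTIFIER', 'f'), ('PUNCT', '('), ('IDENTIFIER', 'x'),
    #  ('PUNCT', ','), ('NUMBER', '42'), ('PUNCT', ')')]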

Some people assume that “scannerless” must mean “doesn’t have a separate lexical grammar”. Other people assume that “scannerless” must mean “doesn’t perform lexical analysis”. (These two groups overlap somewhat.) Neither assumption is always correct. A tokenizer that takes a String and outputs an Array<String> along the same character spans as a tokenizer for an LR grammar can still be called a scanner. A parser that consumes that tokenizer’s output is still usually called scannerful, even if the lexical grammar is derived from the syntax tree grammar, and even if those substrings have no associated lexical identity.

This isn’t a distinction without a difference: being scannerful is usually a necessary condition for a parser to be Fast. So if you’re reading something about “scannerless” parsers and their behavior or performance, you should be aware that it probably means “doesn’t chunk at all”, rather than “doesn’t give lexical identities” (though, depending on the writer, it can mean the latter, too). Also, lexing can make or break a parsing algorithm. LR requires lexing, because otherwise the transition tables would be impossible to build. Earley discourages lexing, because it supports highly ambiguous grammars and you’re throwing some of that power away if you pre-disambiguate lexical identities.

So:

  • Scannerless (like regex)
  • Scannerful, but non-lexing (like some HTML/JSON parsers) (parser sees strings)*
  • Scannerful, but lexing (like LR parsers) (parser sees lexeme identities)

(* Yes, some HTML/JSON parsers skip lexing, yes, really, yes, I’ve seen it happen with my own eyes, yes, I know that they’re “doing it wrong” in some sense.)

Each level does more disambiguation up front than the last. Scannerful non-lexing parsers can’t parse quite as many ambiguous grammars as scannerless parsers can. Lexing parsers can’t parse quite as many ambiguous grammars as scannerful non-lexing parsers can. Performance also increases with each level. But the difference in performance between scannerless and scannerful-nonlexing is Massive. The difference between scannerful-nonlexing and lexing is real, but relatively small. You can often ignore it (aside from LR parsing requiring lexing to work at all).

If you’re worried about scannerful-nonlexing parsers being slow because they operate on strings, don’t worry. The scanner can intern strings through a cache so that all identical strings share the same pointer or identity. Yes, this is cheap. It can also precompute which regexes can apply to which tokens so that it doesn’t have to check again later during parsing.
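
As a sketch of what that interning can look like (hypothetical names, same toy chunking rule as above):

    import re

    class InterningScanner:
        CHUNK = re.compile(r"\w+|[^\w\s]")

        def __init__(self):
            self._pool = {}  # lexeme text -> canonical instance

        def scan(self, text):
            # setdefault hands back the canonical instance for repeat
            # lexemes, so every occurrence of "foo" is the same object.
            return [self._pool.setdefault(chunk, chunk)
                    for chunk in self.CHUNK.findall(text)]

    scanner = InterningScanner()
    tokens = scanner.scan("foo = foo")
    assert tokens[0] is tokens[2]  # identity check, not a string compare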

Also, if you’re wondering “wait, then how am I supposed to make my LR parser handle soft keywords (an ambiguous lexical grammar)?”, the answer is “on-demand lexing”. If you lex on demand, you can instruct the lexer to ignore certain lexeme classes when you’re in certain states or rules. Yes, this is different from scannerless parsing.
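
Here’s a rough sketch of the idea (the API and the token classes are hypothetical, not any particular generator’s interface): the parser tells the lexer which lexeme classes are valid in its current state, and the lexer classifies a soft keyword accordingly:

    import re

    CHUNK = re.compile(r"\s*(\w+|[^\w\s])")
    SOFT_KEYWORDS = {"match", "case"}

    def next_token(text, pos, allowed):
        m = CHUNK.match(text, pos)
        if m is None:
            return None, pos
        lexeme = m.group(1)
        if lexeme in SOFT_KEYWORDS and "KEYWORD" in allowed:
            kind = "KEYWORD"
        elif lexeme.isidentifier():
            kind = "IDENTIFIER"
        elif lexeme.isdigit():
            kind = "NUMBER"
        else:
            kind = "PUNCT"
        return (kind, lexeme), m.end()

    # At statement start, the parser's state allows KEYWORD...
    token, _ = next_token("match x:", 0, allowed={"KEYWORD", "IDENTIFIER"})
    assert token == ("KEYWORD", "match")
    # ...mid-expression it doesn't, so the same lexeme is an IDENTIFIER.
    token, _ = next_token("match + 1", 0, allowed={"IDENTIFIER", "NUMBER"})
    assert token == ("IDENTIFIER", "match")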

If you had this point of confusion, you weren’t wrong to be confused. This part of parsing theory is taught very poorly, because everyone focuses either on maximum-performance, maximum-reliability things like LR or on fully generalized things meant for natural language processing like CYK. Even more confusingly, PEG/packrat parsing suites are randomly either scannerless or scannerful, and the only easy way to tell is to benchmark them against each other, because the scannerful ones almost always derive the lexical grammar from the syntax tree grammar. So it’s hard to run into this distinction organically if you aren’t personally working on parsers.

Per the Dragon Book (2006 edition):

    Sometimes, lexical analyzers are divided into a cascade of two processes:
        a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
        b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

Read parts A and B carefully. For phase A, “scanning”, prior tokenization is not necessary. For phase B, “lexical analysis proper”, the “scanner” has already produced a sequence of tokens as output. Let’s emphasize that: “… where the scanner produces the sequence of tokens as output” describes work that happens before lexical analysis proper. In other words, tokenization needs to have happened during (or, in even weirder niche architectures, even before) the “scanner” phase. And that scanner phase is treated as, at least “sometimes”, separate from and prior to the “lexical analysis” phase.
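
To illustrate phase A standing alone (toy comment syntax and chunking rule, both invented for this sketch): delete comments, compact whitespace, chunk, and then stop, with no phase B ever assigning lexical identities:

    import re

    def scan_only(text):
        text = re.sub(r"//[^\n]*", "", text)     # deletion of comments
        text = re.sub(r"\s+", " ", text)         # compaction of whitespace
        return re.findall(r"\w+|[^\w\s]", text)  # chunking, never classification

    print(scan_only("let x = 1; // init"))
    # ['let', 'x', '=', '1', ';']  (plain substrings, no lexical identities)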

The hidden assumption the Dragon Book is making here, and why it sounds slightly confused (even though it’s not), is that scanning without eventually doing lexical analysis is useless or rare, so phase A will always eventually feed into phase B, right? I want to note that this assumption was not wrong at the time, nor was it lacking forethought: it was the right way to explain these problems to learners back then. But it’s no longer true: there are now parsing systems that do A without ever bothering with B. For a really long time, though, almost all scanners were tightly bound to lexers, and usually implemented as the same chunk of code, so the distinction temporarily blurred. Today, parsing architectures can be a lot weirder than you’d think!