TL;DR: I made a Rust library (https://crates.io/crates/pred-recdec) where you can write BNF (Backus–Naur form) grammars that are as strong as manually-written recursive descent parsers. This is not a normal parser generator. The philosophy is different.
I’ve been working on parsing-related things for several months now, and I ended up with a design where an actual working C parser took only a few days to build, plus another few days of churning through test cases, once I had the abstractions needed to do it. (C parsers usually take months, not days, to write and test.)
It’s not only for parsing C. You can write arbitrary parsers in it, like JSON, S-expressions, etc.
Performance-wise, it’s slower than native code, but still fast enough to be entirely usable. The C parser is only slightly slower (10-20%) than GCC and Clang in -fsyntax-only mode. The JSON parser is “only” about 5x slower than a highly-optimized pure-native-code JSON implementation like serde_json, despite running in an interpreter and going through a tokenizer (which JSON parsers usually skip). (Yes, my library has a tokenizer. No, you don’t have to write a separate lexical grammar.)
I named the end result “pred recdec”, for “predicated recursive descent”, and you can look at it and use it here:
https://github.com/wareya/pred_recdec/
https://crates.io/crates/pred-recdec
Here’s a pred recdec grammar for S-expressions:
S ::= @peek(0, "(") parenexpr
parenexpr ::=
    @peek(1, ")") "(" ")" # avoid bloating the AST with empty itemlists
    | "(" $become itemlist
itemlist ::=
    @peek(0, ")") ")"
    | @peek(0, "(") parenexpr $become itemlist
    | @auto r`[a-zA-Z_][a-zA-Z_0-9]*|[0-9\.]*`r $become itemlist
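To make the grammar concrete, here’s a hedged sketch of the handwritten recursive descent parser that this grammar describes. This is not pred_recdec’s output or its API (the names and types are my own, for illustration); it just shows the control flow the grammar encodes: @peek becomes a manual lookahead check, and the $become itemlist tail call becomes a loop.

```rust
// Hypothetical handwritten equivalent of the S-expression grammar above.
// Not pred_recdec code -- all names here are illustrative assumptions.
use std::mem;

#[derive(Debug, PartialEq)]
enum Sexpr {
    Atom(String),
    List(Vec<Sexpr>),
}

// Simple whitespace/paren tokenizer (the kind of thing the library's
// built-in tokenizer spares you from writing by hand).
fn tokenize(src: &str) -> Vec<String> {
    let mut toks = Vec::new();
    let mut cur = String::new();
    for c in src.chars() {
        match c {
            '(' | ')' => {
                if !cur.is_empty() { toks.push(mem::take(&mut cur)); }
                toks.push(c.to_string());
            }
            c if c.is_whitespace() => {
                if !cur.is_empty() { toks.push(mem::take(&mut cur)); }
            }
            _ => cur.push(c),
        }
    }
    if !cur.is_empty() { toks.push(cur); }
    toks
}

// parenexpr: expect "(", then items until ")". The item loop plays the
// role of `$become itemlist` -- tail recursion turned into iteration,
// which is exactly the transformation that is awkward to do by hand
// once AST bookkeeping is involved.
fn parse_parenexpr(toks: &[String], mut pos: usize) -> Option<(Sexpr, usize)> {
    if toks.get(pos)? != "(" { return None; } // the @peek(0, "(") guard
    pos += 1;
    let mut items = Vec::new();
    loop {
        match toks.get(pos)?.as_str() {
            ")" => return Some((Sexpr::List(items), pos + 1)), // @peek(0, ")")
            "(" => {
                let (inner, next) = parse_parenexpr(toks, pos)?;
                items.push(inner);
                pos = next;
            }
            atom => { // the @auto regex branch, minus the regex
                items.push(Sexpr::Atom(atom.to_string()));
                pos += 1;
            }
        }
    }
}

fn main() {
    let toks = tokenize("(a (b c) d)");
    let (tree, _) = parse_parenexpr(&toks, 0).unwrap();
    println!("{:?}", tree);
}
```

Note that the grammar’s @peek(1, ")") special case for "( )" exists only to keep empty lists out of the AST; the sketch above handles empty lists through the same loop.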
How does this work? What does it do?
This is basically a BNF-shaped domain specific language for writing handwritten recursive descent parsers.
Normally, handwritten parsers end up being a mess of macros or boilerplate, because you need to duplicate or generate the code for managing AST state and performing lookahead operations. In pred_recdec, as much of that boilerplate as possible is controlled by the shape of the BNF grammar and by syntactically small guards. It also has tail call optimization (useful for left-factoring and lists), which is often hard to do in handwritten parsers because of the AST manipulation they need to perform.
This is different still from how parser generators normally work. Normal parser generators handle local nondeterminism on your behalf, generating lookahead sets or state machines or performing memoization and backtracking, or some combination of those. My library does none of that: you have a direct view of the control flow that the parser actually ends up running, and you need to predicate every alternation/production manually. You don’t have to debug shift-reduce conflicts or first-follow set conflicts or table update rollback failures, just your own control flow logic.
It’s a halfway point. You get most of the benefits of a formal grammar (easier to read, reason about, and dump into programs that manipulate grammars) without throwing away the capabilities of handwritten parsers (control flow is linear, so you can throw in as much impure stateful code as you want, as user-defined hooks or guards).
For that last point, here’s a snippet of my C99 grammar:
primary_expression ::=
    @peekr(0, r`[L]?\x22.*\x22`r) string_literal
    | @auto "_Generic" "(" !hook(many_balanced) ")"
    | @guard(is_ident_not_known_enum) $become_as identifier
    | $become constant
The @guard and !hook items refer to user-defined code. If you really need to, you can write an entire parser or memo engine inside one. You shouldn’t, but you can. That’s where this ended up: it really is just as strong as manually-written recursive descent parsers, but shorter, easier to reason about, and with the non-context-free-grammar parts confined to only where they’re really needed.
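To illustrate why a stateful guard like this is needed at all: C isn’t context-free, so whether an identifier token should parse as a plain identifier or as a constant (e.g. an enum constant) depends on declarations seen earlier in the file. Here’s a hedged sketch of the kind of logic a guard like is_ident_not_known_enum might implement. The struct and method names are my own assumptions, not pred_recdec’s actual hook API.

```rust
// Hypothetical sketch of a stateful parse guard, NOT pred_recdec's API.
// The guard consults mutable parser state built up during the parse.
use std::collections::HashSet;

struct ParserState {
    // Enum constants declared so far, recorded as enum declarations
    // are parsed. This is the non-context-free state the guard needs.
    known_enum_constants: HashSet<String>,
}

impl ParserState {
    // True when the identifier at the lookahead position is NOT a known
    // enum constant, so the `identifier` alternative should fire;
    // otherwise the parser falls through to the `constant` alternative.
    fn is_ident_not_known_enum(&self, token: &str) -> bool {
        !self.known_enum_constants.contains(token)
    }
}

fn main() {
    let mut state = ParserState { known_enum_constants: HashSet::new() };
    // Pretend we already parsed: enum Color { RED };
    state.known_enum_constants.insert("RED".to_string());
    println!("x -> identifier? {}", state.is_ident_not_known_enum("x"));
    println!("RED -> identifier? {}", state.is_ident_not_known_enum("RED"));
}
```

The point is that this impure, order-dependent check lives in one small guard, while the rest of the grammar stays declarative.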
“Isn’t this just ANTLR?” No, ANTLR tries to be ambiguity-tolerant. That causes a whole bunch of deep-reaching differences.