All Aboard the Bootstrap Transpiler Express


[Image: Locomotive]

The Hardly Original Compiler Compiler (hocc) is now fully functional! Jump aboard for some compelling late-stage developments, and stick around to read what comes next for Hemlock development.

hocc developments

hocc development started in early 2022, and it turned out to be a much more challenging project than intended, largely due to the inclusion of the IELR(1) algorithm. See these previous blog posts for more about this and other challenges.

Bootstrapping

hocc is a parser generator, and as the “compiler compiler” moniker suggests, parser generators are a curious part of compiler bootstrapping, because they themselves are compilers which need bootstrapping. I manually wrote a recursive descent parser for hocc, and until quite recently my intention was to keep it indefinitely in order to simplify Hemlock bootstrapping. However, it became clear while implementing the hocc code generator (aka the compiler back end) that there were some creaky parts of the grammar that would require non-trivial refactoring. Anyone who tells you that recursive descent parsers are easy to work with is unlikely to have experienced the productivity gains afforded by a parser generator relative to the soul-crushing, error-prone slog of manual parser refactoring. Consider the following function from pre-bootstrapped hocc, which will serve to illustrate this and other later points.

and code_tl spine ctx =
  let spine = "code_tl" :: spine in
  let ctx', tok = next ~all:true spine ctx in
  match tok with
  | HmcToken {atok=Hmc.Scan.AbstractToken.(
    Tok_indent _|Tok_lparen|Tok_lcapture|Tok_lbrack|Tok_larray
    |Tok_lcurly); _} ->
    mapr ~child:delimited ~f:(fun spine ctx' delimited ->
      map ~child:code_tl ~f:(fun code_tl ->
        CodeTlDelimited {delimited; code_tl}
      ) ~fmt_child:fmt_code_tl spine ctx'
    ) spine ctx
  | HmcToken {atok=Hmc.Scan.AbstractToken.(
    Tok_dedent _|Tok_rparen|Tok_rcapture|Tok_rbrack
    |Tok_rarray|Tok_rcurly|Tok_line_delim|Tok_semi|Tok_bar); _} ->
    reduce spine ctx fmt_code_tl CodeTlEpsilon
  | HmcToken _ as token_ ->
    map ~child:code_tl ~f:(fun code_tl ->
      CodeTlToken {token_; code_tl}
    ) ~fmt_child:fmt_code_tl spine ctx'
  | _ -> reduce spine ctx fmt_code_tl CodeTlEpsilon

There’s some hair in this code (spine and fmt_*) related to parser tracing, a tax that all parsing functions had to pay. And some mildly mind-warping tail-recursive dispatch in mapr that I won’t get into. Now, take a look at the pattern with Tok_dedent and friends in it. That’s a follow set, i.e. the set of token types that can follow the CodeTlEpsilon grammar rule; they are excluded from CodeTlToken. There are footguns here.

  • The follow set is explicitly enumerated rather than calculated/encapsulated, here and elsewhere in the parser. But if we don’t do it this way, then we give up on pattern matching.
  • Manually listing all the valid token types for the CodeTlToken rule would be onerous, so the parser relies on catchall patterns instead, and in fact it has several latent bugs as a result. Yet exhaustively listing the token types would have an even worse copypasta effect than the follow set pattern does.
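To make the first footgun concrete, here is a minimal sketch of the FIRST/FOLLOW computation a parser generator performs so that follow sets never need to be enumerated by hand. The toy grammar mirrors the CodeTl shape; all names are illustrative and the epsilon handling is simplified to direct epsilon productions, so this is not hocc's actual implementation:

```ocaml
(* Minimal FIRST/FOLLOW computation over a toy grammar; illustrative only. *)

module StrSet = Set.Make (String)

type symbol = T of string | N of string  (* terminal / nonterminal *)

(* Toy grammar mirroring the CodeTl shape:
     CodeTl    -> Delimited CodeTl | "token" CodeTl | epsilon
     Delimited -> "lparen" CodeTl "rparen" *)
let productions = [
  ("CodeTl", [N "Delimited"; N "CodeTl"]);
  ("CodeTl", [T "token"; N "CodeTl"]);
  ("CodeTl", []);
  ("Delimited", [T "lparen"; N "CodeTl"; T "rparen"]);
]

(* Simplification: nullable only via a direct epsilon production. *)
let nullable nt = List.exists (fun (lhs, rhs) -> lhs = nt && rhs = []) productions

(* FIRST of a symbol; fine for this grammar (no left recursion). *)
let rec first = function
  | T t -> StrSet.singleton t
  | N nt ->
    List.fold_left (fun acc (lhs, rhs) ->
      match rhs with
      | sym :: _ when lhs = nt -> StrSet.union acc (first sym)
      | _ -> acc)
      StrSet.empty productions

(* FIRST of a symbol sequence, plus whether the sequence can vanish. *)
let rec first_seq = function
  | [] -> StrSet.empty, true
  | T t :: _ -> StrSet.singleton t, false
  | N nt :: rest ->
    let f = first (N nt) in
    if nullable nt then
      let fr, nr = first_seq rest in
      StrSet.union f fr, nr
    else f, false

(* Fixed-point iteration over the standard FOLLOW rules. *)
let follow_sets start =
  let follow = Hashtbl.create 8 in
  let get nt = Option.value (Hashtbl.find_opt follow nt) ~default:StrSet.empty in
  let changed = ref true in
  let add nt set =
    let merged = StrSet.union (get nt) set in
    if not (StrSet.equal merged (get nt)) then begin
      Hashtbl.replace follow nt merged;
      changed := true
    end
  in
  add start (StrSet.singleton "$");  (* end of input follows the start symbol *)
  while !changed do
    changed := false;
    List.iter
      (fun (lhs, rhs) ->
        let rec scan = function
          | [] -> ()
          | T _ :: rest -> scan rest
          | N nt :: rest ->
            let firsts, rest_nullable = first_seq rest in
            add nt firsts;                          (* FIRST(rest) follows nt *)
            if rest_nullable then add nt (get lhs); (* FOLLOW(lhs) leaks in *)
            scan rest
        in
        scan rhs)
      productions
  done;
  follow

let () =
  let follow = follow_sets "CodeTl" in
  (* FOLLOW(CodeTl) = { "$"; "rparen" }: exactly the kind of closing-delimiter
     set the hand-written parser enumerated in its Tok_dedent/Tok_rparen arm. *)
  StrSet.iter print_endline (Hashtbl.find follow "CodeTl")
```

For this grammar the computed FOLLOW(CodeTl) is { rparen, $ }, precisely the kind of set the hand-written parser spelled out token type by token type.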

Contrast that with the equivalent hocc code.

    nonterm CodeTl of nonterm_code_tl ::=
      | delimited:Delimited code_tl:CodeTl ->
        CodeTlDelimited {delimited; code_tl}
      | code_token:CodeToken code_tl:CodeTl ->
        CodeTlCodeToken {code_token; code_tl}
      | epsilon prec pCodeTl -> CodeTlEpsilon

The addition of the CodeToken grammar rule was a direct result of hocc pointing out a grammar ambiguity that the recursive descent parser accidentally dodged via context-sensitive token classification. The bootstrapped hocc parser is more reliable than its predecessor, in addition to being easier to work with.

Follow-on creature comforts

The impetus for bootstrapping the hocc parser was a minor flaw in type syntax, but the ease of refactoring also made two enhancements feasible:

  • In several places it was necessary to do secondary pattern binding inside reduction callback code blocks. This required semi-duplicated callback code blocks with a bunch of boilerplate. For example:
      nonterm CodeToken of X.nonterm_code_token ::=
        # [...]
        | token_:UIDENT ->
          let X.(UIDENT {token_}) = token_ in
          CodeToken {token_}
        | token_:CIDENT ->
          let X.(CIDENT {token_}) = token_ in
          CodeToken {token_}
        | token_:"_" ->
          let X.(USCORE {token_}) = token_ in
          CodeToken {token_}
        | token_:ISTRING ->
          let X.(ISTRING {token_}) = token_ in
          CodeToken {token_}
        # [...]
    

    With richer binding patterns this becomes:

      nonterm CodeToken of nonterm_code_token ::=
        # [...]
        | (UIDENT {token}):UIDENT
        | (CIDENT {token}):CIDENT
        | (USCORE {token}):"_"
        | (ISTRING {token}):ISTRING
        # [...]
        -> CodeToken {token}
    
  • You may have noticed the use of token_ as an identifier above; that was necessary because token is a hocc keyword. It turns out, though, that the keywords can be made contextual, so that aside from some restrictions on the hocc keyword itself, all the keywords can be used in embedded code without issue, as in the CodeToken patterns and reduction callback just above. This would have been really hard to get right in a recursive descent parser, but hocc enabled quick and safe refactoring (followed by quite a bit of misery manually removing all the now-unnecessary _ identifier suffix mangling).
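A contextual keyword scheme like the one described can be sketched as a scanner-side classification step keyed on parse context. The context tracking, type names, and keyword list below are assumptions for illustration, not hocc's actual scanner:

```ocaml
(* Sketch: classify a scanned word as keyword or identifier depending on
   whether the scanner is in grammar position or inside embedded code.
   Hypothetical names; not hocc's actual implementation. *)

type context = Grammar | EmbeddedCode

type token = Keyword of string | Ident of string

let keywords = ["hocc"; "nonterm"; "token"; "prec"; "epsilon"]

let classify ctx word =
  match ctx with
  | Grammar when List.mem word keywords -> Keyword word
  | _ -> Ident word  (* in embedded code, `token` is just an identifier *)
```

With this split, `classify Grammar "token"` yields a keyword while `classify EmbeddedCode "token"` yields a plain identifier, which is what lets reduction callbacks bind `token` without any `_` suffix mangling.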

Bootstrap transpilers

Hemlock development thus far has laid the groundwork for writing the initial Hemlock compiler in OCaml. That currently amounts to ~70,000 lines of OCaml code, not counting ~24,000 lines of hocc-generated code. That’s a lot of groundwork, considering that OCaml is already a good compiler implementation tool. The catch is that we wrote an entire standard library in OCaml called Basis that closely matches the intended Hemlock Basis. The idea was that even though we’d be faced with eventually rewriting the OCaml-based compiler in Hemlock, the rewrite would be largely mechanical due to similar standard libraries, which is a lot easier than a compiler redesign. Sometime while implementing both Hemlock and OCaml back ends for hocc I realized just how mind-numbingly mechanical this rewrite would be. Sounds like a job for a transpiler, better yet two transpilers!

We can use a pair of transpilers, which for now I’ll call mlc for OCaml->Hemlock, and hmc for Hemlock->OCaml. They can share an intermediate representation (IR) that starts off quite simple and grows in functionality as we implement aspects of semantic analysis. The journey from today to a self-hosting Hemlock system requires roughly the following steps.

  • Implement an OCaml scanner/parser and a Hemlock parser (scanner already implemented), both of which target the same IR, barely removed from the respective abstract syntax trees (ASTs).
  • As soon as the current OCaml code can survive OCaml->Hemlock->OCaml transpilation without loss of functionality, do a final OCaml->Hemlock translation, and throw away both the OCaml code and mlc.
  • Incrementally develop sufficient semantic analysis to support a native back end.
  • Implement a native back end (x64 and/or arm64), as well as a native runtime.
  • As soon as the native back end and runtime can generate an hmc binary that can bootstrap Hemlock, discard the OCaml back end? Alternatively, keep the Hemlock language requirements in hmc simple enough that hmc can continue to be bootstrapped by the OCaml compiler.
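The shared-IR idea behind mlc and hmc can be sketched as a single expression type with one emitter per target. The IR constructors, both emitters, and especially the Hemlock surface syntax shown here are hypothetical illustrations, not the actual design:

```ocaml
(* Sketch: one tiny shared IR, two back ends. Illustrative only; the real
   IR would start near the ASTs and grow alongside semantic analysis. *)

type ir =
  | Int of int
  | Var of string
  | Let of string * ir * ir  (* let x = e1 in e2 *)
  | Add of ir * ir

(* OCaml emitter. *)
let rec to_ocaml = function
  | Int n -> string_of_int n
  | Var x -> x
  | Let (x, e1, e2) ->
    Printf.sprintf "let %s = %s in %s" x (to_ocaml e1) (to_ocaml e2)
  | Add (a, b) -> Printf.sprintf "(%s + %s)" (to_ocaml a) (to_ocaml b)

(* Hemlock emitter; the newline-scoped `let` is an assumption about
   Hemlock's surface syntax, made only to show a divergence point. *)
let rec to_hemlock = function
  | Int n -> string_of_int n
  | Var x -> x
  | Let (x, e1, e2) ->
    Printf.sprintf "let %s = %s\n%s" x (to_hemlock e1) (to_hemlock e2)
  | Add (a, b) -> Printf.sprintf "(%s + %s)" (to_hemlock a) (to_hemlock b)

let () =
  let e = Let ("x", Int 1, Add (Var "x", Int 2)) in
  print_endline (to_ocaml e);  (* let x = 1 in (x + 2) *)
  print_endline (to_hemlock e)
```

Round-trip testing then amounts to parsing each emitter's output back into the IR and checking that nothing was lost, which is the survival criterion in the steps above.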

This approach has the obvious advantage of replacing manual OCaml->Hemlock rewriting with the effort of writing mlc. Even better, we can start running non-trivial Hemlock programs much sooner than if the native runtime were a hard dependency. There’s likely to be at least one round of Advent of Code based on the OCaml back end, perhaps even this decade!