Have you ever woken up in the morning feeling like writing a programming language? No? That's... pretty strange. But we're not here to discuss where things went sideways in our lives. Let's just write a programming language instead.
Writing a programming language is a great way to satisfy our latent god complex. We get to create a world with our own rules. Don't like it when someone puts the curly brace on a new line after an if? Throw a syntax error at them! Camel case function names make your skin crawl? Ban them! For some bizarre reason, you want every variable to start with a $? Now's the perfect chance!
A project of this scale usually begins with a ton of upfront planning: from deciding what problem the new language is meant to solve to tiny details like how variable scope works or whether there should be a dedicated operator for string concatenation. We're going to skip all of that and dive straight into the middle of it... or some part of it. Hard to say, since we didn't do the planning.
Table of contents
- First steps
- The lexical analyzer
- Operators
- Precedence
- The syntactic analyzer
- Statements
- The evaluator
- Parentheses
- Numerous options
- Variables
- Control flow
- 'Tis the end
First steps
In the beginning was the plain text. Characters give birth to tokens, from which we grow a syntax tree. The tree is a flexible structure, you can make all sorts of things out of it, like a native binary or bytecode for a virtual machine. But you can also just execute it directly.
To get started, we'll need a programming language, since our own is not in a state yet where it could be written in itself. I chose OCaml because I enjoy tormenting myself with things I barely understand. And I mean that quite literally. At the start of the project, I had zero knowledge of OCaml, and about as much familiarity with functional languages as you can pick up from imperative languages.
For the first milestone, I'd like to convert boolean values into tokens, since they consist only of true and false. Right away, we have to make some decisions. Should these values be separate tokens (TTrue and TFalse), or should there be a single TBool token that can take on true or false? I don't think it's such a big decision. I went with the latter.
type token =
| TBool of bool
| TEOF
let show_token token =
match token with
| TBool b -> Printf.sprintf "TBool(%B)" b
| TEOF -> "TEOF"
OCaml's syntax might feel a bit strange. I'm looking at type as if it's a smarter enum, and match is just a smarter switch.
The token type will hold all of our tokens, and show_token is just a helper function we'll use during development to print things to the screen while our own language doesn't have that capability yet. We also added an end-of-file token, which is meant to signal the end of the program.
The lexical analyzer
Also known as lexer. The fancy name hides a simple mechanism. It reads the plain text one character at a time and builds tokens from it. For this, it helps to know where we are at any given moment, so we define a struct-like construct (I think OCaml calls it a record) so we can store the data we need:
exception LexError of string
type lexer_state = {
src : string;
len : int;
mutable pos : int;
mutable line : int;
mutable col : int;
}
From the perspective of the analyzer, pos is what matters to us. line and col are only there so we can generate more user-friendly error messages.
We'll also define a few helper functions to make our lives easier:
let make_lexer src = {
src;
len = String.length src;
pos = 0;
line = 1;
col = 1;
}
let step ls =
ls.pos <- ls.pos + 1;
ls.col <- ls.col + 1
let step_line ls =
ls.pos <- ls.pos + 1;
ls.line <- ls.line + 1;
ls.col <- 1
let current_char ls = ls.src.[ls.pos]
make_lexer produces an initial state. step is a simple step forward within the same line, while step_line is also moving to a new line. current_char returns the character we're currently looking at.
Thanks to the nature of the language, we'll solve most of our problems with pattern matching and recursion.
let tokenize src =
let ls = make_lexer src in
let rec loop tokens =
if ls.pos < ls.len then
match current_char ls with
| '\n' ->
step_line ls;
loop tokens
| ' ' | '\t' | '\r' ->
step ls;
loop tokens
| 'a'..'z' ->
loop (tokenize_name ls :: tokens)
| c ->
raise (LexError (Printf.sprintf "Unexpected character '%c' on line %d col %d" c ls.line ls.col))
else
TEOF :: tokens
in
List.rev (loop [])
We create an instance of our struct (ls) and a recursive function (loop), which we then call with an empty list. The recursion continues until we reach the end of the source code. At that point, it prepends a TEOF token to the list. For some reason, this language really likes adding things to the front of lists, which is why we have to reverse the list at the end with List.rev.
If we haven't reached the end of the file yet, we examine the current character. For whitespace characters, we simply step forward without adding anything to the list. If we encounter alphabetic data, we interpret it with another function and append it to the tokens; otherwise, we throw an error for an unknown character (where we can use our line and col values).
let tokenize_name ls =
let is_name_char ch = ch >= 'a' && ch <= 'z' in
let start_col = ls.col in
let start = ls.pos in
while ls.pos < ls.len && is_name_char (current_char ls) do
step ls
done;
let name = String.sub ls.src start (ls.pos - start) in
match name with
| "true" -> TBool true
| "false" -> TBool false
| _ -> raise (LexError (Printf.sprintf "Unexpected name '%s' on line %d col %d" name ls.line start_col))
For names, we keep stepping forward as long as we see valid characters (for which we define a small helper function, is_name_char). After that, based on the positions we've gathered, we extract the name from the source and try to interpret it. Our humble little language understands nothing except true and false for now.
All that's left is a "main" function to try out our excellent lexical analyzer:
let () =
let src = {|true false
true
true false
false|} in
try
let tokens = tokenize src in
print_endline (String.concat "; " (List.map show_token tokens))
with LexError e -> print_endline ("Fatal error: " ^ e)
We tokenize the source code and print the resulting tokens using the previously defined show_token function. The result will look something like this:
$ ocaml main.ml
TBool(true); TBool(false); TBool(true); TBool(true); TBool(false); TBool(false); TEOF
Of course, it might happen that we forget what tokens our language understands...
true false
false true true
true oops false
true
false
true
false
...and we get an ugly error:
$ ocaml main.ml
Fatal error: Unexpected name 'oops' on line 3 col 6
The full source of this step can be found here.
Although we now have tokens, having values alone isn't quite complex enough to grow a syntax tree. So we'll need something that actually makes use of these values.
Operators
To get started, we could define a few operators. We could go with the usual C-style trio of !, &&, ||, or opt for the slightly more Python-esque not, and, or. It doesn't change the end result, it just makes the lexical analyzer a bit different. I went with the latter.
First, we'll need some new token types:
type token =
| TBool of bool
| TBoolAnd
| TBoolOr
| TBoolNot
| TEOF
Which, unfortunately, means we also have to update our show_token function with new match branches:
let show_token token =
match token with
| TBool b -> Printf.sprintf "TBool(%B)" b
| TBoolAnd -> "TBoolAnd"
| TBoolOr -> "TBoolOr"
| TBoolNot -> "TBoolNot"
| TEOF -> "TEOF"
Then we handle the new names in the tokenize_name function:
match name with
| "true" -> TBool true
| "false" -> TBool false
| "and" -> TBoolAnd
| "or" -> TBoolOr
| "not" -> TBoolNot
| _ -> raise (LexError (Printf.sprintf "Unexpected name '%s' on line %d col %d" name ls.line start_col))
If we run our program with the source code not true or false and true, we'll now get the following output:
$ ocaml main.ml
TBoolNot; TBool(true); TBoolOr; TBool(false); TBoolAnd; TBool(true); TEOF
The changes in this step can be found here.
Now we have boolean values and operators that actually do something with them, so we finally have all we need to build our tree.
Precedence
Before we dive into the actual parsing, it's worth noting a few words about operator precedence, since it largely dictates how the parser will work and how it's structured.
In general, unary operators take precedence over binary ones (we interpret the expression not false and true as (not false) and true, even though we don't have parentheses yet).
It can also happen that two operators share the same precedence level (like - and +). In those cases, evaluation can work in several ways:
- we simply don't allow it (for example, we might not want something like
1 < 2 < 3in our language, only1 < 2 and 2 < 3) - we evaluate it from left to right (
1 + 2 - 3becomes(1 + 2) - 3) - we evaluate it from right to left (
1 + 2 - 3becomes1 + (2 - 3))
It's worth taking a look at how "finished" programming languages (like C++) document their operator precedence rules.
The syntactic analyzer
The syntactic analyzer (parser) takes the list of tokens and tries to interpret them according to its own rules, building a syntax tree in the process. Given the tokens our language currently supports, the nodes of this tree can be the following:
type expr =
| EBool of bool
| EBoolNot of expr
| EBoolAnd of expr * expr
| EBoolOr of expr * expr
EBool is similar to TBool, but the others have a bit more interesting. They can take parameters of type expr, which makes the type itself recursive. For example, we can write something like EBoolOr(EBoolNot(EBool(true)), EBool(false)). The * here doesn't mean multiplication, it's just a separator between parameters.
The above expression as a tree
We'll also have a helper function here as well to print the constructed expression:
let rec show_expr e =
match e with
| EBool b -> Printf.sprintf "EBool(%B)" b
| EBoolNot e' -> Printf.sprintf "EBoolNot(%s)" (show_expr e')
| EBoolAnd (l, r) -> Printf.sprintf "EBoolAnd(%s, %s)" (show_expr l) (show_expr r)
| EBoolOr (l, r) -> Printf.sprintf "EBoolOr(%s, %s)" (show_expr l) (show_expr r)
Since the type is recursive, we need to call show_expr again on parameters of type expr. The usual pattern matching and recursion. The e' in the EBoolNot line isn't anything special, the apostrophe is simply part of the variable name.
Similar to the lexical analyzer, the parser will also have an internal state and a few helper functions.
exception ParseError of string
type parser_state = {
mutable tokens : token list;
}
let make_parser tokens = { tokens }
let peek ps =
match ps.tokens with
| t :: _ -> t
| [] -> TEOF
let consume ps =
match ps.tokens with
| t :: rest -> ps.tokens <- rest; t
| [] -> TEOF
For now, parser_state only contains our tokens. By the time we get here, we've lost track of line and column information, so we have to give up on user-friendly error messages from now on.
peek returns the next token without removing it from the list. You can see how smart match can be here: the t :: _ structure might look familiar from tokenize as the operator for prepending to a list, but in this context, match makes the value of t to the first element of the list, and we simply don't care about the rest (_).
consume also returns the next token, but removes it from the list (the remainder becomes the new value of ps.tokens).
Now we can move on to the actual parsing. The operator precedence we want to achieve is:
- literal values (
parse_primary) - the boolean
notoperator (parse_bool_unary) - the boolean
andandoroperators (parse_bool_expr)
We'll start by parsing expressions... since we only have expressions. And only a boolean type, so in the end, we'll only be dealing with boolean expressions. In the code, we move from weaker operators down to stronger ones.
let rec parse_expr ps =
parse_bool_expr ps
and parse_bool_expr ps =
let rec loop left =
match peek ps with
| TBoolAnd ->
ignore (consume ps);
let right = parse_bool_unary ps in
loop (EBoolAnd (left, right))
| TBoolOr ->
ignore (consume ps);
let right = parse_bool_unary ps in
loop (EBoolOr (left, right))
| _ -> left
in
loop (parse_bool_unary ps)
parse_expr doesn't do much, it just felt wrong to start expression parsing with something as specific as parse_bool_expr.
The mutually recursive functions are chained together with and. Here we reuse the recursive loop pattern from tokenize, which receives the left operand of a binary operator from a lower level. If it finds an and or or, it builds the corresponding tree node and then recursively looks for more of them. If it doesn't, it returns what it has collected so far (or the original left operand).
It's worth noting that traditionally, among boolean operators, and has higher precedence than or. In our current implementation, however, they are on the same level and evaluated left to right.
Since we are the creators of this world, we can treat this either as intended behavior or as a bug. If you go with the latter, fixing it is a great exercise: you'd need to split parse_bool_expr into two levels: parse_bool_or_expr, which calls parse_bool_and_expr, which calls parse_bool_unary.
The next level is the unary operator (not):
and parse_bool_unary ps =
match peek ps with
| TBoolNot ->
ignore (consume ps);
EBoolNot (parse_bool_unary ps)
| _ -> parse_primary ps
If we find a not token, we return the corresponding tree node (whose value can itself be another not or a simple value). Otherwise, we try to parse a simple value.
That leaves us with handling the simple values:
and parse_primary ps =
match peek ps with
| TBool b ->
ignore (consume ps);
EBool b
| t -> raise (ParseError (Printf.sprintf "Unexpected token '%s'" (show_token t)))
If the next token is true or false, we extend the tree with an EBool node; otherwise, we throw an error because we didn't expect that token.
Finally, the function that ties everything together:
let parse_program tokens =
let ps = make_parser tokens in
let expr = parse_expr ps in
let t = peek ps in
if t <> TEOF then
raise (ParseError (Printf.sprintf "Unexpected token '%s'" (show_token t)));
expr
Since our language currently consists of a single expression, if there are tokens left after parsing the expression other than TEOF then something went wrong.
In our new "main" function, we build the tree and print it:
let () =
let src = {|not true or false and true|} in
try
let tokens = tokenize src in
print_endline (String.concat "; " (List.map show_token tokens));
let ast = parse_program tokens in
print_endline (show_expr ast)
with
| LexError e -> print_endline ("Fatal error: " ^ e)
| ParseError e -> print_endline ("Parse error: " ^ e)
And the output looks something like this:
$ ocaml main.ml
TBoolNot; TBool(true); TBoolOr; TBool(false); TBoolAnd; TBool(true); TEOF
EBoolAnd(EBoolOr(EBoolNot(EBool(true)), EBool(false)), EBool(true))
All that recursion can shake one's world quite a bit, so it's worth taking a moment to think through how the tree is actually built.
parse_bool_exprcallsparse_bool_unary, which finds aTBoolNottoken and returns anEBoolNotnode, whose value is computed by calling itself again- the next
TBooltoken doesn't have a match inparse_bool_unary, so it flows down toparse_primary, where it becomes anEBool, and we jump back toparse_bool_expr - in
parse_bool_expr, we enter theloop, which finds aTBoolOrtoken and turns it into anEBoolOrnode; its left operand is the previously obtainedEBoolNot(EBool(true)), and to compute the right operand, it callsparse_bool_unary parse_bool_unarydoesn't see a unary operator, so it callsparse_primary, which returns anEBool, and we jump back into theloop- our
EBoolOrnode is complete, and using it as the left operand, we callloopagain - the
loopfinds aTBoolAndtoken, which becomes anEBoolAnd, and it looks for a right operand viaparse_bool_unary parse_bool_unarydoesn't see a unary operator, so it callsparse_primary, which returns anEBool, and we jump back into theloop- the next token is
TEOF, which matches nothing, so we exit both theloopandparse_bool_expr parse_programseesTEOF, and that fills it with joy
Of course, there are cases where parse_program is not quite so joyful. For example, let's try false not and true:
$ ocaml main.ml
TBool(false); TBoolNot; TBoolAnd; TBool(true); TEOF
Parse error: Unexpected token 'TBoolNot'
The changes in this step can be found here.
At this point, our program consists of a single expression. We could evaluate it, but nothing actually makes use of the result. What we need next is a statement.
Statements
My first thought was that it would be nice if our program could actually produce some output. With that in mind, print seems like a solid choice for our first statement.
Once again, we get to make a bunch of decisions about syntax. I went with separating statements by newlines, without introducing a dedicated character for it (like the classic ;). The statement itself will look like a function call: print(expression).
First, we'll need a few new tokens:
type token =
(* ... *)
| TLeftParen
| TRightParen
| TNewLine
| TIdentifier of string
(* ... *)
From here on, the newline itself becomes a token, since it carries meaning in the language. TIdentifier will be a general identifier, which we'll use for print. The (* and *) mark comments in OCaml, so they're just standing in for previously defined parts to avoid repeating everything.
Naturally, we also need to update show_token with the new arrivals:
let show_token token =
match token with
(* ... *)
| TLeftParen -> "TLeftParen"
| TRightParen -> "TRightParen"
| TNewLine -> "TNewLine"
| TIdentifier s -> Printf.sprintf "TIdentifier(%s)" s
(* ... *)
So far, nothing too surprising. The tokenize_name part of our lexical analyzer gets a bit simpler, since from now on we accept all alphabetic identifiers at the lexical level:
let tokenize_name ls =
(* ... *)
match name with
| "true" -> TBool true
| "false" -> TBool false
| "and" -> TBoolAnd
| "or" -> TBoolOr
| "not" -> TBoolNot
| _ -> TIdentifier name
Unknown identifiers will now throw errors during parsing instead.
We also need to tweak the pattern matching in tokenize a bit to support the new language elements:
match current_char ls with
| ' ' | '\t' | '\r' ->
step ls;
loop tokens
| '\n' ->
step_line ls;
loop (TNewLine :: tokens)
| '(' ->
step ls;
loop (TLeftParen :: tokens)
| ')' ->
step ls;
loop (TRightParen :: tokens)
| 'a'..'z' ->
loop (tokenize_name ls :: tokens)
| c ->
raise (LexError (Printf.sprintf "Unexpected character '%c' on line %d col %d" c ls.line ls.col))
For newline characters, we no longer just step forward, we also emit a token. We also handle opening and closing parentheses.
The syntax tree gets a new type for statements, with print as its first member:
type stmt =
| SPrint of expr
For easier debugging, it's worth adding a corresponding show_stmt function as well:
let show_stmt s =
match s with
| SPrint e -> Printf.sprintf "SPrint(%s)" (show_expr e)
For parsing, we'll need two new helper functions:
let expect ps tok =
let t = consume ps in
if t <> tok then
raise (ParseError (Printf.sprintf "Unexpected token '%s', expected '%s'" (show_token t) (show_token tok)))
let skip_newlines ps =
while peek ps = TNewLine do ignore (consume ps) done
expect throws an error if the next token isn't what we expect. With skip_newlines, we can ignore empty lines.
The good news is that we don't need to touch our parse_expr recursion chain at all. Instead, we start a new one:
let rec parse_stmts ps =
let rec loop stmts =
skip_newlines ps;
match peek ps with
| TEOF -> ignore (consume ps); stmts
| _ -> loop (parse_stmt ps :: stmts)
in
List.rev (loop [])
The recursive structure might look familiar from tokenize: we ignore empty lines, exit recursion at the end of the file, otherwise process the next statement, and prepend it to the list. And at the end, we have to reverse the list here as well.
and parse_stmt ps =
match peek ps with
| TIdentifier "print" ->
ignore (consume ps);
expect ps TLeftParen;
let expr = parse_expr ps in
expect ps TRightParen;
SPrint(expr)
| t -> raise (ParseError (Printf.sprintf "Unexpected token '%s'" (show_token t)))
Here's the corresponding parse_stmt, which handles the print identifier. After the identifier, we expect an opening parenthesis, followed by an expression, and then a closing parenthesis. We don't support any other statements yet, so everything else results in an error (this is the new home of the exception we removed from tokenize_name).
After this, parse_program needs a bit of simplification:
let parse_program tokens =
let ps = make_parser tokens in
parse_stmts ps
And our "main" function changes slightly as well:
let () =
let src = {|print(false or true)
print(false)|} in
try
let tokens = tokenize src in
print_endline (String.concat "; " (List.map show_token tokens));
let stmts = parse_program tokens in
print_endline (String.concat "\n" (List.map show_stmt stmts));
with
| LexError e -> print_endline ("Fatal error: " ^ e)
| ParseError e -> print_endline ("Parse error: " ^ e)
As output, we should now see something like this:
$ ocaml main.ml
TIdentifier(print); TLeftParen; TBool(false); TBoolOr; TBool(true); TRightParen; TNewLine; TNewLine; TIdentifier(print); TLeftParen; TBool(false); TRightParen; TEOF
SPrint(EBoolOr(EBool(false), EBool(true)))
SPrint(EBool(false))
The changes in this step can be found here.
At this point, we no longer have just a single syntax tree, we have one per line. In a larger program, we're effectively growing an entire syntax forest.
We've now reached a stage where there's something meaningful to evaluate. If we were to run this program, it would produce visible results. So let's start evaluating.
The evaluator
The evaluator walks through the syntax trees, statement by statement, and executes them.
Surprisingly enough, this step also starts with defining a new type and writing its corresponding show_ function:
type value =
| VBool of bool
let show_value v =
match v with
| VBool b -> Printf.sprintf "%B" b
Next up: evaluating expressions.
let rec eval expr =
match expr with
| EBool b -> VBool b
| EBoolNot e ->
(match eval e with
| VBool b -> VBool (not b))
| EBoolAnd (l, r) ->
(match eval l, eval r with
| VBool a, VBool b -> VBool (a && b))
| EBoolOr (l, r) ->
(match eval l, eval r with
| VBool a, VBool b -> VBool (a || b))
It should come as no surprise that we're once again relying on pattern matching and recursion. Since we don't have anything beyond booleans yet, the pattern matching inside eval might look a bit odd, but it will come in handy later.
let exec stmt =
match stmt with
| SPrint e ->
print_endline (show_value (eval e))
Inside exec, print_endline is a built-in function, and show_value finally graduates from a mere debugging tool to something we actually use during evaluation.
All that's left is to tweak the "main" function so it executes the statements:
let () =
let src = {|print(false or not true)
print(not false and true)|} in
try
let tokens = tokenize src in
let stmts = parse_program tokens in
List.iter exec stmts
with
| LexError e -> print_endline ("Fatal error: " ^ e)
| ParseError e -> print_endline ("Parse error: " ^ e)
And the output should (hopefully) be exactly what we expect:
$ ocaml main.ml
false
true
The changes in this step can be found here.
At this point, we essentially have a complete program that can evaluate boolean expressions and print their results. Still, it would be a shame to stop here.
Parentheses
Right now, if we try to run print(not (false or not true)), we get the following error:
$ ocaml main.ml
Parse error: Unexpected token 'TLeftParen'
Unfortunately, the language's error handling doesn't help us understand exactly where the problem comes from, but we can reasonably suspect the parse_primary function. So if we want to support arbitrary parenthesized expressions, we can introduce a new function into the expression parsing chain:
- literal values (
parse_primary) - parentheses (
parse_bracketing) - boolean
notoperator (parse_bool_unary) - boolean
andandoroperators (parse_bool_expr)
and parse_bracketing ps =
match peek ps with
| TLeftParen ->
ignore (consume ps);
let expr = parse_expr ps in
expect ps TRightParen;
expr
| _ -> parse_primary ps
We don't strictly need a new function here. We could just add another branch to the parse_primary pattern matching, but this feels like a cleaner way to express operator precedence.
After this change, we'll get the correct output:
$ ocaml main.ml
true
The changes in this step can be found here.
It's honestly kind of impressive that we implemented full parenthesized expressions with less than ten lines of code. Time to take on something with a bit more structural ambition.
Numerous options
We now have a working pipeline that can interpret and execute raw text. Adding something like numbers is no longer particularly surprising from a code perspective. All layers need to be touched, but the changes are straightforward, even if they are relatively large. First, the tokens:
type token =
(* ... *)
| TNumber of float
| TLt
| TLtEq
| TGt
| TGtEq
| TEq
| TNotEq
| TPlus
| TMinus
| TAsterisk
| TSlash
| TPercent
(* ... *)
Quite a lot of new tokens: we're implementing comparison operators <, <=, >, >=, ==, <>, as well as arithmetic operators +, -, *, /, %. Naturally, the show_token function also needs to be extended, but I'll leave that to the reader's imagination based on the previous expansions.
The lexer's pattern matching in the tokenize function is extended with the following:
match current_char ls with
(* ... *)
| '+' -> step ls; loop (TPlus :: tokens)
| '-' -> step ls; loop (TMinus :: tokens)
| '*' -> step ls; loop (TAsterisk :: tokens)
| '/' -> step ls; loop (TSlash :: tokens)
| '%' -> step ls; loop (TPercent :: tokens)
| '<' ->
step ls;
(match current_char ls with
| '=' -> step ls; loop (TLtEq :: tokens)
| '>' -> step ls; loop (TNotEq :: tokens)
| _ -> loop (TLt :: tokens))
| '>' ->
step ls;
(match current_char ls with
| '=' -> step ls; loop (TGtEq :: tokens)
| _ -> loop (TGt :: tokens))
| '=' ->
step ls;
(match current_char ls with
| '=' -> step ls; loop (TEq :: tokens)
| c -> raise (LexError (Printf.sprintf "Unexpected character '%c' on line %d col %d" c ls.line ls.col)))
| '0' .. '9' ->
loop (tokenize_number ls :: tokens)
(* ... *)
A new thing here is that some operators require looking at two characters at once. Also, we don't yet have a standalone = operator, so we treat that as an error for now.
This also introduces a subtle bug: after step ls, the subsequent current_char call does not check whether we've reached the end of the source code. As a result, we might get an unexpected OCaml exception (Invalid_argument "index out of bounds".) on otherwise invalid input like print(1 <.
The tokenize_number function looks almost identical to the earlier tokenize_name, except it deals with digits instead:
let tokenize_number ls =
let is_number_char ch = ch >= '0' && ch <= '9' || ch == '.' in
let start = ls.pos in
while ls.pos < ls.len && is_number_char (current_char ls) do
step ls
done;
TNumber (float_of_string (String.sub ls.src start (ls.pos - start)))
The syntax tree also expands with new node types for the operators we've introduced:
type expr =
(* ... *)
| ENumber of float
| EGt of expr * expr
| EGtEq of expr * expr
| ELt of expr * expr
| ELtEq of expr * expr
| EEq of expr * expr
| ENotEq of expr * expr
| EAdd of expr * expr
| ESub of expr * expr
| EMul of expr * expr
| EDiv of expr * expr
| EMod of expr * expr
(* ... *)
We return to the parser, where the question of operator precedence inevitably comes up again. The new operators fit into our current system like this:
- literal values (
parse_primary) - parentheses (
parse_bracketing) - multiplication and division (
parse_mul_div) - addition and subtraction (
parse_add_sub) - comparison operators (
parse_comparison) - boolean
notoperator (parse_bool_unary) - boolean
andandoroperators (parse_bool_expr)
So we need to insert three new functions right after parse_bool_unary:
and parse_comparison ps =
let left = parse_add_sub ps in
match peek ps with
| TLt -> ignore (consume ps); ELt (left, parse_add_sub ps)
| TLtEq -> ignore (consume ps); ELtEq (left, parse_add_sub ps)
| TGt -> ignore (consume ps); EGt (left, parse_add_sub ps)
| TGtEq -> ignore (consume ps); EGtEq (left, parse_add_sub ps)
| TEq -> ignore (consume ps); EEq (left, parse_add_sub ps)
| TNotEq -> ignore (consume ps); ENotEq (left, parse_add_sub ps)
| _ -> left
The structure is similar to what we've seen before, except we don't use recursion within the same level. I didn't want to support expressions like 1 < 2 >= 0 == 0. Some languages partially allow this (Python, for example, supports a < b < c). It's just a matter of taste.
and parse_add_sub ps =
let rec loop left =
match peek ps with
| TPlus -> ignore (consume ps); loop (EAdd (left, parse_mul_div ps))
| TMinus -> ignore (consume ps); loop (ESub (left, parse_mul_div ps))
| _ -> left
in
loop (parse_mul_div ps)
Here we have the familiar internal recursive loop, so expressions like 1 + 2 + 3 - 4 are now possible and will be evaluated left to right.
and parse_mul_div ps =
let rec loop left =
match peek ps with
| TAsterisk -> ignore (consume ps); loop (EMul (left, parse_bracketing ps))
| TSlash -> ignore (consume ps); loop (EDiv (left, parse_bracketing ps))
| TPercent -> ignore (consume ps); loop (EMod (left, parse_bracketing ps))
| _ -> left
in
loop (parse_bracketing ps)
The multiplication/division variant doesn't add much new conceptually. It follows the same logic, just calling the next function in the precedence chain.
A small adjustment is also made in parse_primary:
and parse_primary ps =
match peek ps with
| TBool b -> ignore (consume ps); EBool b
| TNumber n -> ignore (consume ps); ENumber n
| t -> raise (ParseError (Printf.sprintf "Unexpected token '%s'" (show_token t)))
Just like with TBool, we now also handle a TNumber token.
With that, we return to evaluation. Naturally, the value type expands, along with show_value:
exception RuntimeError of string
type value =
| VBool of bool
| VNumber of float
let show_value v =
match v with
| VBool b -> string_of_bool b
| VNumber n ->
if Float.is_integer n then string_of_int (int_of_float n)
else string_of_float n
Then we extend eval:
let rec eval expr =
match expr with
| EBool b -> VBool b
| ENumber b -> VNumber b
| EBoolNot e ->
(match eval e with
| VBool b -> VBool (not b)
| _ -> raise (RuntimeError "Invalid type"))
| EBoolAnd (l, r) -> eval_bool_arithmetic ( && ) l r
| EBoolOr (l, r) -> eval_bool_arithmetic ( || ) l r
| EGt (l, r) -> eval_comparison ( > ) l r
| EGtEq (l, r) -> eval_comparison ( >= ) l r
| ELt (l, r) -> eval_comparison ( < ) l r
| ELtEq (l, r) -> eval_comparison ( <= ) l r
| EEq (l, r) -> eval_comparison ( = ) l r
| ENotEq (l, r) -> eval_comparison ( <> ) l r
| EAdd (l, r) -> eval_arithmetic ( +. ) l r
| ESub (l, r) -> eval_arithmetic ( -. ) l r
| EMul (l, r) -> eval_arithmetic ( *. ) l r
| EDiv (l, r) -> eval_arithmetic ( /. ) l r
| EMod (l, r) ->
(match eval l, eval r with
| VNumber a, VNumber b -> VNumber (float_of_int (int_of_float a mod int_of_float b))
| _ -> raise (RuntimeError "Invalid type"))
Like the arithmetic operators, the binary and/or operators also got a new helper function. The first argument after the function name is the operator in parentheses. This way, it can be passed as an argument like any other function. The arithmetic operators look slightly unusual because OCaml uses different operators for the float type.
The remainder operator stands out a bit, since it's not defined for floats directly. We first convert everything to int, apply the operation, and then convert back to float, since the only numeric type of our language uses float under the hood.
and eval_bool_arithmetic op l r =
match eval l, eval r with
| VBool a, VBool b -> VBool (op a b)
| _ -> raise (RuntimeError "Invalid type")
and eval_comparison op l r =
match eval l, eval r with
| VNumber a, VNumber b -> VBool (op a b)
| _ -> raise (RuntimeError "Invalid type")
and eval_arithmetic op l r =
match eval l, eval r with
| VNumber a, VNumber b -> VNumber (op a b)
| _ -> raise (RuntimeError "Invalid type")
The helper functions are very similar. The only thing that changes is the types they accept and the type they return.
In a dynamically typed language, this is where the rules are defined. For example, eval_bool_arithmetic could accept VNumber on both sides and define a rule where 0 is false, and anything else is true.
A small curiosity here: == and <> currently only work on numbers, so we can't run something like print(true == true). The other comparison operators don't necessarily make sense for booleans either. Though if we interpret false as 0 and true as 1, that could work. Yet another design decision left to the world's creator.
And with that, we've successfully introduced a new type into our language. If we run a program like:
print(1 + 1 * 2 - 8 < 3 and not 15 / 5 > 7 - 1 + 2 - 3)
We get the following output:
$ ocaml main.ml
true
But what happens if we try to run print(-6 < 0)?
$ ocaml main.ml
Parse error: Unexpected token 'TMinus'
Oh no! We don't support negative numbers in the source code.
As with everything, there are multiple ways to fix this. We can explicitly handle negative numbers (- followed by a number), or treat - as both a unary and binary operator (adding something like ENeg alongside ESub), which would allow expressions like -(6 - 5) as well. I went with the first approach, but implementing the second is a good exercise.
and parse_primary ps =
match peek ps with
(* ... *)
| TMinus ->
ignore (consume ps);
(match peek ps with
| TNumber n -> ignore (consume ps); ENumber (-1. *. n)
| t -> raise (ParseError (Printf.sprintf "Unexpected token '%s'" (show_token t))))
(* ... *)
Perhaps the only surprising part here is (-1. *. n), which comes from OCaml's slightly unusual relationship with floats.
If we rerun print(-6 < 0), we now get the correct result:
$ ocaml main.ml
true
The changes in this step can be found here.
Now that we have both boolean and numeric values, it would be nice to store them somewhere. Time to introduce variables.
Variables
On the lexer and parser side of things, there won't be too many surprises, but evaluation will get a bit spicier. Let's start, as usual, with the new tokens:
type token =
(* ... *)
| TEqual
(* ... *)
We already laid the groundwork in the lexer when we introduced the == operator. Now we just need to replace the error handling with the new behavior.
match current_char ls with
(* ... *)
| '=' ->
step ls;
(match current_char ls with
| '=' -> step ls; loop (TEq :: tokens)
| _ -> loop (TEqual :: tokens))
(* ... *)
The syntax tree gets a new expression for variables and a new statement for assignment.
type expr =
(* ... *)
| EVar of string
type stmt =
| SPrint of expr
| SAssign of string * expr
When handling primary expressions, any unknown identifier will now be treated as a variable, and evaluation will decide whether it actually exists.
and parse_primary ps =
match peek ps with
(* ... *)
| TIdentifier i -> ignore (consume ps); EVar i
(* ... *)
And in the statement parser, if something starts with an identifier, we assume it's an assignment.
and parse_stmt ps =
match peek ps with
(* ... *)
| TIdentifier t ->
ignore (consume ps);
expect ps TEqual;
SAssign (t, parse_expr ps)
(* ... *)
Now we can move on to the interesting part: evaluation. During execution, we need a place to store the variables and their values.
In a more serious programming language, this would be a much more complex issue, since we'd also need to handle scopes. But our little language hasn't reached that level of maturity yet, so a simple hash table is enough to store key-value pairs.
let make_env () =
Hashtbl.create 16
let set_env env name value =
Hashtbl.replace env name value
let lookup_env env name =
match Hashtbl.find_opt env name with
| Some v -> v
| None -> raise (RuntimeError (Printf.sprintf "Undefined variable: %s" name))
make_env creates an empty environment, set_env assigns a value to a variable, and lookup_env retrieves it (or fails if the variable doesn't exist).
Both eval and exec require fairly significant changes. They now need to take the env as an argument, which must be passed through all related function calls.
We also need to handle the new expression that reads a variable's value:
let rec eval env expr =
match expr with
(* ... *)
| EVar v -> lookup_env env v
(* ... *)
And the new statement assigns a value to a variable. The value on the right-hand side can itself be an expression, so it must be evaluated first.
let exec env stmt =
match stmt with
(* ... *)
| SAssign (name, value) ->
set_env env name (eval env value)
Due to the introduction of env, only minimal changes were needed in the "main" function:
let tokens = tokenize src in
let stmts = parse_program tokens in
let env = make_env () in
List.iter (exec env) stmts
And the output should, hopefully, look something like this:
$ ocaml main.ml
5
6
true
The changes in this step can be found here.
The language is shaping up nicely, but something is still missing... the cherry on top of our little grammatical cake.
Control flow
This post is already getting quite long, but it wouldn't really feel like a programming language without the good old if, elif, else, and while. Let's start by adding the new tokens. Besides the four keywords, the two curly braces are also new additions to the language.
type token =
(* ... *)
| TIf
| TElseIf
| TElse
| TWhile
| TLeftCurlyBracket
| TRightCurlyBracket
(* ... *)
We handle the new keywords in tokenize_name in the usual way.
match name with
(* ... *)
| "if" -> TIf
| "elif" -> TElseIf
| "else" -> TElse
| "while" -> TWhile
(* ... *)
And in tokenize, we add support for the new braces.
match current_char ls with
(* ... *)
| '{' -> step ls; loop (TLeftCurlyBracket :: tokens)
| '}' -> step ls; loop (TRightCurlyBracket :: tokens)
(* ... *)
After much of the same old, here's something new and exciting in the statement type: both if and while take lists of statements as parameters.
type stmt =
| SPrint of expr
| SAssign of string * expr
| SIf of expr * stmt list * stmt list
| SWhile of expr * stmt list
'And what about elif and else?' you might ask. Those are what you'd typically call syntactic sugar: convenience features for the language user that don't need to exist at the syntax tree level. A single construct, SIf(condition, true_statements, false_statements), can handle everything:
- if there is an
elifbranch, the false branch becomes a single-element list containing anotherSIf() - if there is no
elsebranch, the false branch of the finalSIf()is just an empty list
We also need to adjust how statements are parsed. Until now, we read statements up to TEOF, but now we also need to handle cases where statements stop at a closing brace.
let rec parse_stmts ps stop_token =
let rec loop stmts =
skip_newlines ps;
if (peek ps) <> stop_token then
loop (parse_stmt ps :: stmts)
else
stmts
in
List.rev (loop [])
We update parse_program accordingly:
let parse_program tokens =
let ps = make_parser tokens in
parse_stmts ps TEOF
And parse_stmt only needs a suspiciously small amount of modification:
and parse_stmt ps =
match peek ps with
| TIf ->
parse_if ps
(* ... *)
Then comes the actual implementation for handling if:
and parse_if ps =
ignore (consume ps);
expect ps TLeftParen;
let cond = parse_expr ps in
expect ps TRightParen;
expect ps TLeftCurlyBracket;
let then_branch = parse_stmts ps TRightCurlyBracket in
expect ps TRightCurlyBracket;
let else_branch = parse_else ps in
SIf (cond, then_branch, else_branch)
We first read the condition, which we expect to be enclosed in ( and ). Then we parse the statements that run if the condition evaluates to true, enclosed in { and }. The else branch is handled by a separate function.
and parse_else ps =
match peek ps with
| TElseIf ->
[ parse_if ps ]
| TElse ->
ignore (consume ps);
expect ps TLeftCurlyBracket;
let stmts = parse_stmts ps TRightCurlyBracket in
expect ps TRightCurlyBracket;
stmts
| _ -> []
Handling elif recursively calls back into parse_if, but wraps its result in a list. The else branch simply parses the statements after it.
and parse_stmt ps =
match peek ps with
(* ... *)
| TWhile ->
ignore (consume ps);
expect ps TLeftParen;
let cond = parse_expr ps in
expect ps TRightParen;
expect ps TLeftCurlyBracket;
let stmts = parse_stmts ps TRightCurlyBracket in
expect ps TRightCurlyBracket;
SWhile (cond, stmts)
(* ... *)
The while implementation is almost identical to if, except there is no else branch.
With that, parsing is complete, and we move on to evaluation. First, we slightly reorganize exec so it takes a list of statements as input, and the original logic is moved into exec_stmt.
let rec exec env stmts =
List.iter (exec_stmt env) stmts
and exec_stmt env stmt =
match stmt with
(* ... *)
| SIf (cond, then_branch, else_branch) ->
(match eval env cond with
| VBool true -> (exec env then_branch)
| VBool false -> (exec env else_branch)
| _ -> raise (RuntimeError "If condition must be a boolean"))
| SWhile (cond, stmts) ->
let rec loop () =
match eval env cond with
| VBool true ->
exec env stmts;
loop ()
| VBool false -> ()
| _ -> raise (RuntimeError "While condition must be a boolean")
in
loop ()
This change is immediately used in the handling of SIf and SWhile. For SIf, we use pattern matching to decide which branch to execute. For SWhile, we repeatedly execute the body recursively until the condition becomes false.
In the "main" function, we only need to adjust the call to exec.
let tokens = tokenize src in
let stmts = parse_program tokens in
let env = make_env () in
exec env stmts
And then we can run something that finally looks like real code:
fizzbuzz = -15
fizz = -3
buzz = -5
i = 1
while (i <= 15) {
if (i % 15 == 0) {
print(fizzbuzz)
} elif (i % 3 == 0) {
print(fizz)
} elif (i % 5 == 0) {
print(buzz)
} else {
print(i)
}
i = i + 1
}
Unfortunately, strings don't exist in our little world yet, so this was the best we could do, but the output speaks for itself. We're solving enterprise-grade problems with elegant simplicity.
$ ocaml main.ml
1
2
-3
4
-5
-3
7
8
-3
-5
11
-3
13
14
-15
The changes in this step can be found here.
'Tis the end
And so we've arrived here. 'The machine works, the creator rests.' By far the longest post ever published on this blog. Congratulations to everyone who made it this far. I hope I managed to cause a few late-night coding sessions.
And the final result? In barely 500 lines, we built something that can genuinely be called a programming language (even if a fairly limited one). It actually surprised me how compact it ended up being.
There are still countless extension possibilities and challenges hidden in this project, but a solid foundation has been built for further experimentation. Just to mention a few natural next steps:
- extending error handling (so that in both
ParseErrorandRuntimeErrorcases, we can point out the exact location in the original source code) - short-circuit evaluation for boolean expressions (if
aina and bevaluates to false,bdoesn't need to be evaluated) - functions (and the inevitable problem of variable scopes that comes with them)
- a string type, along with its operators, functions, and statements
- lists and dictionaries
With that, I'll wrap things up here. Thank you for reading.
TEOF