Practical parsing with Flex and Bison

Although parsing is often described from the perspective of writing a compiler, there are many common smaller tasks where it’s useful. Reading file formats, talking over the network, creating shells, and analyzing source code are all easier using a robust parser.

By taking time to learn general-purpose parsing tools, you can go beyond fragile homemade solutions and inflexible third-party libraries. We’ll cover Lex and Yacc in this guide because they are mature and portable. We’ll also cover their later incarnations as Flex and Bison.

Above all, this guide is practical. We’ll see how to properly integrate parser generators into your build system, how to create thread-safe parsing modules, and how to parse real data formats. I’ll motivate each feature of the parser generator with a concrete problem it can solve. And, I promise, none of the typical calculator examples.

Lexical scanning

People usually use two stages to process structured text. The first stage, lexing (aka scanning), breaks the input into meaningful chunks of characters. The second, parsing, groups the scanned chunks following potentially recursive rules. However, a nice lexing tool like Lex can be useful on its own, even when not paired with a parser.

The simplest way to describe Lex is that it runs user-supplied C code blocks for regular expression matches. It reads a list of regexes and constructs a giant state machine which attempts to match them all “simultaneously.”

A lex input file is composed of three possible sections: definitions, rules, and helper functions. The sections are delimited by %%. Lex transforms its input file into a plain C file that can be built using an ordinary C compiler.

Here’s an example. We’ll match the strings cot, cat, and cats. Our actions will print a replacement for each.

/* catcot.l */

%{
#include <stdio.h>
%}

%%

cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }

To build it:

# turn the input into an intermediate C file
lex -t catcot.l > catcot.c

# compile it
cc -o catcot catcot.c -ll

(Alternatively, build it in one step with make catcot. Even in the absence of a Makefile, POSIX make has suffix rules that handle .l files.)
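
For the curious, that built-in rule conceptually does the following. This is a sketch of the idea, not the exact POSIX recipe; check your make implementation's documentation for the precise text.

# roughly what the .l suffix rule does:
# generate the C, compile and link it, discard the intermediate
.l:
	$(LEX) $(LFLAGS) -t $< > $*.c
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $*.c -ll
	rm -f $*.c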

The program outputs simple substitutions:

echo "the cat on the cot joined the cats" | ./catcot
the thankless pet on the portable bed joined the anti-herd

The reason it prints non-matching words (such as “the”) is that there’s an implicit rule matching any character (.) and echoing it. In most real parsers we’ll want to override that.
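
For instance, adding a catch-all rule at the end swallows anything the other rules don't claim (a minimal sketch; note that . doesn't match newline, so we list \n explicitly):

 /* at the bottom of the rules section: match any
    other character or newline, and do nothing */
.|\n  ;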

Here’s what’s happening inside the scanner. Lex reads the regexes and generates a state machine to consume input. Below is a visualization of the states, with transitions labeled by input character. The circles with a double outline indicate states that trigger actions.

[diagram: state machine for the cat/cot scanner]

Note there’s no notion of word boundaries in our lexer; it’s operating on characters alone. For instance:

echo "catch!" | ./catcot
thankless petch!

That sounds rather like an insult.

An important subtlety is how Lex handles multiple eligible matches. It picks the longest possible match available, and in the case of a tie, picks the matching pattern defined earliest.

To illustrate, suppose we add a looser regex, c.t, first.

%%
c.t { printf("mumble mumble"); } 
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }

Lex detects that the rule masks cat and cot, and outputs a warning:

catcot.l:10: warning, rule cannot be matched
catcot.l:11: warning, rule cannot be matched

It still compiles though, and behaves like this:

echo "the cat on the cot joined the cats" | ./catcot
the mumble mumble on the mumble mumble joined the anti-herd

Notice that it still matched cats, because cats is longer than c.t.

Compare what happens if we move the loose regex to the end of our rules. It can then pick up whatever strings get past the others.

%%
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }
c.t { printf("mumble mumble"); } 

It acts like this:

echo "cut the cot" | ./catcot
mumble mumble the portable bed

Now’s a good time to take a detour and observe how our user-defined code acts in the generated C file. Lex creates a function called yylex(), and inserts the code blocks verbatim into a switch statement. When using lex with a parser, the parser will call yylex() to retrieve tokens, named by integers. For now, our user-defined code isn’t returning tokens to a parser, but doing simple print statements.

/* catcot.c (generated by lex) */

int yylex (void)
{
	/* ... */
	switch ( yy_act )
	{
		/* ... */

		case 1:
		YY_RULE_SETUP
		#line 9 "catcot.l"
		{ printf("portable bed"); }
			YY_BREAK
		case 2:
		YY_RULE_SETUP
		#line 10 "catcot.l"
		{ printf("thankless pet"); }
			YY_BREAK
		case 3:
		YY_RULE_SETUP
		#line 11 "catcot.l"
		{ printf("anti-herd"); }
			YY_BREAK

		/* ... */
	}
	/* ... */
}

As mentioned, a lex file is composed of three sections:

DEFINITIONS

%%

RULES

%%

HELPER FUNCTIONS

The definitions section is where you can embed C code to include headers and declare functions used in rules. The definitions section can also define friendly names for regexes that can be reused in the rules.
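
For example, a definition gives a regex a name, and rules reference the name in curly braces (a minimal sketch; DIGIT is our own invented name):

DIGIT [0-9]

%%

{DIGIT}+ { printf("number: %s", yytext); }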

The rules section, as we saw, contains a list of regexes and associated user code.

The final section is where to put the full definitions of helper functions. This is also where you’d put the main() function. If you omit main(), the Lex library provides one that simply calls yylex(). This default main() implementation (and implementations for a few other functions) is available by linking your lex-generated C code with the -ll compiler flag.
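
That default is roughly equivalent to the following (a sketch; the real library version may differ in minor details):

int main(void)
{
	yylex();
	return 0;
}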

Let’s see a short, fun example: converting Roman numerals to decimal. Thanks to lex’s behavior of matching longer strings first, it can read the single-letter numerals, but look ahead for longer subtractive forms like “IV” or “XC.”

/* roman-lex.l */

/* the %{ ... %} enclose C blocks that are copied
   into the generated code */

%{
#include <stdio.h>
#include <stdlib.h>

/* globals are visible to user actions and main() */

int total;
%}

%%

 /*<- notice the whitespace before this comment, which
      is necessary for comments in the rules section */

 /* the basics */

I  { total +=    1; }
V  { total +=    5; }
X  { total +=   10; }
L  { total +=   50; }
C  { total +=  100; }
D  { total +=  500; }
M  { total += 1000; }

 /* special cases match with preference
    because they are longer strings */

IV { total +=    4; }
IX { total +=    9; }
XL { total +=   40; }
XC { total +=   90; }
CD { total +=  400; }
CM { total +=  900; }

 /* ignore final newline */

\n ;

 /* but die on anything else */

.  {
	fprintf(stderr, "unexpected: %s\n", yytext);
	exit(EXIT_FAILURE);
}

%%

/* provide our own main() rather than the implementation
   from lex's library linked with -ll */

int main(void)
{
	/* only have to call yylex() once, since our
	   actions don't return */
	yylex();

	fprintf(yyout, "%d\n", total);
	return EXIT_SUCCESS;
}

More realistic scanner

Now that we’ve seen Lex’s basic operation in the previous section, let’s consider a useful example: syntax highlighting. Detecting keywords in syntax is a problem that lex can handle by itself, without help from yacc.

Because lex and yacc are so old (predating C), and used in so many projects, you can find grammars already written for most languages. For instance, we’ll take quut’s C specification for lex, and modify it to do syntax highlighting.

This relatively short program accurately handles the full complexity of the language. It’s easiest to understand by reading in full. See the inline comments for new and subtle details.

/* c.l syntax highlighter */

%{
/* POSIX for isatty, fileno */
#define _POSIX_C_SOURCE 200112L

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* declarations are visible to user actions */

enum FG
{
	fgRED      = 31,   fgGREEN    = 32,
	fgORANGE   = 33,   fgCYAN     = 36,   
	fgDARKGREY = 90,   fgYELLOW   = 93
};

void set_color(enum FG);
void reset_color(void);
void color_print(enum FG, const char *);

void consume_comment(void);
%}

/* named regexes we can use in rules */

O   [0-7]
D   [0-9]
NZ  [1-9]
L   [a-zA-Z_]
A   [a-zA-Z_0-9]
H   [a-fA-F0-9]
HP  (0[xX])
E   ([Ee][+-]?{D}+)
P   ([Pp][+-]?{D}+)
FS  (f|F|l|L)
IS  (((u|U)(l|L|ll|LL)?)|((l|L|ll|LL)(u|U)?))
CP  (u|U|L)
SP  (u8|u|U|L)
ES  (\\(['"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))
WS  [ \t\v\n\f]

%%

 /* attempting to match and capture an entire multi-line
    comment could strain lex's buffers, so we match the
    beginning, and call consume_comment() to deal with
    the ensuing characters, in our own less resource-
    intensive way */

"/*"      {
	set_color(fgDARKGREY);

	/* For greater flexibility, we'll output to lex's stream, yyout.
	   It defaults to stdout. */
	fputs(yytext, yyout);

	consume_comment();
	reset_color();
}

 /* single-line comments can be handled the default way.
    The yytext variable is provided by lex and points
    to the characters that match the regex */

"//".*    {
	color_print(fgDARKGREY, yytext);
}

^[ \t]*#.*      {
	color_print(fgRED, yytext);
}

 /* you can use the same code block for multiple regexes */

auto     |
bool     |
char     |
const    |
double   |
enum     |
extern   |
float    |
inline   |
int      |
long     |
register |
restrict |
short    |
size_t   |
signed   |
static   |
struct   |
typedef  |
union    |
unsigned |
void     |
volatile |
_Bool    |
_Complex {
	color_print(fgGREEN, yytext);
}

break    |
case     |
continue |
default  |
do       |
else     |
for      |
goto     |
if       |
return   |
sizeof   |
switch   |
while    {
	color_print(fgYELLOW, yytext);
}

 /* we use the named regexes heavily below; putting
    them in curly brackets expands them */

{L}{A}*  {

	/* without this rule, keywords within larger words
	   would be highlighted, like the "if" in "life" --
	   this rule prevents that because it's a longer match */

	fputs(yytext, yyout);
}

{HP}{H}+{IS}?               |
{NZ}{D}*{IS}?               |
"0"{O}*{IS}?                |
{CP}?"'"([^'\\\n]|{ES})+"'" |
{D}+{E}{FS}?                |
{D}*"."{D}+{E}?{FS}?        |
{D}+"."{E}?{FS}?            |
{HP}{H}+{P}{FS}?            |
{HP}{H}*"."{H}+{P}{FS}?     |
{HP}{H}+"."{P}{FS}?         {
	color_print(fgCYAN, yytext);
}

({SP}?\"([^"\\\n]|{ES})*\"{WS}*)+ {
	color_print(fgORANGE, yytext);
}

 /* explicitly mention the default rule */

. ECHO;

%%

/* definitions of the functions we declared earlier */

/* the color functions use ANSI escape codes, and may
   not be portable across all terminal emulators. */

void set_color(enum FG c)
{
	fprintf(yyout, "\033[%d;1m", c);
}

void reset_color(void)
{
	fputs("\033[0m", yyout);
}

void color_print(enum FG c, const char *s)
{
	set_color(c);
	fputs(s, yyout);
	reset_color();
}

/* this function directly consumes characters in lex
   using the input() function. It pulls characters
   from the same stream that the regex state machine
   reads. */
void consume_comment(void)
{
	int c;

	/* EOF in lex is 0, which is different from
	   the EOF macro in the C standard library */
	while ((c = input()) != 0)
	{
		putchar(c);
		if (c == '*')
		{
			while ((c = input()) == '*')
				putchar(c);
			if (c == 0) break;
			putchar(c);
			if (c == '/') return;
		}
	}
}

int main(void)
{
	if (!isatty(fileno(stdout)))
	{
		/* a more flexible option would be to make the
		   color changing functions do nothing, but that's
		   too much fuss for an example program */

		fputs("Stdout is not a terminal\n", stderr);
		return EXIT_FAILURE;
	}
	/* since we'll be changing terminal color, be sure to
	   reset it for any program termination event */
	atexit(reset_color);

	/* let our lex rules do the rest */
	yylex();
	return EXIT_SUCCESS;
}

Using a scanner as a library

One of the biggest areas of improvement between classic lex/yacc and flex/bison is the ability of the latter to generate code that’s easier to embed into a larger application. Lex and yacc are designed to create standalone programs, with user-defined code blocks stuck inside. When classic lex and yacc work together, they use a bunch of global variables.

Flex and Bison, on the other hand, can generate thread-safe functions with uniquely prefixed names that can be safely linked into larger programs. To demonstrate, we’ll do another scanner (with Flex this time).

The following Rube Goldberg contraption uses Flex to split words on whitespace and call a user-supplied callback for each word. There’s certainly an easier non-Flex way to do this task, but this example illustrates how to encapsulate Flex code into a reusable library.

/* words.l */

/* don't generate functions we don't need */
%option nounput noinput noyywrap

/* generate a scanner that's thread safe */
%option reentrant

/* Generate "words" rather than "yy" as a prefix, e.g.
   wordslex() rather than yylex(). This allows multiple
   Flex scanners to be linked with the same application */
%option prefix="words"

%%

[^ \t\n]+ {
	/* the return statement causes yylex to stop and return */
	return 1; /* our code for a word token */
}

  /* do nothing for any other characters, don't
     output them as would be the default behavior */
.|\n	  ;

%%

/* Callers interact with this function, which neatly hides
   the Flex inside.

   Also, we'll call "yy" functions like "yylex()" inside,
   and Flex will rename them in the resulting C file to
   calls with the "words" prefix, like "wordslex()"

   Zero return means success, nonzero is a Flex error
   code. */

int words_callback(char *s, void (*f)(const char *))
{
	/* in the reentrant mode, we maintain our
	   own scanner and its associated state */
	int i;
	yyscan_t scanner;
	YY_BUFFER_STATE buf;

	if ((i = yylex_init(&scanner)) != 0)
		return i;

	/* read from a string rather than a stream */
	buf = yy_scan_string(s, scanner);

	/* Each time yylex finds a word, it returns nonzero.
	   It resumes where it left off when we call it again */
	while ((i = yylex(scanner)) > 0)
	{
		/* call the user supplied function f with
		   yytext of the match */
		f(yyget_text(scanner));
	}

	/* clean up */
	yy_delete_buffer(buf, scanner);
	yylex_destroy(scanner);
	return 0;
}

Build it like this:

# generate scanner, build object file
flex -t words.l > words.c
cc -c words.c

# verify that all public text symbols are prefixed by "words"
nm -g words.o | grep " T "

A calling program can use our library without seeing any Flex internals.

/* test_words.c */

#include <stdio.h>

/* words_callback defined in the object file -- you could put
   this declaration in a header file words.h */
int words_callback(char *, void (*)(const char *));

void print_word(const char *w)
{
	puts(w);

	/* if you want to use the parameter w in the future, you
	   need to duplicate it in memory whose lifetime you control */
}

int main(void)
{
	words_callback(
		"The quick brown fox\n"
		"jumped over the lazy dog\n",
		&print_word
	);
	return 0;
}

To build the program, just link it with words.o.

cc -o test_words test_words.c words.o
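
If you take the words.h suggestion from the comment in test_words.c, the header might look like this sketch:

/* words.h - public interface hiding the Flex internals */
#ifndef WORDS_H
#define WORDS_H

/* scan s, calling f once per whitespace-separated word;
   returns 0 on success, nonzero on scanner failure */
int words_callback(char *s, void (*f)(const char *));

#endif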

Parsing

Now that we’ve seen how to identify tokens with a scanner, let’s learn how a parser can act on the tokens using recursive rules. Yacc/byacc/bison generate LALR (look-ahead LR) parsers, and Bison supports more powerful modes if desired.

Mental model of LR parsing

LR parsers build bottom-up toward a goal, shifting tokens onto a stack and combining (“reducing”) them according to rules. It’s helpful to get a mental model for this process, so let’s jump into a simple example and simulate what yacc does.

Here’s a yacc grammar with a single rule to build a result called foo. We specify that foo is composed of lex tokens A, B, and C.

%token A B C

%%

foo: A B C

Yacc transforms the grammar into a state machine which looks like this:

[diagram: state machine for foo: A B C]

The first rule in the file (and the only rule in our case) becomes yacc’s goal. Yacc begins in state 0, with the implicit rule 0: $accept: • foo $end. The parse will be accepted if we can produce a foo followed immediately by the end of input. The bullet point indicates our progress reading the input. In state 0 it’s at the beginning, meaning we haven’t read anything yet.

Initially there’s no lookahead token, so yacc calls yylex() to get one. If lex produces an A, we follow the state transition to state 1. Because the arrow is a solid line, not dashed, yacc “shifts” the token to its token stack. It also pushes state 1 onto a state stack, which now holds states 0 and 1.

State 1 is trying to satisfy the rule which it calls rule 1, namely 1 foo: A • B C. The bullet point after the A indicates we’ve seen the A already. Don’t confuse the state numbers and rule numbers – yacc numbers them independently.

Yacc continues processing input, shifting tokens and moving to states 3 and 5 if lex produces the expected tokens. If, at any point, lex produces a token not matching any transitions in the current state, then yacc reports a syntax error and terminates. (There’s a way to do error recovery, but that’s another topic.)

State 5 has seen all necessary tokens for rule 1: 1 foo: A B C •. Yacc continues to the diamond marked “R1,” which is a reduction action. Yacc “reduces” rule 1, popping the A, B, C terminal tokens off the stack and pushing a single non-terminal foo token. When it pops the three tokens, it pops the same number of states (states 5, 3, and 1). Popping three states lands us back in state 0.

State 0 has a dashed line going to state 2 that matches the foo token that was just reduced. The dashed line means “goto” rather than “shift,” because rule 0 doesn’t have to shift anything onto the token stack. The previous reduction already took care of that.

Finally, state 2 asks lex for another token, and if lex reports EOF, that matches $end and sends us to state 4, which ties a ribbon on it with the Acc(ept) action.
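
Here’s the whole run at a glance, as a rough trace of the state stack for input A B C (state numbers refer to the machine just described):

state stack   remaining input   action
0             A B C $end        shift A, go to state 1
0 1           B C $end          shift B, go to state 3
0 1 3         C $end            shift C, go to state 5
0 1 3 5       $end              reduce rule 1 (foo: A B C), pop 3 states
0             $end              goto state 2 on foo
0 2           $end              shift $end, go to state 4
0 2 4                           accept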

From what we’ve seen so far, each state may seem to be merely tracking progress through a single rule. However, states actually track all legal ways forward from tokens previously consumed. A single state can track multiple candidate rules. For instance:

%token A B C

%%

 /* foo is either x or y */

foo: x | y;

 /* x and y both start with an A */

x: A B;

y: A C;

For this grammar, yacc produces the following state machine:

[diagram: state machine for foo : x | y]

In state 1 we’ve seen token A, and so rules 3 and 4 are both in the running to reduce an x or y. On a B or C token, the possibilities narrow to a single rule (in state 5 or 6).

Also notice that our rule foo : x | y doesn’t occur verbatim in any states. Yacc separates it into 1 foo: x and 2 foo: y. Thus, the numbered rules don’t always match the rules in the grammar one-to-one.

Yacc can also peek ahead by one token to choose which rule to reduce, without shifting the “lookahead” token. In the following grammar, rules x and y match the same tokens. However, the foo rule can say to choose x when followed by a B, or y when followed by a C:

%token A B C

%%

foo : x B | y C;

x : A;

y : A;

Note multiple reductions coming out of state 1 in the generated state machine:

[diagram: lookahead in the first state]

The presence of a bracketed token ([C]) exiting state 1 indicates that the state uses lookahead. If the state sees token C, it reduces rule 4. Otherwise it reduces rule 3. Lookahead tokens remain to be read when following a dashed-line (goto) action, such as from state 0 to state 4.

Ambiguous grammars

While yacc is a powerful tool to transform a grammar into a state machine, it may not operate the way you intend on ambiguous grammars. These are grammars with a state that could proceed in more than one way with the same input.

As grammars get complicated, it’s quite possible to create ambiguities. Let’s look at small examples that make it easier to see the mechanics of the conflict. That way, when it happens in a real grammar, we’ll have a better feeling for it.

In the following example, the input A B matches both x and y B. There’s no reason for yacc to choose one construction over the other when reducing to foo. So why does this matter, you ask? Don’t we get to foo either way? Yes, but real parsers will have different user code assigned to run per rule, and it matters which code block gets executed.

%token A B

%%

foo : x | y B ;

x : A B ;

y : A ;

The state machine shows ambiguity at state 1:

[diagram: shift/reduce conflict]

At state 1, when the next token is B, the state could shift the token and enter state 5 (attempting to reduce x). It could also reduce y and leave B as lookahead. This is called a shift/reduce conflict. Yacc’s policy in such a conflict is to favor a shift over a reduce.

Alternately, we can construct a grammar with a state that has more than one eligible reduction for the same input. The purest toy example would be foo : A | A, generating:

[diagram: reduce/reduce conflict]

In a reduce/reduce conflict, yacc chooses to reduce the conflicting rule presented earlier in the grammar.

Constructing semantic values

While matching tokens, parsers typically build a user-defined value in memory to represent features of the input. Once the parse reaches the goal state and succeeds, then the user code will act on the memory value (or pass it along to a calling program).

Yacc stores the semantic values from parsed tokens in variables ($1, $2, …) accessible to code blocks, and it provides a variable ($$) for assigning the semantic result of the current code block.

Let’s see it in action. We won’t do a hackneyed calculator, but let’s still make a parser that operates on integers. Integer values allow us to avoid thinking about memory management.

We’ll revisit the roman numeral example, and this time let lex match the digits while yacc combines them into a final result. It’s actually more cumbersome than our earlier way, but illustrates how to work with semantic parse values.

There are some comments in the example below about portability between yacc variants. The three most prominent variants, in order of increasing features, are: the POSIX interface (matching roughly the functionality of AT&T yacc), byacc (Berkeley Yacc), and GNU Bison.

/* roman.y  (plain yacc) */

%{
#include <stdio.h>

/* declarations to fix warnings from sloppy
   yacc/byacc/bison code generation. For instance,
   the code should have a declaration of yylex. */

int yylex(void);

/* The POSIX specification says yyerror should return
   int, although bison documentation says the value is
   ignored. We match POSIX just in case. */

int yyerror(const char *s);
%}

/* tokens our lexer will produce */
%token NUM

%%

/* The first rule is the final goal. Yacc will work
   backward trying to arrive here. This "results" rule
   is a stub we use to print the value from "number." */

results :
  number { fprintf(yyout, "%d\n", $1); }
;

/* as the lexer produces more NUMs, keep adding them */

number :

  /* this is a common pattern for saying number is one or
     more NUMs.  Notice we specify "number NUM" and not
     "NUM number". In yacc recursion, think "right is wrong
     and left is right." */

  number NUM { $$ = $1 + $2; }

  /* base case, using default rule of $$ = $1 */

| NUM
;

The corresponding lexer matches individual numerals, and returns them with their semantic values.

/* roman.l */

%{
/* The .tab.h file is generated by yacc, and we'll explain
   it later */

#include "roman.tab.h"

/* lex communicates semantic token values to yacc through
   a shared global variable */

extern int yylval;
%}

/* when using flex (rather than vanilla lex) fix
   unused function warnings by adding:

%option noinput nounput
*/

%%

 /* The constant for NUM comes from roman.tab.h,
    and was generated because we declared
    "%token NUM" in roman.y */

I  { yylval =    1; return NUM; }
V  { yylval =    5; return NUM; }
X  { yylval =   10; return NUM; }
L  { yylval =   50; return NUM; }
C  { yylval =  100; return NUM; }
D  { yylval =  500; return NUM; }
M  { yylval = 1000; return NUM; }

IV { yylval =    4; return NUM; }
IX { yylval =    9; return NUM; }
XL { yylval =   40; return NUM; }
XC { yylval =   90; return NUM; }
CD { yylval =  400; return NUM; }
CM { yylval =  900; return NUM; }

 /* ignore final newline */
\n ;

 /* As a default action, return the ascii value of
    the character as if it were a token identifier.
    The values from roman.tab.h are offset above 256 to
    be above any ascii value, so there's no ambiguity

    Our parser won't be expecting these values, so
    they will lead to a syntax error */
.  { return *yytext; }

To review: lex generates a yylex() function, and yacc generates a yyparse() function that calls yylex() repeatedly to get new token identifiers. Lex actions copy semantic values into yylval, which yacc in turn copies into the $-variables accessible in parser rule actions.
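
Stated as code, the classic (non-reentrant) contract between the two generated files is just a handful of shared names, roughly:

/* generated by lex: returns the next token code, or 0 at end of input */
int yylex(void);

/* the shared channel for semantic values; a plain int here
   because roman.y declares no %union */
extern int yylval;

/* generated by yacc: calls yylex() repeatedly until accept or error */
int yyparse(void);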

Building an executable roman from the input files roman.y and roman.l requires explanation. With appropriate command line flags, yacc will create the files roman.tab.c and roman.tab.h from roman.y. Lex will create roman.lex.c from roman.l, using token identifiers in roman.tab.h.

In short, here are the build dependencies for each file:

[diagram: build dependency graph]

And here’s how to express it all in a Makefile.

# put together object files from lexer and parser, and
# link the yacc and lex libraries (in that order, to pick
# main() from yacc's library rather than lex's)

roman : roman.tab.o roman.lex.o
	$(CC) -o $@ roman.tab.o roman.lex.o -ly -ll

# tell make which files yacc will generate
#
# an explanation of the arguments:
# -b roman  -  name the files roman.tab.*
# -d        -  generate a .tab.h file too

roman.tab.h roman.tab.c : roman.y
	$(YACC) -d -b roman $?

# the object file relies on the generated lexer, and
# on the token constants 

roman.lex.o : roman.tab.h roman.lex.c

# can't use the default suffix rule because we're
# changing the name of the output to .lex.c

roman.lex.c : roman.l
	$(LEX) -t $? > $@

And now, the moment of truth:

$ make

$ echo MMMCMXCIX | ./roman
3999

Using a parser as a library

In this example we’ll parse LISP S-expressions, limited to string and integer atoms. There’s more going on in this one, such as memory management, different semantic types per token, and packaging the lexer and parser together into a single thread-safe library. This example requires Bison.

/* lisp.y  (requires Bison) */

/* a "pure" api means communication variables like yylval
   won't be global variables, and yylex is assumed to
   have a different signature */

%define api.pure true

/* change prefix of symbols from yy to "lisp" to avoid
   clashes with any other parsers we may want to link */

%define api.prefix {lisp}

/* generate much more meaningful errors rather than the
   uninformative string "syntax error" */

%define parse.error verbose

/* Bison offers different %code insertion locations in
   addition to yacc's %{ %} construct.

   The "top" location is good for headers and feature
   flags like the _XOPEN_SOURCE we use here */

%code top {
	/* XOPEN for strdup */
	#define _XOPEN_SOURCE 600
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/* Bison versions 3.7.5 and above provide the YYNOMEM
	   macro to allow our actions to signal the unlikely
	   event that they couldn't allocate memory. Thanks
	   to the Bison team for adding this feature at my
	   request. :) YYNOMEM causes yyparse() to return 2.

	   The following conditional define allows us to use
	   the functionality in earlier versions too. */

	#ifndef YYNOMEM
	#define YYNOMEM goto yyexhaustedlab
	#endif
}

/* The "requires" code location is designed for defining
   data types that we can use as yylval's for tokens. Code
   in this section is also added to the .tab.h file for
   inclusion by calling code */

%code requires {
	enum sexpr_type {
		SEXPR_ID, SEXPR_NUM, SEXPR_PAIR, SEXPR_NIL
	};

	struct sexpr
	{
		enum sexpr_type type;
		union
		{
			int   num;
			char *id;
		} value;
		struct sexpr *left, *right;
	};
}

/* These are the semantic types available for tokens,
   which we name num, str, and node.

   The %union construction is classic yacc as well. It
   generates a C union and sets it as the YYSTYPE, which
   will be the type of yylval */

%union
{
	int num;
	char *str;
	struct sexpr *node;
}

/* Add another argument in yyparse() so that we
   can communicate the parsed result to the caller.
   We can't return the result directly, since the
   return value is already reserved as an int, with
   0=success, 1=error, 2=nomem

   NOTE
   In our case, the param is a data pointer. However,
   if it were a function pointer (such as a callback),
   then its type would have to be put behind a typedef,
   or else parse-param will mangle the declaration. */

%parse-param {struct sexpr **result}

/* param adds an extra param to yyparse (like parse-param)
   but also causes yyparse to send the value to yylex.
   In our case the caller will initialize their own scanner
   instance and pass it through */

%param {void *scanner}

/* the "provides" location adds the code to our generated
   parser, but also to the .tab.h file for use by callers */

%code provides {
	void sexpr_free(struct sexpr *s);
}

/* unqualified %code is for internal use, things that
   our actions can see. These declarations prevent
   warnings.  Notice the final param in each that came
   from the %param directive above */

%code {
	int lisperror(void *foo, char const *msg, const void *s);
	int lisplex(void *lval, const void *s);
}

/* Now when we declare tokens, we add their type
   in brackets. The type names come from our %union */

%token <str> ID
%token <num> NUM

/* whereas tokens come from the lexer, these
   non-terminals are defined in the parser, and
   we set their types with %type */

%type <node> start sexpr pair list members atom

/* if there's an error partway through parsing, the
   caller wouldn't get a chance to free memory for
   the work in progress. Bison will clean up the memory
   if we provide destructors, though. */

%destructor { free($$); } <str>
%destructor { sexpr_free($$); } <node>

%%

 /* once again we use a dummy non-terminal to perform
    a side-effect, in this case setting *result */

start :
  sexpr   { *result = $$ = $1; return 0; }
;

sexpr :
  atom
| list
| pair
;

list :

  /* This is a shortcut: we use the ascii value for
     parens '('=40, ')'=41 as their token codes.
     Thus we don't have to define a bunch of crap
     manually like LPAREN, RPAREN */

  '(' members ')' { $$ = $2; }

| '('')' {
	struct sexpr *nil = malloc(sizeof *nil);
	if (!nil) YYNOMEM;
	*nil = (struct sexpr){.type = SEXPR_NIL};
	$$ = nil;
  }
;

members :
  sexpr {
	struct sexpr *s = malloc(sizeof *s),
				 *nil = malloc(sizeof *nil);
	if (!s || !nil) {
		free(s), free(nil);
		YYNOMEM;
	}
	*nil = (struct sexpr){.type = SEXPR_NIL};

	/* convention: we assume that a previous parser
	   value like $1 is non-NULL, else it would have
	   died already with YYNOMEM. We're responsible
	   for checking only our own allocations */

	*s = (struct sexpr){
		.type = SEXPR_PAIR,
		.left = $1,
		.right = nil
	};
	$$ = s;
  }
| sexpr members {
	struct sexpr *s = malloc(sizeof *s);

	/* Another important memory convention: we
	   can't trust that our lexer successfully
	   allocated its yylvalue, because the signature
	   of yylex doesn't communicate failure. We
	   assume NULL in $1 means alloc failure and
	   we report that. The only other way to signal
	   from yylex would be to make a fake token to
	   represent out-of-memory, but that's harder */

	if (!s) YYNOMEM;
	*s = (struct sexpr){
		.type = SEXPR_PAIR,
		.left = $1,
		.right = $2
	};
	$$ = s;
  }
;

pair :
  '(' sexpr '.' sexpr ')' {
	struct sexpr *s = malloc(sizeof *s);
	if (!s) YYNOMEM;
	*s = (struct sexpr){
		.type = SEXPR_PAIR,
		.left = $2,
		.right = $4
	};
	$$ = s;
  }
;

atom :
  ID {
	if (!$1) YYNOMEM;

	struct sexpr *s = malloc(sizeof *s);
	if (!s) YYNOMEM;
	*s = (struct sexpr){
		.type = SEXPR_ID,
		.value.id = strdup($1)
	};
	if (!s->value.id)
	{
		free(s);
		YYNOMEM;
	}
	free($1); /* the lexer's strdup; we made our own copy above */
	$$ = s;
  }
| NUM {
	struct sexpr *s = malloc(sizeof *s);
	if (!s) YYNOMEM;
	*s = (struct sexpr){
		.type = SEXPR_NUM,
		.value.num = $1
	};
	$$ = s;
  }
;

%%

/* notice the extra parameters required
   by %param and %parse-param */

int lisperror(void *yylval, char const *msg, const void *s)
{
	(void)yylval;
	(void)s;
	return fprintf(stderr, "%s\n", msg);
}

/* useful internally by us, and externally by callers */

void sexpr_free(struct sexpr *s)
{
	if (!s)
		return;
	
	if (s->type == SEXPR_ID)
		free(s->value.id);
	else if (s->type == SEXPR_PAIR)
	{
		sexpr_free(s->left);
		sexpr_free(s->right);
	}
	free(s);
}

The parser does the bulk of the work. We just need to pair it with a scanner that reads atoms and parens.

/* lisp.l */

/* disable unused functions so we don't
   get compiler warnings about them */

%option noyywrap nounput noinput
%option noyyalloc noyyrealloc noyyfree

/* change our prefix from yy to lisp */

%option prefix="lisp"

/* use the pure parser calling convention */

%option reentrant bison-bridge

%{
#include "lisp.tab.h"

#define YY_EXIT_FAILURE ((void)yyscanner, EXIT_FAILURE)

/* XOPEN for strdup */
#define _XOPEN_SOURCE 600
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* seems like a bug that I have to do this, since flex
   should know prefix=lisp and match bison's LISPSTYPE */
#define YYSTYPE LISPSTYPE

int lisperror(const char *msg);
%}

%%

[[:alpha:]][[:alnum:]]* {
	/* The memory that yytext points to gets overwritten
	   each time a pattern matches. We need to give the caller
	   a copy. Also, if strdup fails and returns NULL, it's up
	   to the caller (the parser) to detect that.

	   Notice yylval is a pointer to union now.  It's passed
	   as an arg to yylex in pure parsing mode */

	yylval->str = strdup(yytext);
	return ID;
}

[-+]?[[:digit:]]+ {
	long n = strtol(yytext, NULL, 10);

	if (n < INT_MIN || n > INT_MAX)
		lisperror("Number out of range");
	yylval->num = (int)n;
	return NUM;
}

[[:space:]]  ; /* ignore */

 /* this is a handy rule to return the ASCII value
    of any other character. Importantly, parens */

. { return *yytext; }

Finally, here’s how to call the parser from a regular program.

/* driver_lisp.c */

#include <stdio.h>
#include <stdlib.h>

#define YYSTYPE LISPSTYPE
#include "lisp.tab.h"
#include "lisp.lex.h"

void sexpr_print(struct sexpr* s, unsigned depth)
{
	for (unsigned i = 0; i < depth; i++)
		printf("  ");
	switch (s->type)
	{
		case SEXPR_ID:
			puts(s->value.id);
			break;
		case SEXPR_NUM:
			printf("%d\n", s->value.num);
			break;
		case SEXPR_PAIR:
			puts(".");
			sexpr_print(s->left, depth+1);
			sexpr_print(s->right, depth+1);
			break;
		case SEXPR_NIL:
			puts("()");
			break;
		default:
			abort();
	}
}

int main(void)
{
	int i;
	struct sexpr *expr;
	yyscan_t scanner;

	if ((i = lisplex_init(&scanner)) != 0)
		exit(i);

	int e = lispparse(&expr, scanner);
	printf("Code = %d\n", e);
	if (e == 0 /* success */)
	{
		sexpr_print(expr, 0);
		sexpr_free(expr);
	}

	lisplex_destroy(scanner);
	return 0;
}

To build it, use the Makefile pattern from roman to create analogous lisp.lex.o and lisp.tab.o. This example requires Flex and Bison, so set LEX=flex and YACC=bison at the top of the Makefile to override whatever system defaults are used for these programs. Finally, compile driver_lisp.c and link with those object files.
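
Here’s a sketch of such a Makefile. Note flex’s --header-file option, which generates the lisp.lex.h that driver_lisp.c includes:

LEX = flex
YACC = bison

driver_lisp : driver_lisp.o lisp.tab.o lisp.lex.o
	$(CC) -o $@ driver_lisp.o lisp.tab.o lisp.lex.o

lisp.tab.h lisp.tab.c : lisp.y
	$(YACC) -d -b lisp lisp.y

lisp.lex.c lisp.lex.h : lisp.l
	$(LEX) --header-file=lisp.lex.h -o lisp.lex.c lisp.l

driver_lisp.o : lisp.tab.h lisp.lex.h
lisp.lex.o lisp.tab.o : lisp.tab.h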

Here’s the program in action:

$ echo "(1 () (2 . 3) (4))" | ./driver_lisp
Code = 0
.
  1
  .
    ()
    .
      .
        2
        3
      .
        .
          4
          ()
        ()

Designing against an RFC

Internet Request For Comment (RFC) documents describe the syntax of many protocols and data formats. They often include complete Augmented Backus-Naur Form (ABNF) grammars, which we can convert into robust yacc parsers.

Let’s examine RFC 4180, which describes the comma-separated value (CSV) format. It’s pretty simple, but has problematic edge cases: commas in quoted values, quoted quotes, raw newlines in quoted values, and blank-as-a-value.

Here’s the full grammar from the RFC. Notice how alternatives are specified with “/” rather than “|”, and how ABNF has the constructions *(zero-or-more-things) and [optional-thing]:

file = [header CRLF] record *(CRLF record) [CRLF]

header = name *(COMMA name)

record = field *(COMMA field)

name = field

field = (escaped / non-escaped)

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

non-escaped = *TEXTDATA

COMMA = %x2C

CR = %x0D

DQUOTE =  %x22

LF = %x0A

CRLF = CR LF

TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

The grammar makes no distinction between lexing and parsing, although the uppercase identifiers hint at lexer tokens. While it may be tempting to translate to yacc top-down, starting at the file level, I’ve found the most productive way is to start with lexing.

We can combine most of the grammar into two lex rules to match fields:

%%

\"([^"]|\"\")*\" {
	/* this is what the ABNF calls "escaped" */

	/* TODO: copy un-escaped internals to yylval */

	return FIELD;
}

[^",\r\n]+ {
	/* This is *almost* what the ABNF calls "un-escaped,"
	   except it won't match an empty field, like
	   a,,b
	    ^---- this

	   Actually, even if we tried matching an empty string,
	   the comma or crlf would prove a longer match and
	   trump this one.
	*/

	/* TODO: capture the value to yylval */

	/* no need to bother yacc with two token types, we
	   call them both FIELD. */
	return FIELD;
}

 /* handle both UNIX and DOS style, per the spec */
\n|\r\n    { return CRLF; }

 /* catch the comma, and any other unexpected thing */
.          { return *yytext; }

With FIELD out of the way, here’s what’s left to translate:

file = [header CRLF] record *(CRLF record) [CRLF]

header = name *(COMMA name)

record = field *(COMMA field)

name = field

Let’s also drop the designation of the first row as the “header.” The application can choose to treat the first ordinary row as a header if desired. This simplifies the grammar to:

file = record *(CRLF record) [CRLF]

record = field *(COMMA field)

At this point it’s easy to convert to yacc.

%token CRLF FIELD

%%

file :
  record
| file CRLF record
;

record :
  field.opt
| record ',' field.opt
;

 /* Here is where we handle the potentially blank
    non-escaped FIELD. The ".opt" suffix doesn't mean
    anything to yacc, it's just a reminder for us that
    this *may* match a FIELD, or nothing at all */
field.opt :
  /* empty */
| FIELD
;

Matching blank fields is tricky. There are three fields in a,,b; no way around it. That means we have to conjure some value (either a non-terminal symbol or a terminal token) out of thin air between characters of input. As a corollary, given that we have to honor blank fields as existing, we’re forced to interpret e.g. a 0-byte file as one record with a single blank field.

We handled the situation with an empty yacc rule in field.opt. Empty rules allow the parser to reduce when it sees unexpected lookahead tokens. Perhaps it’s possible to use fancy tricks in the lexer (like trailing context and start conditions) to match empty non-escaped fields as well. However, I think an empty parser rule is more elegant.

Three notes about empty rules:

  1. We wrote the empty rule in a way that plain yacc can understand. If you want to use a Bison extension, you can write empty rules as %empty, which distinguishes them from accidentally missing rules (see the sketch after this list).
  2. Bison’s --graph visualization doesn’t render empty rules properly. Use the -v option and examine the textual .output file to see the rule.
  3. Adding multiple empty rules can be a common source of reduce/reduce conflicts. I ran into this with early experiments in parsing CSV, and the Bison manual section 5.6 provides a great example.
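
To make note 1 concrete, here’s the same field.opt rule written with the Bison extension:

field.opt :
  %empty
| FIELD
;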

Now that we’ve seen the structure of the grammar, let’s fill in the skeleton to process the CSV content. From now on, examples in this article will use my libderp library for basic data structures like maps and vectors.

/* csv.l */

%{
#define _XOPEN_SOURCE 600
#include <stdlib.h>
#include <string.h>

/* the union in csv.tab.h requires the vector type, and
   plain yacc doesn't have "%code requires" to provide
   the include like Bison, so we include derp/vector.h */
#include <derp/vector.h>
#include "csv.tab.h"
%}

%%

\"([^"]|\"\")*\" {
	/* yyleng is precomputed strlen(yytext) */
	size_t i, n = yyleng;
	char *s;

	s = yylval.str = calloc(n, 1);
	if (!s)
		return FIELD;

	/* copy yytext, changing "" to " */
	for (i = 1 /*skip 0="*/; i < n-1; i++)
	{
		*s++ = yytext[i];
		if (yytext[i] == '"')
			i++; /* skip second one */
	}
	return FIELD;
}

[^",\r\n]+ { yylval.str = strdup(yytext); return FIELD; }
\n|\r\n    { return CRLF; }
.          { return *yytext; }

The complete parser below combines values from the lexer into full records, using the vector type. It then prints each record and frees it.

/* csv.y  (plain yacc) */

%{
	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>

	/* for the vector datatype and v_ functions */
	#include <derp/vector.h>
	/* for helper function derp_free */
	#include <derp/common.h>

	int yylex(void);
	int yyerror(const char *s);
	bool one_empty_field(vector *);
%}

%union
{
	char *str;
	vector *record;
}

%token CRLF
%token <str> FIELD
%type <str> field.opt
%type <record> record

/* in bison, add this:

%destructor { free($$); } <str>
%destructor { v_free($$); } <record>
*/

%%

file :
  consumed_record
| file CRLF consumed_record
;

/* A record can be constructed in two ways, but we want to
   run the same side effect for either case. We add an
   intermediate non-terminal symbol "consumed_record" just
   to perform the action. In library code, this would be a
   good place to send the record to a callback function. */

consumed_record :
  record {
	/* a record comprised of exactly one blank field is a
	   blank record, which we can skip */
	if (!one_empty_field($1))
	{
		size_t n = v_length($1);
		printf("#fields = %zu\n", n);
		for (size_t i = 0; i < n; i++)
			printf("\t%s\n", (char*)v_at($1, i));
	}
	v_free($1);
  }
;

record :
  field.opt {
	/* In our earlier example, lisp.y, we showed how to check
	   for memory allocation failure. We skip that here for
	   brevity. */

	vector *r = v_new();
	v_dtor(r, derp_free, NULL);
	v_append(r, $1);
	$$ = r;
  }
| record ',' field.opt {
	v_append($1, $3);
	$$ = $1;
  }
;

field.opt :
  /* empty */ { $$ = calloc(1,1); }
| FIELD
;

%%

bool one_empty_field(vector *r)
{
	return v_length(r) == 1 && *((char*)v_first(r)) == '\0';
}

int yyerror(const char *s)
{
	return fprintf(stderr, "%s\n", s);
}

Build it using the steps shown for earlier examples. You’ll also need to link with libderp version 0.1.0; the project readme shows how.
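
As a sketch, the commands look like this (the -lderp flag is an assumption; adjust to however libderp is installed on your system):

yacc -d -b csv csv.y
lex -t csv.l > csv.lex.c
cc -c csv.tab.c csv.lex.c
cc -o csv csv.tab.o csv.lex.o -lderp -ly -ll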

Next, verify with test cases:

# https://en.wikipedia.org/wiki/Comma-separated_values#Example

$ ./csv << EOF
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
EOF
#fields = 5
        Year
        Make
        Model
        Description
        Price
#fields = 5
        1997
        Ford
        E350
        ac, abs, moon
        3000.00
#fields = 5
        1999
        Chevy
        Venture "Extended Edition"
        
        4900.00
#fields = 5
        1999
        Chevy
        Venture "Extended Edition, Very Large"
        
        5000.00
#fields = 5
        1996
        Jeep
        Grand Cherokee
        MUST SELL!
air, moon roof, loaded
        4799.00
# extra testing for empty fields before crlf and eof

$ printf ",\n," | ./csv
#fields = 2
        
        
#fields = 2
        
        

Parsing a more complicated RFC

IRCv3 extends the Internet Relay Chat (IRC) protocol with useful features. Its core syntactical change to support new features is message tagging. We’ll write a parser to extract information from RFC 1459 messages, including IRCv3 tags.

The BNF from this standard is written in a slightly different dialect than that of the CSV RFC.

<message>       ::= ['@' <tags> <SPACE>] [':' <prefix> <SPACE> ] <command> [params] <crlf>

<tags>          ::= <tag> [';' <tag>]*
<tag>           ::= <key> ['=' <escaped_value>]
<key>           ::= [ <client_prefix> ] [ <vendor> '/' ] <key_name>
<client_prefix> ::= '+'
<key_name>      ::= <non-empty sequence of ascii letters, digits, hyphens ('-')>
<escaped_value> ::= <sequence of zero or more utf8 characters except NUL, CR, LF, semicolon (`;`) and SPACE>
<vendor>        ::= <host>
<host>          ::= see RFC 952 [DNS:4] for details on allowed hostnames

<prefix>        ::= <servername> | <nick> [ '!' <user> ] [ '@' <host> ]
<nick>          ::= <letter> { <letter> | <number> | <special> }
<command>       ::= <letter> { <letter> } | <number> <number> <number>
<SPACE>         ::= ' ' { ' ' }
<params>        ::= <SPACE> [ ':' <trailing> | <middle> <params> ]
<middle>        ::= <Any *non-empty* sequence of octets not including SPACE
                    or NUL or CR or LF, the first of which may not be ':'>
<trailing>      ::= <Any, possibly *empty*, sequence of octets not including
                      NUL or CR or LF>

<user>          ::= <nonwhite> { <nonwhite> }
<letter>        ::= 'a' ... 'z' | 'A' ... 'Z'
<number>        ::= '0' ... '9'
<crlf>          ::= CR LF

As before, it’s helpful to start from the bottom up, applying the power of lex regexes. However, we run into the problem that most of the tokens match almost anything. The same string could conceivably be a host, nick, user, key_name, and command all at once. Lex would match the string with whichever rule is listed first.

Yacc can’t easily pass lex any clues about what tokens it expects, given what tokens have come before. Lex is on its own. For this reason, the designers of lex gave it a way to keep a memory. Rules can be tagged with a start condition, saying they are eligible only in certain states. Rule actions can then enter new states prior to returning.

/* Incomplete irc.l, showing start conditions and patterns.

   This lexer produces the following tokens:
   SPACE COMMAND MIDDLE TRAILING TAG PREFIX ':' '@'
*/

/* It's nice to prefix the regex names with "re_"
   to see them better in the rules */

re_space    [ ]+
re_host     [[:alnum:]][[:alnum:]\.\-]*
re_nick     [[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*
re_user     [~[:alpha:]][[:alnum:]]*
re_keyname  [[:alnum:]\-]+
re_keyval   [^ ;\r\n]*
re_command  [[:alpha:]]+|[[:digit:]]{3}
re_middle   [^: \r\n][^ \r\n]*
re_trailing [^\r\n]*

/* Declare start conditions. The "%x" means
   they are exclusive, vs "%s" for inclusive. */

%x IN_TAGS IN_PREFIX IN_PARAMS

%%

 /* these patterns are not tagged with a start
    condition, and are active in the default state
    of INITIAL. They will match only when none of
    the exclusive conditions are active. They
    *would* match on inclusive states (but we have
    none).

    The BEGIN command changes state. */

@ { BEGIN IN_TAGS; return *yytext; }
: { BEGIN IN_PREFIX; return *yytext; }

{re_space} { return SPACE; }
{re_command} {
	/* TODO: construct yylval */
	BEGIN IN_PARAMS;
	return COMMAND;
}


 /* these patterns will only match IN_TAGS, which
    as we saw earlier, gets activated from the
    INITIAL state when "@" is encountered */

<IN_TAGS>\+?({re_host}\/)?{re_keyname}(={re_keyval})?  {
	/* TODO: construct yylval */
	return TAG;
}
<IN_TAGS>{re_space} {
	BEGIN INITIAL;
	return SPACE;
}
<IN_TAGS>; { return ';'; }


<IN_PREFIX>({re_host})|({re_nick})(!{re_user})?(@{re_host})? {
	/* TODO: construct yylval */
	BEGIN INITIAL;
	return PREFIX;
}


<IN_PARAMS>{re_space} { return SPACE; }
<IN_PARAMS>{re_middle} {
	/* TODO: construct yylval */
	return MIDDLE;
}
<IN_PARAMS>:{re_trailing} {
	/* TODO: construct yylval */
	BEGIN INITIAL;
	return TRAILING;
}


 /* the "*" state applies to all states,
    including INITIAL and the exclusive ones */

<*>\n|\r\n  ; /* ignore */

We’ll revisit the lexer to fill in details for assigning yylval. First, let’s see the parser and its data types.

/* irc.y  (Bison only)

   Using Bison mostly for the %code positions, making 
   it easier to use libderp between flex and bison.

   - WARNING -
   There is absolutely no memory hygiene in this example.
   We don't check for allocation failure, and we don't free
   things when done. See the earlier lisp.y/.l examples
   for guidance about that.
*/

/* output more descriptive messages than "syntax error" */
%define parse.error verbose

%code top {
	#define _XOPEN_SOURCE 600
	#include <stdio.h>
	#include <stdlib.h>
}

%code requires {
	#include <derp/list.h>
	#include <derp/treemap.h>

	struct prefix
	{
		char *host;
		char *nick;
		char *user;
	};

	/* building an irc_message is the overall
	   goal for this parser */
	struct irc_message
	{
		treemap *tags;
		struct prefix *prefix;
		char *command;
		list *params;
	};
}

%code provides {
	int yyerror(char const *msg);
	int yylex(void);
	void message_print(struct irc_message *m);
}

%union
{
	char *str;
	struct prefix *prefix;
	treemap *map;
	struct map_pair *pair;
	list *list;
	struct irc_message *msg;
}

%token          SPACE
%token <str>    COMMAND MIDDLE TRAILING
%token <pair>   TAG
%token <prefix> PREFIX

%type <msg> message tagged_message prefixed_message
%type <map> tags
%type <list> params

%%

 /* Like in the CSV example, we start with a dummy
    rule just to add side-effects */

final :
  tagged_message { message_print($1); }
;

 /* Messages begin with two optional components,
    a set of tags and a prefix.
 
    <message> ::= ['@' <tags> <SPACE>] [':' <prefix> <SPACE> ] <command> [params]
 
    Rather than making a single message rule with
    tons of variations (and duplicated code), I chose
    to build the message in stages.
 
    tagged_message <- prefixed_message <- message
 
    A prefixed_message adds prefix information, or
    passes the message along verbatim if there is none.
    Similarly for tagged_message. */

tagged_message :

  /* When there are more than one matched token,
     it's helpful to add Bison "named references"
     in brackets. Thus, below, the rule can refer to
     $ts rather than $2, or $msg rather than $4.
     Makes it way easier to rearrange tokens while
     you're experimenting. */

  '@' tags[ts] SPACE prefixed_message[msg] {
	$msg->tags = $ts;
	$$ = $msg;
  }

  /* here's the pass-through case when there are
     no tags on the message */

| prefixed_message
;

prefixed_message :
  ':' PREFIX[pfx] SPACE message[msg] {
	$msg->prefix = $pfx;
	$$ = $msg;
  }
| message
;

message :
  COMMAND[cmd] params[ps] {
	struct irc_message *m = malloc(sizeof *m);
	*m = (struct irc_message) {
		.command=$cmd, .params=$ps
	};
	$$ = m;
  }
;

tags :
  TAG {
	treemap *t = tm_new(derp_strcmp, NULL);
	tm_insert(t, $1->k, $1->v);
	$$ = t;
  }
| tags[ts] ';' TAG[t] {
	tm_insert($ts, $t->k, $t->v);
	$$ = $ts;
  }
;

params :
  SPACE TRAILING {
	$$ = l_new();
	l_prepend($$, $2);
  }
| SPACE MIDDLE[mid] params[ps] {
	l_prepend($ps, $mid);
	$$ = $ps;
  }
| %empty {
	$$ = l_new();
  }
;

%%

int yyerror(char const *msg)
{
	return fprintf(stderr, "%s\n", msg);
}

void message_print(struct irc_message *m)
{
	if (m->tags)
	{
		struct tm_iter  *it = tm_iter_begin(m->tags);
		struct map_pair *p;

		puts("Tags:");
		while ((p = tm_iter_next(it)) != NULL)
			printf("\t'%s'='%s'\n", (char*)p->k, (char*)p->v);
		tm_iter_free(it);
	}
	if (m->prefix)
		printf("Prefix: Nick %s, User %s, Host %s\n",
		       m->prefix->nick, m->prefix->user,
			   m->prefix->host);
	if (m->command)
		printf("Command: %s\n", m->command);
	if (!l_is_empty(m->params))
	{
		puts("Params:");
		for (list_item *li = l_first(m->params); li; li = li->next)
			printf("\t%s\n", (char*)li->data);
	}
}

Returning to the lexer, here is the code with all the details filled in to construct yylval for the tokens.

/* irc.l  - complete file */

%option noyywrap nounput noinput

%{
#include "irc.tab.h"

#define _XOPEN_SOURCE 600

#include <limits.h>
#include <stdlib.h>
#include <string.h>
%}

re_space    [ ]+
re_host     [[:alnum:]][[:alnum:]\.\-]*
re_nick     [[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*
re_user     [~[:alpha:]][[:alnum:]]*
re_keyname  [[:alnum:]\-]+
re_keyval   [^ ;\r\n]*
re_command  [[:alpha:]]+|[[:digit:]]{3}
re_middle   [^: \r\n][^ \r\n]*
re_trailing [^\r\n]*

%x IN_TAGS IN_PREFIX IN_PARAMS

%%

@ { BEGIN IN_TAGS; return *yytext; }
: { BEGIN IN_PREFIX; return *yytext; }

{re_space} { return SPACE; }
{re_command} {
	yylval.str = strdup(yytext);
	BEGIN IN_PARAMS;
	return COMMAND;
}


<IN_TAGS>\+?({re_host}\/)?{re_keyname}(={re_keyval})?  {
	struct map_pair *p = malloc(sizeof *p);
	char *split = strchr(yytext, '=');
	if (split)
		*split = '\0';
	*p = (struct map_pair){
		.k = strdup(yytext),
		.v = split ? strdup(split+1) : calloc(1,1)
	};
	yylval.pair = p;
	return TAG;
}
<IN_TAGS>{re_space} {
	BEGIN INITIAL;
	return SPACE;
}
<IN_TAGS>; { return ';'; }


<IN_PREFIX>({re_host})|({re_nick})(!{re_user})?(@{re_host})? {
	struct prefix *p = malloc(sizeof *p);
	if (!p)
		goto done;
	*p = (struct prefix){0};
	char *bang = strchr(yytext, '!'),
	     *at   = strchr(yytext, '@');
	if (!bang && !at)
	{
		p->host = strdup(yytext);
		goto done;
	}
	if (bang) *bang = '\0';
	if (at) *at = '\0';
	p->nick = strdup(yytext);
	if (bang)
		p->user = strdup(bang+1);
	if (at)
		p->host = strdup(at+1);
done:
	yylval.prefix = p;
	BEGIN INITIAL;
	return PREFIX;
}


<IN_PARAMS>{re_space} { return SPACE; }
<IN_PARAMS>{re_middle} {
	yylval.str = strdup(yytext);
	return MIDDLE;
}
<IN_PARAMS>:{re_trailing} {
	yylval.str = strdup(yytext+1); /* trim : */
	BEGIN INITIAL;
	return TRAILING;
}

<*>\n|\r\n  ; /* ignore */

Build irc.y and irc.l according to our typical pattern (and link with libderp). Here’s an example of the IRCv3 parser in action:

# Try an example from
# https://ircv3.net/specs/extensions/message-tags#examples

$ ./irc <<EOF
@aaa=bbb;ccc;example.com/ddd=eee :nick!ident@host.com PRIVMSG me :Hello
EOF
Tags:
        'aaa'='bbb'
        'ccc'=''
        'example.com/ddd'='eee'
Prefix: Nick nick, User ident, Host host.com
Command: PRIVMSG
Params:
        me
        Hello

Further resources

  • POSIX (issue 7) specifications for Lex and Yacc. (To view POSIX docs locally, try begriffs/posix-man.)
  • Lex & Yacc, 2nd ed by John R. Levine, Tony Mason, Doug Brown. Levine subsequently wrote an updated book called flex & bison: Text Processing Tools. However, I chose the older edition to get a better feel for history and portability.
  • To bridge the gap between core knowledge and the latest features, consult the GNU Bison manual and the Flex manual. (You can build the Flex manual from source, or download version 2.6.4 that I’ve pre-built for you as PDF.)
  • Effective Flex & Bison by Chris L. verBurg is a collection of tips for “correctness, efficiency, robustness, complexity, maintainability and usability.” It’s clear Chris has plenty of experience writing real-world parsers.
  • Vim has classic yacc highlighting built in, but you can add support for Bison extensions with justinmk/vim-syntax-extra.

Written by Joe "begriffs" Nelson.