Although parsing is often described from the perspective of writing a compiler, there are many common smaller tasks where it’s useful. Reading file formats, talking over the network, creating shells, and analyzing source code are all easier using a robust parser.
By taking time to learn general-purpose parsing tools, you can go beyond fragile homemade solutions and inflexible third-party libraries. We’ll cover Lex and Yacc in this guide because they are mature and portable. We’ll also cover their later incarnations, Flex and Bison.
Above all, this guide is practical. We’ll see how to properly integrate parser
generators into your build system, how to create thread-safe parsing modules,
and how to parse real data formats. I’ll motivate each feature of the parser
generator with a concrete problem it can solve. And, I promise, none of the
typical calculator examples.
Lexical scanning
People usually use two stages to process structured text. The first stage, lexing (aka scanning), breaks the input into meaningful chunks of characters. The second, parsing, groups the scanned chunks following potentially recursive rules. However, a nice lexing tool like Lex can be useful on its own, even when not paired with a parser.
The simplest way to describe Lex is that it runs user-supplied C code blocks for regular expression matches. It reads a list of regexes and constructs a giant state machine which attempts to match them all “simultaneously.”
A lex input file is composed of three possible sections: definitions, rules,
and helper functions. The sections are delimited by %%. Lex transforms its
input file into a plain C file that can be built using an ordinary C compiler.
Here’s an example. We’ll match the strings cot, cat, and cats. Our
actions will print a replacement for each.
/* catcot.l */
%{
#include <stdio.h>
%}
%%
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }To build it:
# turn the input into an intermediate C file
lex -t catcot.l > catcot.c
# compile it
cc -o catcot catcot.c -ll
(Alternately, build it in one step with make catcot. Even in the absence of a Makefile, POSIX make has suffix rules that handle .l files.)
The program outputs simple substitutions:
echo "the cat on the cot joined the cats" | ./catcot
the thankless pet on the portable bed joined the anti-herd
The reason it prints non-matching words (such as “the”) is that there’s an implicit rule matching any character (.) and echoing it. In most real parsers we’ll want to override that.
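For instance, a final catch-all rule silences everything the other patterns miss (a minimal sketch; the \n alternative is needed because . doesn’t match newlines):
.|\n ;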
Here’s what’s happening inside the scanner. Lex reads the regexes and generates a state machine to consume input. Below is a visualization of the states, with transitions labeled by input character. The circles with a double outline indicate states that trigger actions.
Note there’s no notion of word boundaries in our lexer; it’s operating on characters alone. For instance:
echo "catch!" | ./catcot
thankless petch!
That sounds rather like an insult.
An important subtlety is how Lex handles multiple eligible matches. It picks the longest possible match available, and in the case of a tie, picks the matching pattern defined earliest.
To illustrate, suppose we add a looser regex, c.t, first.
%%
c.t { printf("mumble mumble"); }
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }Lex detects that the rule masks cat and cot, and outputs a warning:
catcot.l:10: warning, rule cannot be matched
catcot.l:11: warning, rule cannot be matched
It still compiles though, and behaves like this:
echo "the cat on the cot joined the cats" | ./catcot
the mumble mumble on the mumble mumble joined the anti-herd
Notice that it still matched cats, because cats is longer than c.t.
Compare what happens if we move the loose regex to the end of our rules. It can then pick up whatever strings get past the others.
%%
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }
c.t { printf("mumble mumble"); } It acts like this:
echo "cut the cot" | ./catcot
mumble mumble the portable bed
Now’s a good time to take a detour and observe how our user-defined code acts
in the generated C file. Lex creates a function called yylex(), and inserts
the code blocks verbatim into a switch statement. When using lex with a parser,
the parser will call yylex() to retrieve tokens, named by integers. For now,
our user-defined code isn’t returning tokens to a parser, but doing simple
print statements.
/* catcot.c (generated by lex) */
int yylex (void)
{
/* ... */
switch ( yy_act )
{
/* ... */
case 1:
YY_RULE_SETUP
#line 9 "catcot.l"
{ printf("portable bed"); }
YY_BREAK
case 2:
YY_RULE_SETUP
#line 10 "catcot.l"
{ printf("thankless pet"); }
YY_BREAK
case 3:
YY_RULE_SETUP
#line 11 "catcot.l"
{ printf("anti-herd"); }
YY_BREAK
/* ... */
}
/* ... */
}
As mentioned, a lex file is composed of three sections:
DEFINITIONS
%%
RULES
%%
HELPER FUNCTIONS
The definitions section is where you can embed C code to include headers and declare functions used in rules. The definitions section can also define friendly names for regexes that can be reused in the rules.
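For example, a definition names a regex, and rules expand the name in curly braces (a minimal sketch):
DIGIT [0-9]
%%
{DIGIT}+ { printf("number"); }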
The rules section, as we saw, contains a list of regexes and associated user code.
The final section is where to put the full definitions of helper functions.
This is also where you’d put the main() function. If you omit main(), the
Lex library provides one that simply calls yylex(). This default main()
implementation (and implementations for a few other functions) is available by
linking your lex-generated C code with the -ll compiler flag.
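For instance, flex’s library version of main() is roughly this sketch (other lex implementations differ in the details):
int main(void)
{
/* keep scanning until yylex() reports end of input */
while (yylex() != 0)
;
return 0;
}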
Let’s see a short, fun example: converting Roman numerals to decimal. Thanks to lex’s behavior of matching longer strings first, it can read the single-letter numerals, but look ahead for longer subtractive forms like “IV” or “XC.”
/* roman-lex.l */
/* the %{ ... %} enclose C blocks that are copied
into the generated code */
%{
#include <stdio.h>
#include <stdlib.h>
/* globals are visible to user actions amd main() */
int total;
%}
%%
/*<- notice the whitespace before this comment, which
is necessary for comments in the rules section */
/* the basics */
I { total += 1; }
V { total += 5; }
X { total += 10; }
L { total += 50; }
C { total += 100; }
D { total += 500; }
M { total += 1000; }
/* special cases match with preference
because they are longer strings */
IV { total += 4; }
IX { total += 9; }
XL { total += 40; }
XC { total += 90; }
CD { total += 400; }
CM { total += 900; }
/* ignore final newline */
\n ;
/* but die on anything else */
. {
fprintf(stderr, "unexpected: %s\n", yytext);
exit(EXIT_FAILURE);
}
%%
/* provide our own main() rather than the implementation
from lex's library linked with -ll */
int main(void)
{
/* only have to call yylex() once, since our
actions don't return */
yylex();
fprintf(yyout, "%d\n", total);
return EXIT_SUCCESS;
}
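Build and run it like the earlier example; since we defined our own main(), the linker prefers it over the one in the lex library:
lex -t roman-lex.l > roman-lex.c
cc -o roman-lex roman-lex.c -ll
echo MCMXCIV | ./roman-lex
1994
More realistic scanner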
Now that we’ve seen Lex’s basic operation in the previous section, let’s consider a useful example: syntax highlighting. Detecting keywords in syntax is a problem that lex can handle by itself, without help from yacc.
Because lex and yacc are so old (predating C), and used in so many projects, you can find grammars already written for most languages. For instance, we’ll take quut’s C specification for lex, and modify it to do syntax highlighting.
This relatively short program accurately handles the full complexity of the language. It’s easiest to understand by reading in full. See the inline comments for new and subtle details.
/* c.l syntax highlighter */
%{
/* POSIX for isatty, fileno */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
/* declarations are visible to user actions */
enum FG
{
fgRED = 31, fgGREEN = 32,
fgORANGE = 33, fgCYAN = 36,
fgDARKGREY = 90, fgYELLOW = 93
};
void set_color(enum FG);
void reset_color(void);
void color_print(enum FG, const char *);
void consume_comment(void);
%}
/* named regexes we can use in rules */
O [0-7]
D [0-9]
NZ [1-9]
L [a-zA-Z_]
A [a-zA-Z_0-9]
H [a-fA-F0-9]
HP (0[xX])
E ([Ee][+-]?{D}+)
P ([Pp][+-]?{D}+)
FS (f|F|l|L)
IS (((u|U)(l|L|ll|LL)?)|((l|L|ll|LL)(u|U)?))
CP (u|U|L)
SP (u8|u|U|L)
ES (\\(['"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))
WS [ \t\v\n\f]
%%
/* attempting to match and capture an entire multi-line
comment could strain lex's buffers, so we match the
beginning, and call consume_comment() to deal with
the ensuing characters, in our own less resource-
intensive way */
"/*" {
set_color(fgDARKGREY);
/* For greater flexibility, we'll output to lex's stream, yyout.
It defaults to stdout. */
fputs(yytext, yyout);
consume_comment();
reset_color();
}
/* single-line comments can be handled the default way.
The yytext variable is provided by lex and points
to the characters that match the regex */
"//".* {
color_print(fgDARKGREY, yytext);
}
^[ \t]*#.* {
color_print(fgRED, yytext);
}
/* you can use the same code block for multiple regexes */
auto |
bool |
char |
const |
double |
enum |
extern |
float |
inline |
int |
long |
register |
restrict |
short |
size_t |
signed |
static |
struct |
typedef |
union |
unsigned |
void |
volatile |
_Bool |
_Complex {
color_print(fgGREEN, yytext);
}
break |
case |
continue |
default |
do |
else |
for |
goto |
if |
return |
sizeof |
switch |
while {
color_print(fgYELLOW, yytext);
}
/* we use the named regexes heavily below; putting
them in curly brackets expands them */
{L}{A}* {
/* without this rule, keywords within larger words
would be highlighted, like the "if" in "life" --
this rule prevents that because it's a longer match */
fputs(yytext, yyout);
}
{HP}{H}+{IS}? |
{NZ}{D}*{IS}? |
"0"{O}*{IS}? |
{CP}?"'"([^'\\\n]|{ES})+"'" |
{D}+{E}{FS}? |
{D}*"."{D}+{E}?{FS}? |
{D}+"."{E}?{FS}? |
{HP}{H}+{P}{FS}? |
{HP}{H}*"."{H}+{P}{FS}? |
{HP}{H}+"."{P}{FS}? {
color_print(fgCYAN, yytext);
}
({SP}?\"([^"\\\n]|{ES})*\"{WS}*)+ {
color_print(fgORANGE, yytext);
}
/* explicitly mention the default rule */
. ECHO;
%%
/* definitions of the functions we declared earlier */
/* the color functions use ANSI escape codes, and may
not be portable across all terminal emulators. */
void set_color(enum FG c)
{
fprintf(yyout, "\033[%d;1m", c);
}
void reset_color(void)
{
fputs("\033[0m", yyout);
}
void color_print(enum FG c, const char *s)
{
set_color(c);
fputs(s, yyout);
reset_color();
}
/* this function directly consumes characters in lex
using the input() function. It pulls characters
from the same stream that the regex state machine
reads. */
void consume_comment(void)
{
int c;
/* EOF in lex is 0, which is different from
the EOF macro in the C standard library */
while ((c = input()) != 0)
{
putchar(c);
if (c == '*')
{
while ((c = input()) == '*')
putchar(c);
if (c == 0) break;
putchar(c);
if (c == '/') return;
}
}
}
int main(void)
{
if (!isatty(fileno(stdout)))
{
/* a more flexible option would be to make the
color changing functions do nothing, but that's
too much fuss for an example program */
fputs("Stdout is not a terminal\n", stderr);
return EXIT_FAILURE;
}
/* since we'll be changing terminal color, be sure to
reset it for any program termination event */
atexit(reset_color);
/* let our lex rules do the rest */
yylex();
return EXIT_SUCCESS;
}
Using a scanner as a library
One of the biggest areas of improvement between classic lex/yacc and flex/bison is the ability of the latter to generate code that’s easier to embed into a larger application. Lex and yacc are designed to create standalone programs, with user-defined code blocks stuck inside. When classic lex and yacc work together, they use a bunch of global variables.
Flex and Bison, on the other hand, can generate thread-safe functions with uniquely prefixed names that can be safely linked into larger programs. To demonstrate, we’ll do another scanner (with Flex this time).
The following Rube Goldberg contraption uses Flex to split words on whitespace and call a user-supplied callback for each word. There’s certainly an easier non-Flex way to do this task, but this example illustrates how to encapsulate Flex code into a reusable library.
/* words.l */
/* don't generate functions we don't need */
%option nounput noinput noyywrap
/* generate a scanner that's thread safe */
%option reentrant
/* Generate "words" rather than "yy" as a prefix, e.g.
wordslex() rather than yylex(). This allows multiple
Flex scanners to be linked with the same application */
%option prefix="words"
%%
[^ \t\n]+ {
/* the return statement causes yylex to stop and return */
return 1; /* our code for a word token */
}
/* do nothing for any other characters, don't
output them as would be the default behavior */
.|\n ;
%%
/* Callers interact with this function, which neatly hides
the Flex inside.
Also, we'll call "yy" functions like "yylex()" inside,
and Flex will rename them in the resulting C file to
calls with the "words" prefix, like "wordslex()"
Zero return means success, nonzero is a Flex error
code. */
int words_callback(char *s, void (*f)(const char *))
{
/* in the reentrant mode, we maintain our
own scanner and its associated state */
int i;
yyscan_t scanner;
YY_BUFFER_STATE buf;
if ((i = yylex_init(&scanner)) != 0)
return i;
/* read from a string rather than a stream */
buf = yy_scan_string(s, scanner);
/* Each time yylex finds a word, it returns nonzero.
It resumes where it left off when we call it again */
while ((i = yylex(scanner)) > 0)
{
/* call the user supplied function f with
yytext of the match */
f(yyget_text(scanner));
}
/* clean up */
yy_delete_buffer(buf, scanner);
yylex_destroy(scanner);
return 0;
}
Build it like this:
# generate scanner, build object file
flex -t words.l > words.c
cc -c words.c
# verify that all public text symbols are prefixed by "words"
nm -g words.o | grep " T "
Every exported function should begin with words. A calling program can use our library without seeing any Flex internals.
/* test_words.c */
#include <stdio.h>
/* words_callback defined in the object file -- you could put
this declaration in a header file words.h */
int words_callback(char *, void (*)(const char *));
void print_word(const char *w)
{
puts(w);
/* if you want to use the parameter w in the future, you
need to duplicate it in memory whose lifetime you control */
}
int main(void)
{
words_callback(
"The quick brown fox\n"
"jumped over the lazy dog\n",
&print_word
);
return 0;
}
To build the program, just link it with words.o.
cc -o test_words test_words.c words.o
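The program prints each whitespace-separated word on its own line:
./test_words
The
quick
brown
fox
jumped
over
the
lazy
dog
Parsing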
Now that we’ve seen how to identify tokens with a scanner, let’s learn how a parser can act on the tokens using recursive rules. Yacc, byacc, and Bison generate LALR (look-ahead LR) parsers, and Bison supports more powerful modes if desired.
Mental model of LR parsing
LR parsers build bottom-up toward a goal, shifting tokens onto a stack and combining (“reducing”) them according to rules. It’s helpful to get a mental model for this process, so let’s jump into a simple example and simulate what yacc does.
Here’s a yacc grammar with a single rule to build a result called foo. We specify that foo is comprised of lex tokens A, B, and C.
%token A B C
%%
foo: A B C
Yacc transforms the grammar into a state machine which looks like this:
The first rule in the file (and the only rule in our case) becomes yacc’s
goal. Yacc begins in state 0, with the implicit rule 0: $accept: • foo $end.
The parse will be accepted if we can produce a foo followed immediately by
the end of input. The bullet point indicates our progress reading the input. In
state 0 it’s at the beginning, meaning we haven’t read anything yet.
Initially there’s no lookahead token, so yacc calls yylex() to get one. If
lex produces an A, we follow the state transition to state 1. Because the arrow
is a solid line, not dashed, yacc “shifts” the token to its token stack. It also
pushes state 1 onto a state stack, which now holds states 0 and 1.
State 1 is trying to satisfy the rule which it calls rule 1, namely 1 foo: A • B C. The bullet point after the A indicates we’ve seen the A already. Don’t
confuse the state numbers and rule numbers – yacc numbers them independently.
Yacc continues processing input, shifting tokens and moving to states 3 and 5 if lex produces the expected tokens. If, at any point, lex produces a token not matching any transitions in the current state, then yacc reports a syntax error and terminates. (There’s a way to do error recovery, but that’s another topic.)
State 5 has seen all necessary tokens for rule 1: 1 foo: A B C •. Yacc
continues to the diamond marked “R1,” which is a reduction action. Yacc
“reduces” rule 1, popping the A, B, C terminal tokens off the stack and pushing
a single non-terminal foo token. When it pops the three tokens, it pops the
same number of states (states 5, 3, and 1). Popping three states lands us back
in state 0.
State 0 has a dashed line going to state 2 that matches the foo token that was just reduced. The dashed line means “goto” rather than “shift,” because rule 0 doesn’t have to shift anything onto the token stack. The previous reduction already took care of that.
Finally, state 2 asks lex for another token, and if lex reports EOF, that
matches $end and sends us to state 4, which ties a ribbon on it with the
Acc(ept) action.
From what we’ve seen so far, each state may seem to be merely tracking progress through a single rule. However, states actually track all legal ways forward from tokens previously consumed. A single state can track multiple candidate rules. For instance:
%token A B C
%%
/* foo is either x or y */
foo: x | y;
/* x and y both start with an A */
x: A B;
y: A C;
For this grammar, yacc produces the following state machine:
In state 1 we’ve seen token A, and so rules 3 and 4 are both in the running to reduce an x or y. On a B or C token, the possibilities narrow to a single rule (in state 5 or 6).
Also notice that our rule foo : x | y doesn’t occur verbatim in any states.
Yacc separates it into 1 foo: x and 2 foo: y. Thus, the numbered rules
don’t always match the rules in the grammar one-to-one.
Yacc can also peek ahead by one token to choose which rule to reduce, without shifting the “lookahead” token. In the following grammar, rules x and y match the same tokens. However, the foo rule can say to choose x when followed by a B, or y when followed by a C:
%token A B C
%%
foo : x B | y C;
x : A;
y : A;
Note multiple reductions coming out of state 1 in the generated state machine:
The presence of a bracketed token ([C]) exiting state 1 indicates that the
state uses lookahead. If the state sees token C, it reduces rule 4. Otherwise
it reduces rule 3. Lookahead tokens remain to be read when following a
dashed-line (goto) action, such as from state 0 to state 4.
Ambiguous grammars
While yacc is a powerful tool to transform a grammar into a state machine, it may not operate the way you intend on ambiguous grammars. These are grammars with a state that could proceed in more than one way with the same input.
As grammars get complicated, it’s quite possible to create ambiguities. Let’s look at small examples that make it easier to see the mechanics of the conflict. That way, when it happens in a real grammar, we’ll have a better feeling for it.
In the following example, the input A B matches both x and y B. There’s
no reason for yacc to choose one construction over the other when reducing to
foo. So why does this matter, you ask? Don’t we get to foo either way? Yes,
but real parsers will have different user code assigned to run per rule, and it
matters which code block gets executed.
%token A B
%%
foo : x | y B ;
x : A B ;
y : A ;
The state machine shows ambiguity at state 1:
At state 1, when the next token is B, the state could shift the token and enter state 5 (attempting to reduce x). It could also reduce y and leave B as lookahead. This is called a shift/reduce conflict. Yacc’s policy in such a conflict is to favor a shift over a reduce.
Alternately, we can construct a grammar with a state that has more than one
eligible reduction for the same input. The purest toy example would be foo : A | A, generating:
In a reduce/reduce conflict, yacc chooses to reduce the conflicting rule presented earlier in the grammar.
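To see exactly where a conflict arises, ask yacc for its state machine report. The -v flag writes a textual description of every state to y.output, calling out conflicts by rule. (A sketch: conflict.y is a hypothetical grammar file, and bison names the report after the grammar rather than y.output.)
# generate the state machine description alongside y.tab.c
yacc -v conflict.y
# the report names each conflict and the state it occurs in
grep -i conflict y.output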
Constructing semantic values
While matching tokens, parsers typically build a user-defined value in memory to represent features of the input. Once the parse reaches the goal state and succeeds, the user code acts on the memory value (or passes it along to a calling program).
Yacc stores the semantic values of parsed tokens in variables ($1, $2, …) accessible to code blocks, and it provides a variable ($$) for assigning the semantic result of the current code block.
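For instance, in this sketch of a standalone rule (not from our examples), the action reads the semantic values of the two matched tokens and assigns their sum as the rule’s own value:
pair : NUM NUM { $$ = $1 + $2; } ;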
Let’s see it in action. We won’t do a hackneyed calculator, but let’s still make a parser that operates on integers. Integer values allow us to avoid thinking about memory management.
We’ll revisit the roman numeral example, and this time let lex match the digits while yacc combines them into a final result. It’s actually more cumbersome than our earlier way, but illustrates how to work with semantic parse values.
There are some comments in the example below about portability between yacc variants. The three most prominent variants, in order of increasing features, are: the POSIX interface matching roughly the AT&T yacc functionality, byacc (Berkeley Yacc), and GNU Bison.
/* roman.y (plain yacc) */
%{
#include <stdio.h>
/* declarations to fix warnings from sloppy
yacc/byacc/bison code generation. For instance,
the code should have a declaration of yylex. */
int yylex(void);
/* The POSIX specification says yyerror should return
int, although bison documentation says the value is
ignored. We match POSIX just in case. */
int yyerror(const char *s);
%}
/* tokens our lexer will produce */
%token NUM
%%
/* The first rule is the final goal. Yacc will work
backward trying to arrive here. This "results" rule
is a stub we use to print the value from "number." */
results :
number { fprintf(yyout, "%d\n", $1); }
;
/* as the lexer produces more NUMs, keep adding them */
number :
/* this is a common pattern for saying number is one or
more NUMs. Notice we specify "number NUM" and not
"NUM number". In yacc recursion, think "right is wrong
and left is right." */
number NUM { $$ = $1 + $2; }
/* base case, using default rule of $$ = $1 */
| NUM
;
The corresponding lexer matches individual numerals, and returns them with their semantic values.
/* roman.l */
%{
/* The .tab.h file is generated by yacc, and we'll explain
it later */
#include "roman.tab.h"
/* lex communicates semantic token values to yacc through
a shared global variable */
extern int yylval;
%}
/* when using flex (rather than vanilla lex) fix
unused function warnings by adding:
%option noinput nounput
*/
%%
/* The constant for NUM comes from roman.tab.h,
and was generated because we declared
"%token NUM" in roman.y */
I { yylval = 1; return NUM; }
V { yylval = 5; return NUM; }
X { yylval = 10; return NUM; }
L { yylval = 50; return NUM; }
C { yylval = 100; return NUM; }
D { yylval = 500; return NUM; }
M { yylval = 1000; return NUM; }
IV { yylval = 4; return NUM; }
IX { yylval = 9; return NUM; }
XL { yylval = 40; return NUM; }
XC { yylval = 90; return NUM; }
CD { yylval = 400; return NUM; }
CM { yylval = 900; return NUM; }
/* ignore final newline */
\n ;
/* As a default action, return the ascii value of
the character as if it were a token identifier.
The values from roman.tab.h are offset above 256 to
be above any ascii value, so there's no ambiguity.
Our parser won't be expecting these values, so
they will lead to a syntax error */
. { return *yytext; }
To review: lex generates a yylex() function, and yacc generates yyparse() that
calls yylex() repeatedly to get new token identifiers. Lex actions copy
semantic values to yylval which Yacc copies into $-variables accessible in
parser rule actions.
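For instance, the interesting part of the generated roman.tab.h is little more than the token constant (a sketch; the exact number may vary, but yacc assigns token codes starting above 256):
#define NUM 257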
Building an executable roman from the input files roman.y and roman.l
requires explanation. With appropriate command line flags, yacc will create the
files roman.tab.c and roman.tab.h from roman.y. Lex will create
roman.lex.c from roman.l, using token identifiers in roman.tab.h.
In short, here are the build dependencies for each file:
And here’s how to express it all in a Makefile.
# put together object files from lexer and parser, and
# link the yacc and lex libraries (in that order, to pick
# main() from yacc's library rather than lex's)
roman : roman.tab.o roman.lex.o
$(CC) -o $@ roman.tab.o roman.lex.o -ly -ll
# tell make which files yacc will generate
#
# an explanation of the arguments:
# -b roman - name the files roman.tab.*
# -d - generate a .tab.h file too
roman.tab.h roman.tab.c : roman.y
$(YACC) -d -b roman $?
# the object file relies on the generated lexer, and
# on the token constants
roman.lex.o : roman.tab.h roman.lex.c
# can't use the default suffix rule because we're
# changing the name of the output to .lex.c
roman.lex.c : roman.l
$(LEX) -t $? > $@
And now, the moment of truth:
$ make
$ echo MMMCMXCIX | ./roman
3999
Using a parser as a library
In this example we’ll parse LISP S-expressions, limited to string and integer atoms. There’s more going on in this one, such as memory management, different semantic types per token, and packaging the lexer and parser together into a single thread-safe library. This example requires Bison.
/* lisp.y (requires Bison) */
/* a "pure" api means communication variables like yylval
won't be global variables, and yylex is assumed to
have a different signature */
%define api.pure true
/* change prefix of symbols from yy to "lisp" to avoid
clashes with any other parsers we may want to link */
%define api.prefix {lisp}
/* generate much more meaningful errors rather than the
uninformative string "syntax error" */
%define parse.error verbose
/* Bison offers different %code insertion locations in
addition to yacc's %{ %} construct.
The "top" location is good for headers and feature
flags like the _XOPEN_SOURCE we use here */
%code top {
/* XOPEN for strdup */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Bison versions 3.7.5 and above provide the YYNOMEM
macro to allow our actions to signal the unlikely
event that they couldn't allocate memory. Thanks
to the Bison team for adding this feature at my
request. :) YYNOMEM causes yyparse() to return 2.
The following conditional define allows us to use
the functionality in earlier versions too. */
#ifndef YYNOMEM
#define YYNOMEM goto yyexhaustedlab
#endif
}
/* The "requires" code location is designed for defining
data types that we can use as yylval's for tokens. Code
in this section is also added to the .tab.h file for
inclusion by calling code */
%code requires {
enum sexpr_type {
SEXPR_ID, SEXPR_NUM, SEXPR_PAIR, SEXPR_NIL
};
struct sexpr
{
enum sexpr_type type;
union
{
int num;
char *id;
} value;
struct sexpr *left, *right;
};
}
/* These are the semantic types available for tokens,
which we name num, str, and node.
The %union construction is classic yacc as well. It
generates a C union and sets its as the YYSTYPE, which
will be the type of yylval */
%union
{
int num;
char *str;
struct sexpr *node;
}
/* Add another argument in yyparse() so that we
can communicate the parsed result to the caller.
We can't return the result directly, since the
return value is already reserved as an int, with
0=success, 1=error, 2=nomem
NOTE
In our case, the param is a data pointer. However,
if it were a function pointer (such as a callback),
then its type would have to be put behind a typedef,
or else parse-param will mangle the declaration. */
%parse-param {struct sexpr **result}
/* param adds an extra param to yyparse (like parse-param)
but also causes yyparse to send the value to yylex.
In our case the caller will initialize their own scanner
instance and pass it through */
%param {void *scanner}
/* the "provides" location adds the code to our generated
parser, but also to the .tab.h file for use by callers */
%code provides {
void sexpr_free(struct sexpr *s);
}
/* unqualified %code is for internal use, things that
our actions can see. These declarations prevent
warnings. Bison passes the %parse-param and %param
values, in declaration order, before the final
message argument of yyerror */
%code {
int lisperror(void *result, void *scanner, char const *msg);
int lisplex(void *lval, const void *s);
}
/* Now when we declare tokens, we add their type
in brackets. The type names come from our %union */
%token <str> ID
%token <num> NUM
/* whereas tokens come from the lexer, these
non-terminals are defined in the parser, and
we set their types with %type */
%type <node> start sexpr pair list members atom
/* if there's an error partway through parsing, the
caller wouldn't get a chance to free memory for
the work in progress. Bison will clean up the memory
if we provide destructors, though. */
%destructor { free($$); } <str>
%destructor { sexpr_free($$); } <node>
%%
/* once again we use a dummy non-terminal to perform
a side-effect, in this case setting *result */
start :
sexpr { *result = $$ = $1; return 0; }
;
sexpr :
atom
| list
| pair
;
list :
/* This is a shortcut: we use the ascii value for
parens '('=40, ')'=41 as their token codes.
Thus we don't have to define a bunch of crap
manually like LPAREN, RPAREN */
'(' members ')' { $$ = $2; }
| '('')' {
struct sexpr *nil = malloc(sizeof *nil);
if (!nil) YYNOMEM;
*nil = (struct sexpr){.type = SEXPR_NIL};
$$ = nil;
}
;
members :
sexpr {
struct sexpr *s = malloc(sizeof *s),
*nil = malloc(sizeof *nil);
if (!s || !nil) {
free(s), free(nil);
YYNOMEM;
}
*nil = (struct sexpr){.type = SEXPR_NIL};
/* convention: we assume that a previous parser
value like $1 is non-NULL, else it would have
died already with YYNOMEM. We're responsible
for checking only our own allocations */
*s = (struct sexpr){
.type = SEXPR_PAIR,
.left = $1,
.right = nil
};
$$ = s;
}
| sexpr members {
struct sexpr *s = malloc(sizeof *s);
/* Another important memory convention: we
can't trust that our lexer successfully
allocated its yylvalue, because the signature
of yylex doesn't communicate failure. We
assume NULL in $1 means alloc failure and
we report that. The only other way to signal
from yylex would be to make a fake token to
represent out-of-memory, but that's harder */
if (!s) YYNOMEM;
*s = (struct sexpr){
.type = SEXPR_PAIR,
.left = $1,
.right = $2
};
$$ = s;
}
;
pair :
'(' sexpr '.' sexpr ')' {
struct sexpr *s = malloc(sizeof *s);
if (!s) YYNOMEM;
*s = (struct sexpr){
.type = SEXPR_PAIR,
.left = $2,
.right = $4
};
$$ = s;
}
;
atom :
ID {
if (!$1) YYNOMEM;
struct sexpr *s = malloc(sizeof *s);
if (!s) YYNOMEM;
*s = (struct sexpr){
.type = SEXPR_ID,
.value.id = strdup($1)
};
if (!s->value.id)
{
free(s);
YYNOMEM;
}
$$ = s;
}
| NUM {
struct sexpr *s = malloc(sizeof *s);
if (!s) YYNOMEM;
*s = (struct sexpr){
.type = SEXPR_NUM,
.value.num = $1
};
$$ = s;
}
;
%%
/* notice the extra parameters required
by %param and %parse-param */
int lisperror(void *result, void *scanner, char const *msg)
{
(void)result;
(void)scanner;
return fprintf(stderr, "%s\n", msg);
}
/* useful internally by us, and externally by callers */
void sexpr_free(struct sexpr *s)
{
if (!s)
return;
if (s->type == SEXPR_ID)
free(s->value.id);
else if (s->type == SEXPR_PAIR)
{
sexpr_free(s->left);
sexpr_free(s->right);
}
free(s);
}
The parser does the bulk of the work. We just need to pair it with a scanner that reads atoms and parens.
/* lisp.l */
/* disable unused functions so we don't
get compiler warnings about them */
%option noyywrap nounput noinput
%option noyyalloc noyyrealloc noyyfree
/* change our prefix from yy to lisp */
%option prefix="lisp"
/* use the pure parser calling convention */
%option reentrant bison-bridge
%{
#include "lisp.tab.h"
#define YY_EXIT_FAILURE ((void)yyscanner, EXIT_FAILURE)
/* XOPEN for strdup */
#define _XOPEN_SOURCE 600
#include <limits.h>
#include <stdlib.h>
#include <string.h>
/* seems like a bug that I have to do this, since flex
should know prefix=lisp and match bison's LISPSTYPE */
#define YYSTYPE LISPSTYPE
#include <stdio.h>
%}
%%
[[:alpha:]][[:alnum:]]* {
/* The memory that yytext points to gets overwritten
each time a pattern matches. We need to give the caller
a copy. Also, if strdup fails and returns NULL, it's up
to the caller (the parser) to detect that.
Notice yylval is a pointer to union now. It's passed
as an arg to yylex in pure parsing mode */
yylval->str = strdup(yytext);
return ID;
}
[-+]?[[:digit:]]+ {
long n = strtol(yytext, NULL, 10);
if (n < INT_MIN || n > INT_MAX)
lisperror("Number out of range");
yylval->num = (int)n;
return NUM;
}
[[:space:]] ; /* ignore */
/* this is a handy rule to return the ASCII value
of any other character. Importantly, parens */
. { return *yytext; }
Finally, here’s how to call the parser from a regular program.
/* driver_lisp.c */
#include <stdio.h>
#include <stdlib.h>
#define YYSTYPE LISPSTYPE
#include "lisp.tab.h"
#include "lisp.lex.h"
void sexpr_print(struct sexpr* s, unsigned depth)
{
for (unsigned i = 0; i < depth; i++)
printf(" ");
switch (s->type)
{
case SEXPR_ID:
puts(s->value.id);
break;
case SEXPR_NUM:
printf("%d\n", s->value.num);
break;
case SEXPR_PAIR:
puts(".");
sexpr_print(s->left, depth+1);
sexpr_print(s->right, depth+1);
break;
case SEXPR_NIL:
puts("()");
break;
default:
abort();
}
}
int main(void)
{
int i;
struct sexpr *expr;
yyscan_t scanner;
if ((i = lisplex_init(&scanner)) != 0)
exit(i);
int e = lispparse(&expr, scanner);
printf("Code = %d\n", e);
if (e == 0 /* success */)
{
sexpr_print(expr, 0);
sexpr_free(expr);
}
lisplex_destroy(scanner);
return 0;
}
To build it, use the Makefile pattern from roman to create analogous
lisp.lex.o and lisp.tab.o. This example requires Flex and Bison, so set
LEX=flex and YACC=bison at the top of the Makefile to override whatever
system defaults are used for these programs. Finally, compile driver_lisp.c
and link with those object files.
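Here’s one way to write that Makefile, as an untested sketch. Note flex’s --header-file option, which generates the lisp.lex.h that driver_lisp.c includes:
LEX = flex
YACC = bison
driver_lisp : driver_lisp.o lisp.tab.o lisp.lex.o
	$(CC) -o $@ driver_lisp.o lisp.tab.o lisp.lex.o
driver_lisp.o : lisp.tab.h lisp.lex.h
lisp.tab.h lisp.tab.c : lisp.y
	$(YACC) -d -b lisp lisp.y
lisp.lex.o : lisp.tab.h
lisp.lex.c lisp.lex.h : lisp.l
	$(LEX) --header-file=lisp.lex.h -t lisp.l > lisp.lex.c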
Here’s the program in action:
$ echo "(1 () (2 . 3) (4))" | ./driver_lisp
Code = 0
.
1
.
()
.
.
2
3
.
.
4
()
()
Designing against an RFC
Internet Request For Comment (RFC) documents describe the syntax of many protocols and data formats. They often include complete Augmented Backus-Naur Form (ABNF) grammars, which we can convert into robust yacc parsers.
Let’s examine RFC 4180, which describes the comma-separated value (CSV) format. It’s pretty simple, but has problematic edge cases: commas in quoted values, quoted quotes, raw newlines in quoted values, and blank-as-a-value.
Here’s the full grammar from the RFC. Notice how alternatives are specified
with “/” rather than “|”, and how ABNF has the constructions
*(zero-or-more-things) and [optional-thing]:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE = %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
The grammar makes no distinction between lexing and parsing, although the
uppercase identifiers hint at lexer tokens. While it may be tempting to
translate to yacc top-down, starting at the file level, I’ve found the most
productive way is to start with lexing.
We can combine most of the grammar into two lex rules to match fields:
%%
\"([^"]|\"\")*\" {
/* this is what the ABNF calls "escaped" */
/* TODO: copy un-escaped internals to yylval */
return FIELD;
}
[^",\r\n]+ {
/* This is *almost* what the ABNF calls "non-escaped,"
except it won't match an empty field, like
a,,b
^---- this
Actually, even if we tried matching an empty string,
the comma or crlf would prove a longer match and
trump this one.
*/
/* TODO: capture the value to yylval */
/* no need to bother yacc with two token types, we
call them both FIELD. */
return FIELD;
}
/* handle both UNIX and DOS style, per the spec */
\n|\r\n { return CRLF; }
/* catch the comma, and any other unexpected thing */
. { return *yytext; }
With FIELD out of the way, here’s what’s left to translate:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
Let’s also drop the designation of the first row as the “header.” The application can choose to treat the first ordinary row as a header if desired. This simplifies the grammar to:
file = record *(CRLF record) [CRLF]
record = field *(COMMA field)
At this point it’s easy to convert to yacc.
%token CRLF FIELD
%%
file :
record
| file CRLF record
;
record :
field.opt
| record ',' field.opt
;
/* Here is where we handle the potentially blank
non-escaped FIELD. The ".opt" suffix doesn't mean
anything to yacc, it's just a reminder for us that
this *may* match a FIELD, or nothing at all */
field.opt :
/* empty */
| FIELD
;
Matching blank fields is tricky. There are three fields in a,,b, no way
around it. That means we have to identify some value (either a non-terminal
symbol, or a terminal token) out of thin air between characters of input. As
a corollary, given that we have to honor blank fields as existing, we’re forced
to interpret e.g. a 0-byte file as one record with a single blank field.
We handled the situation with an empty yacc rule in field.opt. Empty rules
allow the parser to reduce when it sees unexpected lookahead tokens. Perhaps
it’s also possible to use fancy tricks in the lexer (like trailing context and
start conditions) to match empty non-escaped fields. However, I think an
empty parser rule is more elegant.
Three notes about empty rules:
- We wrote the empty rule in a way that plain yacc can understand. If you want to use a Bison extension, you can write empty rules as %empty, which distinguishes them from accidentally missing rules (see the sketch after this list).
- Bison’s --graph visualization doesn’t render empty rules properly. Use the -v option and examine the textual .output file to see the rule.
- Adding multiple empty rules can be a common source of reduce/reduce conflicts. I ran into this with early experiments in parsing CSV, and the Bison manual section 5.6 provides a great example.
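For instance, the Bison spelling of our optional field rule would look like this sketch:
field.opt :
%empty
| FIELD
;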
Now that we’ve seen the structure of the grammar, let’s fill in the skeleton to process the CSV content. From now on, examples in this article will use my libderp library for basic data structures like maps and vectors.
/* csv.l */
%{
#define _XOPEN_SOURCE 600
#include <stdlib.h>
#include <string.h>
/* the union in csv.tab.h requires the vector type, and
plain yacc doesn't have "%code requires" to provide
the include like Bison, so we include derp/vector.h */
#include <derp/vector.h>
#include "csv.tab.h"
%}
%%
\"([^"]|\"\")*\" {
/* yyleng is precomputed strlen(yytext) */
size_t i, n = yyleng;
char *s;
s = yylval.str = calloc(n, 1);
if (!s)
return FIELD;
/* copy yytext, changing "" to " */
for (i = 1 /*skip 0="*/; i < n-1; i++)
{
*s++ = yytext[i];
if (yytext[i] == '"')
i++; /* skip second one */
}
return FIELD;
}
[^",\r\n]+ { yylval.str = strdup(yytext); return FIELD; }
\n|\r\n { return CRLF; }
. { return *yytext; }
The complete parser below combines values from the lexer into full records, using the vector type. It then prints each record and frees it.
/* csv.y (plain yacc) */
%{
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
/* for the vector datatype and v_ functions */
#include <derp/vector.h>
/* for helper function derp_free */
#include <derp/common.h>
int yylex(void);
int yyerror(const char *s);
bool one_empty_field(vector *);
%}
%union
{
char *str;
vector *record;
}
%token CRLF
%token <str> FIELD
%type <str> field.opt
%type <record> record
/* in bison, add this:
%destructor { free($$); } <str>
%destructor { v_free($$); } <record>
*/
%%
file :
consumed_record
| file CRLF consumed_record
;
/* A record can be constructed in two ways, but we want to
run the same side effect for either case. We add an
intermediate non-terminal symbol "consumed_record" just
to perform the action. In library code, this would be a
good place to send the record to a callback function. */
consumed_record :
record {
/* a record comprised of exactly one blank field is a
blank record, which we can skip */
if (!one_empty_field($1))
{
size_t n = v_length($1);
printf("#fields = %zu\n", n);
for (size_t i = 0; i < n; i++)
printf("\t%s\n", (char*)v_at($1, i));
}
v_free($1);
}
;
record :
field.opt {
/* In our earlier example, lisp.y, we showed how to check
for memory allocation failure. We skip that here for
brevity. */
vector *r = v_new();
v_dtor(r, derp_free, NULL);
v_append(r, $1);
$$ = r;
}
| record ',' field.opt {
v_append($1, $3);
$$ = $1;
}
;
field.opt :
/* empty */ { $$ = calloc(1,1); }
| FIELD
;
%%
bool one_empty_field(vector *r)
{
return v_length(r) == 1 && *((char*)v_first(r)) == '\0';
}
int yyerror(const char *s)
{
return fprintf(stderr, "%s\n", s);
}
Build it (using the steps shown for earlier examples). You’ll also need to link with libderp version 0.1.0, which you can see how to do in the project readme.
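Assuming libderp is installed where the compiler can find it, the build looks something like this sketch (-ly -ll supply main() and yywrap(), as in the roman Makefile):
lex -t csv.l > csv.lex.c
yacc -d -b csv csv.y
cc -o csv csv.tab.c csv.lex.c -lderp -ly -ll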
Next, verify with test cases:
# https://en.wikipedia.org/wiki/Comma-separated_values#Example
$ ./csv << EOF
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
EOF
#fields = 5
Year
Make
Model
Description
Price
#fields = 5
1997
Ford
E350
ac, abs, moon
3000.00
#fields = 5
1999
Chevy
Venture "Extended Edition"
4900.00
#fields = 5
1999
Chevy
Venture "Extended Edition, Very Large"
5000.00
#fields = 5
1996
Jeep
Grand Cherokee
MUST SELL!
air, moon roof, loaded
4799.00
# extra testing for empty fields before crlf and eof
$ printf ",\n," | ./csv#fields = 2
#fields = 2
Parsing a more complicated RFC
IRCv3 extends the Internet Relay Chat (IRC) protocol with useful features. Its core syntactical change to support new features is message tagging. We’ll write a parser to extract information from RFC 1459 messages, including IRCv3 tags.
The BNF from this standard is written in a slightly different dialect than that of the CSV RFC.
<message> ::= ['@' <tags> <SPACE>] [':' <prefix> <SPACE> ] <command> [params] <crlf>
<tags> ::= <tag> [';' <tag>]*
<tag> ::= <key> ['=' <escaped_value>]
<key> ::= [ <client_prefix> ] [ <vendor> '/' ] <key_name>
<client_prefix> ::= '+'
<key_name> ::= <non-empty sequence of ascii letters, digits, hyphens ('-')>
<escaped_value> ::= <sequence of zero or more utf8 characters except NUL, CR, LF, semicolon (`;`) and SPACE>
<vendor> ::= <host>
<host> ::= see RFC 952 [DNS:4] for details on allowed hostnames
<prefix> ::= <servername> | <nick> [ '!' <user> ] [ '@' <host> ]
<nick> ::= <letter> { <letter> | <number> | <special> }
<command> ::= <letter> { <letter> } | <number> <number> <number>
<SPACE> ::= ' ' { ' ' }
<params> ::= <SPACE> [ ':' <trailing> | <middle> <params> ]
<middle> ::= <Any *non-empty* sequence of octets not including SPACE
or NUL or CR or LF, the first of which may not be ':'>
<trailing> ::= <Any, possibly *empty*, sequence of octets not including
NUL or CR or LF>
<user> ::= <nonwhite> { <nonwhite> }
<letter> ::= 'a' ... 'z' | 'A' ... 'Z'
<number> ::= '0' ... '9'
<crlf> ::= CR LF
As before, it’s helpful to start from the bottom up, applying the power of lex regexes. However, we run into the problem that most of the tokens match almost anything. The same string could conceivably be a host, nick, user, key_name, and command all at once. Lex would match the string with whichever rule comes first in the grammar.
Yacc can’t easily pass lex any clues about what tokens it expects, given what tokens have come before. Lex is on its own. For this reason, the designers of lex gave it a way to keep a memory. Rules can be tagged with a start condition, saying they are eligible only in certain states. Rule actions can then enter new states prior to returning.
/* Incomplete irc.l, showing start conditions and patterns.
This lexer produces the following tokens:
SPACE COMMAND MIDDLE TRAILING TAG PREFIX ':' '@' ';'
*/
/* It's nice to prefix the regex names with "re_"
to see them better in the rules */
re_space [ ]+
re_host [[:alnum:]][[:alnum:]\.\-]*
re_nick [[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*
re_user [~[:alpha:]][[:alnum:]]*
re_keyname [[:alnum:]\-]+
re_keyval [^ ;\r\n]*
re_command [[:alpha:]]+|[[:digit:]]{3}
re_middle [^: \r\n][^ \r\n]*
re_trailing [^\r\n]*
/* Declare start conditions. The "%x" means
they are exclusive, vs "%s" for inclusive. */
%x IN_TAGS IN_PREFIX IN_PARAMS
%%
/* these patterns are not tagged with a start
condition, and are active in the default state
of INITIAL. They will match only when none of
the exclusive conditions are active. They
*would* match on inclusive states (but we have
none).
The BEGIN command changes state. */
@ { BEGIN IN_TAGS; return *yytext; }
: { BEGIN IN_PREFIX; return *yytext; }
{re_space} { return SPACE; }
{re_command} {
/* TODO: construct yylval */
BEGIN IN_PARAMS;
return COMMAND;
}
/* these patterns will only match IN_TAGS, which
as we saw earlier, gets activated from the
INITIAL state when "@" is encountered */
<IN_TAGS>\+?({re_host}\/)?{re_keyname}(={re_keyval})? {
/* TODO: construct yylval */
return TAG;
}
<IN_TAGS>{re_space} {
BEGIN INITIAL;
return SPACE;
}
<IN_TAGS>; { return ';'; }
<IN_PREFIX>({re_host})|({re_nick})(!{re_user})?(@{re_host})? {
/* TODO: construct yylval */
BEGIN INITIAL;
return PREFIX;
}
<IN_PARAMS>{re_space} { return SPACE; }
<IN_PARAMS>{re_middle} {
/* TODO: construct yylval */
return MIDDLE;
}
<IN_PARAMS>:{re_trailing} {
/* TODO: construct yylval */
BEGIN INITIAL;
return TRAILING;
}
/* the "*" state applies to all states,
including INITIAL and the exclusive ones */
<*>\n|\r\n ; /* ignore */We’ll revisit the lexer to fill in details for assigning yylval. First, let’s see the parser and its data types.
/* irc.y (Bison only)
Using Bison mostly for the %code positions, making
it easier to use libderp between flex and bison.
- WARNING -
There is absolutely no memory hygiene in this example.
We don't check for allocation failure, and we don't free
things when done. See the earlier lisp.y/.l examples
for guidance about that.
*/
/* output more descriptive messages than "syntax error" */
%define parse.error verbose
%code top {
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
}
%code requires {
#include <derp/list.h>
#include <derp/treemap.h>
struct prefix
{
char *host;
char *nick;
char *user;
};
/* building an irc_message is the overall
goal for this parser */
struct irc_message
{
treemap *tags;
struct prefix *prefix;
char *command;
list *params;
};
}
%code provides {
int yyerror(char const *msg);
int yylex(void);
void message_print(struct irc_message *m);
}
%union
{
char *str;
struct prefix *prefix;
treemap *map;
struct map_pair *pair;
list *list;
struct irc_message *msg;
}
%token SPACE
%token <str> COMMAND MIDDLE TRAILING
%token <pair> TAG
%token <prefix> PREFIX
%type <msg> message tagged_message prefixed_message
%type <map> tags
%type <list> params
%%
/* Like in the CSV example, we start with a dummy
rule just to add side-effects */
final :
tagged_message { message_print($1); }
;
/* Messages begin with two optional components,
a set of tags and a prefix.
<message> ::= ['@' <tags> <SPACE>] [':' <prefix> <SPACE> ] <command> [params]
Rather than making a single message rule with
tons of variations (and duplicated code), I chose
to build the message in stages.
tagged_message <- prefixed_message <- message
A prefixed_message adds prefix information, or
passes the message along verbatim if there is none.
Similarly for tagged_message. */
tagged_message :
/* When a rule matches more than one token,
it's helpful to add Bison "named references"
in brackets. Thus, below, the rule can refer to
$ts rather than $2, or $msg rather than $4.
Makes it way easier to rearrange tokens while
you're experimenting. */
'@' tags[ts] SPACE prefixed_message[msg] {
$msg->tags = $ts;
$$ = $msg;
}
/* here's the pass-through case when there are
no tags on the message */
| prefixed_message
;
prefixed_message :
':' PREFIX[pfx] SPACE message[msg] {
$msg->prefix = $pfx;
$$ = $msg;
}
| message
;
message :
COMMAND[cmd] params[ps] {
struct irc_message *m = malloc(sizeof *m);
*m = (struct irc_message) {
.command=$cmd, .params=$ps
};
$$ = m;
}
;
tags :
TAG {
treemap *t = tm_new(derp_strcmp, NULL);
tm_insert(t, $1->k, $1->v);
$$ = t;
}
| tags[ts] ';' TAG[t] {
tm_insert($ts, $t->k, $t->v);
$$ = $ts;
}
;
params :
SPACE TRAILING {
$$ = l_new();
l_prepend($$, $2);
}
| SPACE MIDDLE[mid] params[ps] {
l_prepend($ps, $mid);
$$ = $ps;
}
| %empty {
$$ = l_new();
}
;
%%
int yyerror(char const *msg)
{
return fprintf(stderr, "%s\n", msg);
}
void message_print(struct irc_message *m)
{
if (m->tags)
{
struct tm_iter *it = tm_iter_begin(m->tags);
struct map_pair *p;
puts("Tags:");
while ((p = tm_iter_next(it)) != NULL)
printf("\t'%s'='%s'\n", (char*)p->k, (char*)p->v);
tm_iter_free(it);
}
if (m->prefix)
printf("Prefix: Nick %s, User %s, Host %s\n",
m->prefix->nick, m->prefix->user,
m->prefix->host);
if (m->command)
printf("Command: %s\n", m->command);
if (!l_is_empty(m->params))
{
puts("Params:");
for (list_item *li = l_first(m->params); li; li = li->next)
printf("\t%s\n", (char*)li->data);
}
}
Returning to the lexer, here is the code with all the details filled in to construct yylval for the tokens.
/* irc.l - complete file */
%option noyywrap nounput noinput
%{
#include "irc.tab.h"
#define _XOPEN_SOURCE 600
#include <limits.h>
#include <stdlib.h>
#include <string.h>
%}
re_space [ ]+
re_host [[:alnum:]][[:alnum:]\.\-]*
re_nick [[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*
re_user [~[:alpha:]][[:alnum:]]*
re_keyname [[:alnum:]\-]+
re_keyval [^ ;\r\n]*
re_command [[:alpha:]]+|[[:digit:]]{3}
re_middle [^: \r\n][^ \r\n]*
re_trailing [^\r\n]*
%x IN_TAGS IN_PREFIX IN_PARAMS
%%
@ { BEGIN IN_TAGS; return *yytext; }
: { BEGIN IN_PREFIX; return *yytext; }
{re_space} { return SPACE; }
{re_command} {
yylval.str = strdup(yytext);
BEGIN IN_PARAMS;
return COMMAND;
}
<IN_TAGS>\+?({re_host}\/)?{re_keyname}(={re_keyval})? {
struct map_pair *p = malloc(sizeof *p);
char *split = strchr(yytext, '=');
if (split)
*split = '\0';
*p = (struct map_pair){
.k = strdup(yytext),
.v = split ? strdup(split+1) : calloc(1,1)
};
yylval.pair = p;
return TAG;
}
<IN_TAGS>{re_space} {
BEGIN INITIAL;
return SPACE;
}
<IN_TAGS>; { return ';'; }
<IN_PREFIX>({re_host})|({re_nick})(!{re_user})?(@{re_host})? {
struct prefix *p = malloc(sizeof *p);
if (!p)
goto done;
*p = (struct prefix){0};
char *bang = strchr(yytext, '!'),
*at = strchr(yytext, '@');
if (!bang && !at)
{
p->host = strdup(yytext);
goto done;
}
if (bang) *bang = '\0';
if (at) *at = '\0';
p->nick = strdup(yytext);
if (bang)
p->user = strdup(bang+1);
if (at)
p->host = strdup(at+1);
done:
yylval.prefix = p;
BEGIN INITIAL;
return PREFIX;
}
<IN_PARAMS>{re_space} { return SPACE; }
<IN_PARAMS>{re_middle} {
yylval.str = strdup(yytext);
return MIDDLE;
}
<IN_PARAMS>:{re_trailing} {
yylval.str = strdup(yytext+1); /* trim : */
BEGIN INITIAL;
return TRAILING;
}
<*>\n|\r\n ; /* ignore */Build irc.y and irc.l according to our typical pattern (and link with libderp). Here’s an example of the IRCv3 parser in action:
# Try an example from
# https://ircv3.net/specs/extensions/message-tags#examples
$ ./irc <<EOF
@aaa=bbb;ccc;example.com/ddd=eee :[email protected] PRIVMSG me :Hello
EOF
Tags:
'aaa'='bbb'
'ccc'=''
'example.com/ddd'='eee'
Prefix: Nick nick, User ident, Host host.com
Command: PRIVMSG
Params:
me
Hello
Further resources
- POSIX (issue 7) specifications for Lex and Yacc. (To view POSIX docs locally, try begriffs/posix-man.)
- Lex & Yacc, 2nd ed by John R. Levine, Tony Mason, Doug Brown. Levine subsequently wrote an updated book called flex & bison: Text Processing Tools. However I got the older version to get a better feel for history and portability.
- To bridge the gap between core knowledge and the latest features, consult the GNU Bison manual and the Flex manual. (You can build the Flex manual from source, or download version 2.6.4 that I’ve pre-built for you as PDF.)
- Effective Flex & Bison by Chris L. verBurg is a collection of tips for “correctness, efficiency, robustness, complexity, maintainability and usability.” It’s clear Chris has plenty of experience writing real-world parsers.
- Vim has classic yacc highlighting built in, but you can add support for Bison extensions with justinmk/vim-syntax-extra.
Written by Joe "begriffs" Nelson. [email protected] 🔑