Stardust | Starbeamrainbowlabs

Compilers 101: Build your own flex + bison compiler in a few easy(?) steps

So you want to build your own compiler? Great! Don't know where to start? This guide should help! At University, we're building our own compiler for a custom programming language invented by our lecturer with a pair of GNU tools by the name of flex and bison - which I've blogged about before. Since that post, I've learnt a ton about how the whole process works, so I thought I'd write up a more coherent blog post on the subject :P

A diagram explaining the process of building a compiler. Explained below.

Stage 1: Planning

The whole process starts with railroad diagrams (also known as flowcharts) of the language you want to write a compiler for. Having an accurate set of railroad diagrams is essential to understanding precisely how the language is put together, which is rather useful for the next step.

Converting the railroad diagrams into plain BNF (Backus Naur Form). Unfortunately, bison doesn't support EBNF-like notation at the current time, so only plain-old BNF will do.

Stage 2: Lexing

With your railroad diagrams converted into BNF, you can start writing code! The first chunk of code that needs writing is the lexer. Lexing is what flex is good at - and involves converting the input source code into lexemes - discrete sequences of characters that match a particular pattern and can be assigned a particular category name, turning it into a token. Perhaps an example would help. Consider the following:

void do_awesome_stuff(int a, string b) {
    /* Code here */
}

The above can be turned into a sequence of tokens, not unlike the following (ignoring whitespace tokens, of course):

TYPE: void
IDENTIFIER: do_awesome_stuff
OPEN_BRACKET: (
TYPE: int
IDENTIFIER: a,
COMMA: ,
TYPE: string
IDENTIFIER: b
CLOSE_BRACKET: )
OPEN_BRACE: {
COMMENT: /* Code here */*
CLOSE_BRACE: }

See? We can identify 8 token types in the source string: TYPE, IDENTIFIER, COMMA, OPEN_BRACKET, CLOSE_BRACKET, OPEN_BRACE, COMMENT, and CLOSE_BRACE. These types and the rules to match them can be found by analysing a combination of the railroad diagrams and the BNF you created earlier.

Stage 3: Parser the first

With a lexer in hand, we can now look at writing the parser. This is done in two stages. The parser itself, and upgrading said parser to generate a parse tree.

Let's talk about the parser first. The parser can be largely created simply by running a few regular-expression find and replace rules on your BNF, actually. From there, it's just a case of adding the header and the footer to complete the document.

Let's take a look at some example BNF:

<instructions> ::= START <lines> END

<lines> ::= <lines> <line>
    | <line>

<line> ::= <command>

<command> ::= <cmd_name> <number>

<cmd_name> ::= FD
    | BK
    | LT
    | RT

<number> ::= <number> <digit>
    | <digit>

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The above matches something like the following:

START
FD 100
RT 180
FD 125
LT 90
BK 50
END

Very interesting (a virtual cookie is available for anyone who gets the reference as to what this grammar represents!). Let's look at converting that into something bison can understand:

instructions : START lines END
    ;

lines : lines line
    | line
    ;

line : command
    ;

command : cmd_name number
    ;

cmd_name : FD
    | BK
    | LT
    | RT
    ;

number : number digit
    | digit
    ;

digit : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

That's looking much better already! Simply by using the regular expression substitutions:

<([a-z_]+)> -> $1
::= -> :
\n\n -> \n\t;\n\n

....we can get most of the way there to something that bison can understand. Next, we need to refactor it a bit to tell it which tokens are coming from the lexer (which I'll leave up to you to write as an exercise), so it doesn't get them confused with the rules - which are defined in the BNF(-like) rules.

Let's get rid of the rule for number and digit first, since we can do those in the lexer quite easily. Next, let's add a %token definition to the top to tell it which are coming from the lexer. It's good practice to define everything that comes from the lexer in uppercase, and everything that's a rule that exists only in bison in lowercase:

%start instructions
%token FD BK LT RT NUMBER START END

We've also defined a start symbol - the one which when bison reaches it, it knows that it's completed the parsing process, as bison is a bottom-up parser.

Lastly, we need to reference the lexer itself. Thankfully that's easy too by appending to your new bison file:

%%

#include "lex.yy.c"

Very nice. Don't forget about the new line at the end of the file - flex and bison will complain if it isn't present! Here's the completed bison file:

%start instructions
%token FD BK LT RT NUMBER START END

instructions : START lines END
    ;

lines : lines line
    | line
    ;

line : command
    ;

command : cmd_name NUMBER
    ;

cmd_name : FD
    | BK
    | LT
    | RT
    ;

%%

#include "lex.yy.c"

With a brand-new bison file completed, there's just one component of the parser left - a plain-old C file that calls it. Let's create one of those quickly:


#include 

int yyparse(void);

int main(void)
{       
    # if YYDEBUG == 1
    extern int yydebug;
    yydebug = 1;
    #endif

    return yyparse();
}

void yyerror(char *error_message)
{
    fprintf(stderr, "Error: %s\nExiting\n", error_message);
}

The highlighted lines enable a special debugging mode built-in to bison if the standard compile-time symbol YYDEBUG is specified - and bison is run with a few special parameters. Here's the sequence of commands needed to compile this:

flex lexer.l
bison -tv parser.y
gcc -Wall -Wextra -g parser.tab.c main.c -lfl -ly -DYYDEBUG -D_XOPEN_SOURCE=700

The gcc command is probably a bit long-winded, but it does several useful things for us:

Shows additional warnings just in case we've made a mistake that might be an issue later (-Wall -Wextra)
Include additional debugging information in the output file to allow debugging with gdb (the GNU Debugger) if necessary (-g)
Fix strange errors on some systems (-D_XOPEN_SOURCE=700)

If you're on a Windows system, you may need to remove the -ly - which appears to be required on the Linux machines I use - it tells gcc that we'll be referencing the bison library.

Stage 4: Parser again

Congratulations on getting this far! You've now got a lexer and a parser - so it's time to put them to use. This is done by utilising the parser to build a parse tree - a tree of nodes that represent the input. Here's an example tree:

An example parse tree.

As you can see, each high-level node references one or more lower-level nodes, and the structure of the tree represents the first 2 lines of the example input above. The nodes in yellow are lexical tokens that come directly from flex - these are called terminals, or leaf nodes. The ones in purple come from the bison rules (which we derived from the BNF we wrote at the beginning of this post) - and are non-terminals, or tree nodes.

With this in mind, let's introduce another feature or two of bison. Firstly, let's take a look at revising that %token declaration we created above:

%token FD BK LT RT START END
%token<val_num> NUMBER

The important bit here is the <val_num>. Here we tell bison that a value should be attached to the token NUMBER - and that it will be of type int. After telling bison that it should expect a value, we need to give it a place to put it. Let's write some more code to go just below the %token declarations:

%union {
    int val_num;
}

There we go! Excellent - we've got a place to put it. Now we just need to alter the lexer to convert the token value to an int and put it there. That's not too tough, thankfully - but if you're having trouble with it, here's a hint:

{number}        { yylval.val_num = atoi(yytext); return(NUMBER); }

Now we have it passing numbers correctly, let's look briefly at generating that parse tree. I'm not going to give the game away - just a few helpful hints as to what you need to do here - otherwise it's not as fun :P

Generating the parse tree can be considered the both the most challenging part of the experience (especially if you don't know what you're doing) and the easiest to deal with at same time. Knowing your stuff and your end goal before you start makes the whole process a lot easier.

The first major step is to create a struct that can represent a type of node in your parse tree. It might be useful to store several properties here - such as the node type (An enum will come in handy here), a spot for the value of a lexical token (or a reference to it in a symbol table if you have one), and references to child nodes in the parse tree.

The second major step of the process is to create a utility method that creates a new node of the tree on the heap, and then revise the bison file to get each rule to create new nodes on the tree in such a way that it creates a parse tree when it reaches start node (or top node of the tree - which, in the case of the above, is instructions). For the purposes of this post, I'll be using a method with this prototype:

TreeNode create_node(int item, int node_type, TreeNode left, TreeNode right);

Your tree node struct and subsequent creation method may vary. With this in hand, we can revise the bison rules we created above to create these nodes we've been talking about. Here's a quick pointer on how to revise the rule for command above:

command : cmd_name NUMBER   { $$ = create_node($2, NODE_COMMAND, $1, NULL); }
    ;

This might look a bit strange, but let's break it down. The bit in curly braces is some (almost) plain C code that creates the node and returns a pointer to it to bison. The $$ is the return value for that node - which, I might add reminds me of something I forgot above. We need to tell bison about our new tree node data type and which rules should return it:

%type<val_tnode> instructions lines line command cmd_name

/* And in %union { ... } ..... */
TreeNode val_tnode;

This is almost the same as the %token<val_num> we did before, but we're defining the return value of a rule this type - not a token. With that little interlude out of the way, let's return to the code above. $1 and $2 refer to the first and second items in the rule definition respectively - and hold the type that we defined above in the %token and %type directives. Since bison is a bottom-up parser - this means that by the time this code executes, all it's child nodes have (hopefully) been created - and we just have to tie them all up together with a new node. In the case of my little example above, $1 is of type TreeNode, and $2 is of type int (that is if I didn't make any mistakes further up!).

Stage 5: Blasting off to code generation and beyond

Phew! That's a lot of work. If you've read this far, thank you and well done! It's been a long journey for both you the reader and me the writer, but you're almost done.

While conceptually simple when broken down, the whole process actually gets rather involved - especially when writing the BNF and the parser (the latter of which can be a particular pain due to shift/reduce and reduce/reduce errors), and the amount of code and head scratching you've got to do to get to this point is enormous. My best advice is to take it slow and don't rush - you'll only cause most problems for yourself if you try and jump the gun. Make sure that each stage works as you intend before you continue - back-pedalling to fix an issue can be particularly annoying as it can be bothersome to work out which stage the bug is actually in.

The last step of the whole process is to actually do something with the parse tree we've worked so hard to create. Thankfully, that's not too difficult - as we can put some additional code in the { } block of the starting symbol to call methods that will do things like perform some optimisation, print the tree to the console - or generate some sweet code. While the actual generation of code is beyond the scope of this article, I may end up posting about some optimisation techniques you can use on a parse tree after I've finished fiddling with float handling, symbol tables, and initial code generation in my ACW (Assessed Course Work).

Found this useful? Found a bug in this post? Got a suggestion? Comment below! Since I don't have any real analytics on this blog besides the server logs, I've no idea if you've read it really unless you comment :P

Sources and Further Reading

Main source: My own experience gained from my Languages and Compilers source I'm currently doing at University
Bison tutorial by Lan Gao
Flex tutorial by Lan Gao
A minimal tutorial on flex by YZU Design Automation Laboratory