Compilers, VMs, and JIT: Spot the difference

It's about time for another demystification post, I think :P This time, I'm going to talk about Compilers, Virtual Machines (VMs), and Just-In-Time Compilation (JIT) - and the way that they are both related and yet different.

Compilers

To start with, a compiler is a program that converts a program written in one language into another (usually lower-level) language. For example, gcc compiles C++ into native machine code.

A compiler usually has both a backend and a frontend. The frontend is responsible for the lexing and parsing of the source programming language. It's the part of the compiler that generates compiler warnings and errors. The frontend outputs an abstract syntax tree (or parse tree) that represents the source input program.

The backend then takes this abstract syntax tree, walks it, and generates code in the target output language. Such code generators are generally recursive in nature. Depending on the compiler toolchain, this may or may not be the last step in the build process.

gcc, for example, generates code to object files - which are then strung together by a linking program later in the build process.

Additional detail on the structure of a compiler (and how to build one yourself!) is beyond the scope of this article, but if you're interested I recommend reading my earlier Compilers 101 post.

Virtual Machines

A virtual machine comes in several flavours. Common to all types is the ability to execute instructions - through software - as if they were being executed on a 'real' hardware implementation of the programming language in question.

Examples here include:

  • KVM - The Linux virtual machine engine. Leverages hardware extensions to support various assembly languages that are implemented in real hardware - everything from Intel x86 Assembly to various ARM instruction sets. Virtual Machine Manager is a good GUI for it.
  • VirtualBox - Oracle's offering. Somewhat easier to use than the above - and cross-platform too!
  • .NET / Mono - The .NET runtime is actually a virtual machine in its own right. It executes Common Intermediate Language in a managed environment.

It's also important to note the difference between a virtual machine and an emulator. An emulator is very similar to a virtual machine - except that it doesn't actually implement a language or instruction set itself. Instead, it 'polyfills' hardware (or environment) features that don't exist in the target runtime environment. Great examples here are WINE and Parallels, which allow programs written for Microsoft Windows to be run on other platforms without modification.

Just-In-Time Compilation

JIT is sort of a combination of the above. The .NET runtime (officially known as the Common Language Runtime, or CLR) is an example of this too - in that it compiles the CIL in the source assembly to native code just before execution.

Utilising such a mechanism does result in an additional delay during startup, but usually pays for itself in the vast performance improvements that can be made over an interpreter - as a JITting VM can automatically optimise code as it's executing to increase performance in real-time.

This is different to an interpreter, which reads a source program line-by-line - parsing and executing it as it goes. If it needs to go back to another part of the program, it will usually have to re-parse the code before it can execute it.

Conclusion

In this post, we've taken a brief look at compilers, VMs, and JIT. We've looked at the differences between them - and how they are different from their siblings (emulators and interpreters). In many ways, the line between the 3 can become somewhat blurred (hello, .NET!) - so learning the characteristics of each is helpful for disentangling different components of a program.

If there's anything I've missed - please let me know! If you're still confused on one or more points, I'm happy to expand on them if you comment below :-)

Found this useful? Spotted a mistake? Comment below!

Shift-Reduce Parser Part 2: Building Furniture (1)

Hello and welcome! I got a bit distracted by other things as you can tell, but I'm back with part 2 of my series on building a shift-reduce parser. If you're not sure what I'm talking about, then I'd advise reading part 1 first and then coming back here. It might be a good idea to re-read it anyway, just to refresh your memory :-)

The same flowchart from last time, but with the parse table section highlighted.

Last time, we created some data classes to store the various rules and tokens that we'll be generating. Today, we're going to build on that and start turning a set of rules into a parse table. Let's introduce the rules we'll be working with:

<start> ::= <expression>

<expression> ::= <expression> PLUS <value>
    | <term>

<term> ::= <term> DIVIDE <value>
    | <value>

<value> ::= <number>
    | BRACKET_OPEN <expression> BRACKET_CLOSE

<number> ::= DIGIT
    | <number> DIGIT

The above represents a very basic calculator-style syntax, which only supports adding and dividing. It's written in Backus-Naur Form, which is basically a standardised way of writing parsing rules.
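
To make that a little more concrete, here's a rough sketch (my own illustration, not code from this series) of how the first few of those rules might be written down using the ParserToken and ParserRule classes from part 1 - the empty matching actions are just placeholders:

// Hedged sketch: encoding the first few grammar rules above with the data
// classes from part 1. The matching actions are left as empty placeholders.
ParserRule[] rules = new ParserRule[] {
    // <start> ::= <expression>
    new ParserRule(() => {}, ParserToken.NonTerm("start"),
        ParserToken.NonTerm("expression")),
    // <expression> ::= <expression> PLUS <value>
    new ParserRule(() => {}, ParserToken.NonTerm("expression"),
        ParserToken.NonTerm("expression"), ParserToken.Term("PLUS"), ParserToken.NonTerm("value")),
    // <expression> ::= <term>
    new ParserRule(() => {}, ParserToken.NonTerm("expression"),
        ParserToken.NonTerm("term"))
    // ...and so on for the remaining rules
};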

To build a parse table, we first must understand what such a thing actually is. Let's take a look at an example:

      | action                   | goto
state |    *    +    0    1    $ |    E    B
------+--------------------------+----------
  0   |             s1   s2     |    3    4
  1   |   r4   r4   r4   r4   r4 |
  2   |   r5   r5   r5   r5   r5 |
  3   |   s5   s6           goal |
  4   |   r3   r3   r3   r3   r3 |
  5   |             s1   s2     |         7
  6   |             s1   s2     |         8
  7   |   r1   r1   r1   r1   r1 |
  8   |   r2   r2   r2   r2   r2 |

_(Source: Adapted from the LR Parser on Wikipedia.)_

While it looks complex at first, let's break it down. There are 3 parts to this table: the state, the action, and the goto. The action and goto represent what should happen if a particular symbol is encountered. In this case, the parser has to deal with both terminal (i.e. DIGIT, DIVIDE, BRACKET_CLOSE, etc. in the case of our BNF above) and non-terminal (i.e. number, term, expression, etc. in the case of our BNF above) symbols - if I understand it correctly, that's why there are actually 2 parts to the table here: to make sure that both are handled correctly.

We'll be connecting this to our lexer, which outputs only terminal symbols, so we should be ok I believe (if you know better, please post a comment below!). The state refers to the state in the table. As I've mentioned before, a given state may contain one or more configurations. It's these configurations that give rise to the actions in the table above, such as s2 (shift, and then go to state 2) or r3 (reduce using rule 3).

To use the table, the parser must know what state it's in, and then take a look across the top row for the next symbol it has in the token stream. Once found, it can follow it down to figure out what action it should take, as explained above. If there isn't an action in the box, then there must be an error in the input, as the table doesn't tell us what to do in this situation. To that end, we should try and generate a meaningful error message to help the user to find the mistake in the input (or the developer in the parser!).
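
As a quick, hedged illustration of that lookup mechanic (this is just a toy of my own, not the representation we'll actually be building in this series), you could model a few entries of the example table above as a dictionary keyed on a (state, symbol) pair:

// Toy sketch of the table lookup described above, using a few entries from the
// example table. A missing entry means a syntax error in the input.
using System;
using System.Collections.Generic;

class TableLookupDemo
{
    static void Main()
    {
        var table = new Dictionary<(int state, string symbol), string> {
            [(0, "0")] = "s1", [(0, "1")] = "s2", [(0, "E")] = "3", [(0, "B")] = "4",
            [(3, "*")] = "s5", [(3, "+")] = "s6", [(3, "$")] = "goal",
            // ...remaining entries omitted for brevity
        };

        int currentState = 0;
        string nextSymbol = "1";

        if (table.TryGetValue((currentState, nextSymbol), out string action))
            Console.WriteLine($"State {currentState}, next symbol {nextSymbol}: {action}");
        else
            Console.WriteLine($"Syntax error: no action for {nextSymbol} in state {currentState}");
    }
}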

We're kind of getting ahead of ourselves here though. We need to build this table first, and to do that, we need to figure out which configurations go in which state. And, going down the rabbit hole, we need to know what a configuration is. Again, it's best if I demonstrate. Consider the following parsing rule from our example BNF at the beginning of this post:

<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE

A single configuration represents a possible state of the parser at a particular instant in time. I could split the above rule up like so:

<value> ::= BRACKET_OPEN * <expression> BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> * BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE *

The asterisk represents where the parser might have gotten up to. Everything to the left is on the stack of the parser, and everything to the right hasn't happened yet.

Since this isn't a top-level rule (in our example that honour goes to the rule for the start non-terminal), the parser will never be in a position where the first item there doesn't exist yet on the stack, so we can ignore the configuration in which the asterisk would be to the left of BRACKET_OPEN.
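
In terms of the ParseTableConfiguration class from the last post, each of those split-up rules is simply the rule itself paired with a position in its right-hand-side. Here's a small hedged sketch of my own to illustrate - the empty matching action is just a placeholder:

// Hedged sketch: the 3 configurations above, expressed with the data classes
// from the last post.
ParserRule valueRule = new ParserRule(() => {},
    ParserToken.NonTerm("value"),
    ParserToken.Term("BRACKET_OPEN"),
    ParserToken.NonTerm("expression"),
    ParserToken.Term("BRACKET_CLOSE"));

ParseTableConfiguration afterOpen  = new ParseTableConfiguration(valueRule, 1);
ParseTableConfiguration afterExpr  = new ParseTableConfiguration(valueRule, 2);
ParseTableConfiguration afterClose = new ParseTableConfiguration(valueRule, 3);

Console.WriteLine(afterOpen.TokenAfterDot);  // [ParserToken NT: expression]
Console.WriteLine(afterExpr.TokenBeforeDot); // [ParserToken NT: expression]
Console.WriteLine(afterClose.IsLastInRule);  // True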

Confused? Let me try and help here. Let's draw a diagram of how our parser is going to operate:

_(Source: Made by me, but adapted from the LR Parser article on Wikipedia)_

Basically, the parser will take in the input token stream and either shift a new terminal token onto the stack, or reduce one or more existing tokens on the stack into a single non-terminal token, which replaces those existing tokens on the stack. The configurations above represent possible states of the stack (the bit to the left of the asterisk), and possible directions that the parser could take when parsing (the bit to the right of the asterisk).

Finally, when the goal is reached, the output is returned to the caller (which, by the time we're done, should be a parse tree). Said tree can then be optimised and processed for whatever purpose we desire!

With this knowledge, we can deduce that we can build the entire table by recursing over the tree of rules, starting from the start state. By recursing over all the rules for all the non-terminals referenced by the rules we visit, we'll visit every rule needed to parse everything required to reach the goal state. We could even generate a warning if we detect that some rules don't connect to this 'tree'. Here's a tree of our example ruleset from the beginning of this post:

A tree diagram of the rules detailed near the beginning of this post.

It's a bit spaghetti-ish, but it should be legible enough :P This gives us an idea as to how we're going to tackle this. Taking into account the data classes we created in the last post, we need to make sure we keep the following in mind:

  1. Since the main ShiftReduceParser class is going to hold the rules, the ParseTable class will need a reference to its parent ShiftReduceParser in order to query the rules.
  2. In light of this, the ShiftReduceParser should be responsible for satisfying any queries the ParseTable has about rules - the ParseTable should not have to go looking & filtering the rule list held by ShiftReduceParser itself.
  3. ParseTable will need a recursive method that will take a single top-level rule and recurse over it and its child rules (according to the tree I've talked about above).
  4. This method in ParseTable will need to be extremely careful it doesn't get stuck in a loop. To that end, it'll have to keep track of whether it's already processed a rule or not.
  5. It'll probably also have to keep track of which configurations it has added to the table class structure we defined in the last post to avoid adding rules twice.
  6. Once ParseTable has figured out all the configurations and grouped them all into the right states, it will then have to recurse over the generated table and fill in all the shift / reduce / goal action(s) - not forgetting about the links to the other states they should point to.

It's quite the laundry list! Thankfully, most of this is quite simple if we tackle it one step at a time. The most annoying bit is the grouping of configurations into states. This is done by looking at the token immediately before the asterisk in each configuration - all the configurations with the same token here will get grouped into the same state (while there are more complex algorithms that allow for more complex grammars, we'll stick with this for now as anything else makes my head hurt! Maybe in the future I'll look at figuring out precisely what kind of LR-style parser this is, and upgrading it to be a canonical LR(1) parser - the most advanced type I know of).
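
To give a rough idea of what that grouping step might look like in code, here's a heavily hedged sketch - the real implementation comes in the next post, and both the allConfigurations list and the States property on ParseTable here are assumptions of mine:

// Hedged sketch of the grouping step described above: configurations that share
// the same token immediately before the asterisk end up in the same state.
// Assumes a List<ParseTableConfiguration> called allConfigurations, the
// ParseTable instance being built ('table'), and a hypothetical States list on it.
using System.Linq;

var groups = allConfigurations
    .Where(config => config.RhsPosition > 0) // nothing before the dot yet? handled separately
    .GroupBy(config => config.TokenBeforeDot);

foreach (var group in groups)
{
    ParseTableState state = new ParseTableState(table);
    state.Configurations.AddRange(group);
    table.States.Add(state); // hypothetical States list
}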

This is quite a lot to take in, so I think I'll leave this post here for you to digest - and we'll get to writing some code in the next one.

Found this useful? Spotted a mistake? Having trouble getting your head around it? Post a comment below!

Shift-reduce Parser Part 1: First Steps

Now that I've done the Languages and Compilers module at University, it's opened my eyes to a much better and more extensible way of handling complex text in a way that can easily be read by any subsequent code I write. To that end, I've found that at least 3 different projects of mine could benefit from the inclusion of a shift-reduce parser, but I haven't been able to track one down for C♯ yet.

With this in mind, I thought to myself: "Why not build one myself?" Sure, my Lecturer didn't go into too many details as to how they work, but it can't be that difficult, can it? Oh, I had no idea.....

In this mini-series, I'm going to take you through the process of building a shift-reduce parser of your very own. As I write this, I haven't actually finished mine yet - I've just got to the important milestone of building a parse table! Thankfully, that's going to be a few posts away, as there's a fair amount of ground to cover until we get to that point.

Warning: This series is not for the faint of heart! It's rather complicated, and extremely abstract - making it difficult to get your head around. I've had great difficulty getting mine around it - and ended up writing it in multiple stages. If you want to follow along, be prepared for lots of research, theory, and preparation!

Let's start out by taking a look at what a shift-reduce parser does. If you haven't already, I'd recommend reading my previous compilers 101 post, which explains how to write a compiler, and the different stages involved. I'd also recommend checking out my earlier post on building a lexer, as it ties in nicely with the shift-reduce parser that we'll be building.

An overview of how a shift-reduce parser works.

In short, a shift-reduce parser compiles a set of BNF-style rules into a Parse Table, which it then utilises as a sort of state-machine when parsing a stream of input tokens. We'll take a look at this table compilation process in a future blog post. In this post, let's set up some data structures to help us along when we get to the compilation process in the next blog post. Here's the class structure we'll be going for:

An overview of the class structure we'll be creating in this blog post.

Let's start with a class to represent a single token in a rule:

public enum ParserTokenClass
{
    Terminal,
    NonTerminal
}

public struct ParserToken
{
    public readonly ParserTokenClass Class;
    public readonly string Type;

    public ParserToken(ParserTokenClass inTokenType, string inType)
    {
        Class = inTokenType;
        Type = inType;
    }

    public override bool Equals(object obj)
    {
        ParserToken otherTokenType = (ParserToken)obj;
        return Class == otherTokenType.Class && Type == otherTokenType.Type;
    }
    public override int GetHashCode()
    {
        return $"{Class}:{Type}".GetHashCode();
    }

    public override string ToString()
    {
        string terminalDisplay = Class == ParserTokenClass.Terminal ? "T" : "NT";
        return $"[ParserToken {terminalDisplay}: {Type}]";
    }

    public static ParserToken NonTerm(string inType)
    {
        return new ParserToken(ParserTokenClass.NonTerminal, inType);
    }
    public static ParserToken Term(string inType)
    {
        return new ParserToken(ParserTokenClass.Terminal, inType);
    }
}

Pretty simple! A token in a rule can either be a terminal (basically a token straight from the lexer), or a non-terminal (a token that the parser reduces a set of other tokens into), and has a type - which we represent as a string. Unfortunately, due to the complex comparisons we'll be doing later, it's a huge hassle to use an enum with a generic class as I did in the lexer I built that I linked to earlier.

Later on (once we've built the parse table), we'll extend this class to support attaching values and other such pieces of information to it, but for now we'll leave that out to aid simplicity.

I also override Equals() and GetHashCode() in order to make comparing tokens easier later on. Overriding ToString() makes the debugging process much easier later, as we'll see in the next post!

With a class to represent a token, we also need one to represent a rule. Let's create one now:

public class ParserRule
{
    /// <summary>
    /// A function to call when a reduce operation utilises this rule.
    /// </summary>
    public Action MatchingAction;
    public ParserToken LeftSide;
    public ParserToken[] RightSideSequence;

    public ParserRule(Action inMatchingAction, ParserToken inLeftSide, params ParserToken[] inRightSideSequence)
    {
        if (inLeftSide.Class != ParserTokenClass.NonTerminal)
            throw new ArgumentException("Error: The left-hand side must be a non-terminal token type.");

        MatchingAction = inMatchingAction;
        LeftSide = inLeftSide;
        RightSideSequence = inRightSideSequence;
    }

    public bool RightSideSequenceMatches(IEnumerable<ParserToken> otherRhs)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRhs)
        {
            // Guard against the other sequence being longer than this rule's right-hand-side
            if (i >= RightSideSequence.Length)
                return false;

            if (!nextToken.Equals(RightSideSequence[i]))
                return false;

            i++;
        }
        return true;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        result.Append($"ParserRule: {LeftSide} = ");
        foreach (ParserToken nextToken in RightSideSequence)
            result.Append($" {nextToken}");
        result.Append(";");
        return result.ToString();
    }
}

The above represents a single parser rule, such as <number> ::= <digit> <number>. Here we have the token on the left-hand-side (which we make sure is a non-terminal), and an array of tokens (which can be either terminal or non-terminal) for the right-hand-side. We also have an Action (which is basically a lambda function) that we'll call when we match against the rule, so that we have a place to hook into when we write code that actually does the tree building (not to be confused with the shift-reduce parser itself).

Here I also add a method that we'll need later, which compares an array of tokens against the current rule, to see if they match - and we override ToString() here again to aid debugging.
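
As a quick hedged usage sketch of my own (not from the original post), here's how the <number> rule mentioned above might be instantiated and printed:

// Hedged example: encoding <number> ::= <digit> <number> with an empty matching
// action as a placeholder, then printing it via the ToString() override.
ParserRule numberRule = new ParserRule(
    () => { /* build a parse tree node here later */ },
    ParserToken.NonTerm("number"),
    ParserToken.NonTerm("digit"),
    ParserToken.NonTerm("number")
);

Console.WriteLine(numberRule);
// ParserRule: [ParserToken NT: number] =  [ParserToken NT: digit] [ParserToken NT: number];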

Now that we can represent tokens and rules, we can start thinking about representing configurations and states. Not sure what these are? All will be explained in the next post, don't worry :-) For now, a state can be seen as a row in the parse table, and it contains a number of configurations - which are like routes to different other states that the parser decides between, depending on where it's gotten to in the token stream.

public enum ParseTableAction
{
    Shift,
    Reduce,
    Goal,
    Error
}

public class ParseTableConfiguration
{
    public readonly ParserRule Rule;
    public readonly int RhsPosition;

    public ParseTableAction LinkingAction = ParseTableAction.Error;
    public ParseTableState LinkingState = null;

    public ParserToken TokenAfterDot {
        get {
            return Rule.RightSideSequence[RhsPosition];
        }
    }
    public ParserToken TokenBeforeDot {
        get {
            return Rule.RightSideSequence[RhsPosition - 1];
        }
    }

    /// <summary>
    /// Whether this configuration is the last in the sequence of configurations for the specified rule or not.
    /// </summary>
    /// <value><c>true</c> if is last in rule; otherwise, <c>false</c>.</value>
    public bool IsLastInRule {
        get {
            return RhsPosition > Rule.RightSideSequence.Length - 1;
        }
    }

    public ParseTableConfiguration(ParserRule inRule, int inRhsPosition)
    {
        Rule = inRule;
        RhsPosition = inRhsPosition;
    }

    public IEnumerable<ParserToken> GetParsedRhs()
    {
        return Rule.RightSideSequence.TakeWhile((ParserToken token, int index) => index <= RhsPosition);
    }

    public bool MatchesRhsSequence(ParserRule otherRule)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRule.RightSideSequence)
        {
            if (i > RhsPosition || i >= Rule.RightSideSequence.Length)
                break;

            // Compare the other rule's right-hand-side against this configuration's
            // rule, up to the position of the dot
            if (!nextToken.Equals(Rule.RightSideSequence[i]))
                return false;

            i++;
        }
        return true;
    }

    public override bool Equals(object obj)
    {
        ParseTableConfiguration otherConfig = obj as ParseTableConfiguration;
        if (otherConfig == null) return false;
        return Rule == otherConfig.Rule && RhsPosition == otherConfig.RhsPosition;
    }
    public override int GetHashCode()
    {
        return $"{Rule}:{RhsPosition}".GetHashCode();
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();

        result.Append($"Configuration: {LinkingAction} ");
        if (LinkingState != null)
            result.Append($"to State {LinkingState.Id} ");
        result.Append($"{Rule.LeftSide} = ");

        for (int i = 0; i <= Rule.RightSideSequence.Length; i++)
        {
            if (i == RhsPosition)
                result.Append(" * ");
            if (i == Rule.RightSideSequence.Length)
                continue;
            result.Append($"{Rule.RightSideSequence[i]} ");
        }
        result.Append(";");
        return result.ToString();
    }
}

This class is slightly more complicated. First, we define an enum that holds information about what the parser should do if it chooses this configuration. Then, we declare the configuration class itself. This entails specifying which parse rule we're deriving the configuration from, which tokens in the right-hand-side of the rule should have been parsed already, and which should still be somewhere in the token stream. Again, I'll explain this in more detail in the next post!

Then, we declare a few utility methods and properties to fetch different parts of the configuration's rule, such as the tokens to the immediate left and right of the right-hand-side position (which was represented as a dot . in the book I followed), all the tokens before the dot, and whether a given rule matches this configuration on the basis of everything before the dot.

Finally, I continue with the trend of overriding the equality checking methods and ToString(), as it makes a world of difference in the code coming up in future blog posts!

Now that we've got a class for configurations, the last one on our list is one for the states themselves. Let's do that now:

public class ParseTableState
{
    public readonly ParseTable ParentTable;

    public int Id {
        get {
            return ParentTable.FindStateId(this);
        }
    }

    public List<ParseTableConfiguration> Configurations = new List<ParseTableConfiguration>();

    public ParseTableState(ParseTable inParentTable)
    {
        ParentTable = inParentTable;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        foreach(ParseTableConfiguration nextConfiguration in Configurations)
            result.AppendLine($"     - {nextConfiguration}");
        return result.ToString();
    }
}

Much simpler than the configuration class, right? :P As I mentioned earlier, all a state consists of is a list of configurations in that state. In our case, we'll be assigning an id to the states in our parse table, so I include a property here that fetches a state's id from the parent parse table that it's part of, to make the later code as simple as possible.
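
FindStateId itself isn't shown in this post, so here's a guess at how it might work - a hedged sketch of my own, since the real ParseTable class may well look different:

// Hedged sketch (not from the original post): one possible shape for the
// ParseTable class, where a state's id is simply its index in the list of states.
using System.Collections.Generic;

public class ParseTable
{
    public List<ParseTableState> States = new List<ParseTableState>();

    public int FindStateId(ParseTableState state)
    {
        return States.IndexOf(state);
    }
}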

Still with me? Congratulations! You've got the beginnings of a shift-reduce parser. Next time, we'll expand on some of the theory behind the things I've touched on in this post, and possibly look at building the start of the recursive parse table builder itself.

Found this interesting? Confused about something? Comment below!

Change the way you think: Languages and Compilers in Review

Change the way you think.

I haven't seen it used recently, but when I first arrived at Hull University to start my degree, that's the phrase I saw on a number of posters about the place. I've been thinking about it a lot over the course of my degree - and I've found that it has rung true more than once. My understanding of programming has been transformed 3 times that I can count, and before I get to the Languages and Compilers review that this post is supposed to be about, I'd like to talk a little bit about that first.

The first time my understanding transformed was when I arrived in my first year. Up until that point, I'd been entirely self-taught - learning languages such as _GML_ and later Javascript. All of a sudden, I was introduced to the concept of _Object-Oriented Programming_, and suddenly I understood that by representing things as classes and objects, it was possible to build larger programs without having them fall apart because they were a nightmare to maintain.

Then, when I was finishing up my year in industry, my understanding was transformed again. I realised the value of the experiences I'd had while I was out in industry - they not only showed me mechanisms by which a project can be effectively managed, but also gave me a bit more of an idea of what I'd like to do when I finish my degree.

Finally, I think that Languages and Compilers is the third time I've changed the way I think about programming. It's transformed my understanding of how programming languages are built, how their compilers and interpreters work, and why things are the way they are. It's also shown me that there's no best programming language - there's only the best one for the task at hand.

With this in hand, it gives me the tools I need to pick up and understand a new language much more easily than I could before, by comparing its features to the ones that I already know about. I've found a totally new way of looking at programming languages: looking at them not on their own, but at how their lexical style and paradigms compare to those employed by other languages.

If you're considering whether a degree in Computer Science is worth it, I'd say that if you're serious about programming for a living, then the answer is a resounding yes.

Have your own thoughts to add about (your?) CS degree? Have a question? Post a comment below!

Nasty Example SPL Programs

I've written a few example test programs to debug various edge cases whilst writing a compiler for my coursework at University, and I thought I'd post them here to help others out. As this coursework is ongoing I'm not going to say anything else, but I hope that these help someone else out :-)

Anyone posting coursework compiler code here will have it deleted (but additional example SPL programs are welcome). In addition, I make no guarantees that these programs are even valid SPL that should compile.

Nasty A

NastyA:
DECLARATIONS
    a OF TYPE INTEGER;
CODE
    READ(a);
    WRITE(a);
    WRITE(45, 'c', 5.68249)
ENDP NastyA.

Nasty B

NastyB:

DECLARATIONS

    i, toggle, step, limit OF TYPE INTEGER;

CODE

    1 -> toggle;
    1 -> step;
    10 -> limit;
    FOR i IS 1 BY step TO limit DO
        IF toggle = 1 THEN
            -1 -> step;
            -1 -> toggle;
            -10 -> limit
        ENDIF;
        WRITE(i);
        NEWLINE;
        WRITE(toggle);
        NEWLINE;
        WRITE(step);
        NEWLINE;
        NEWLINE
    ENDFOR;
    NEWLINE

ENDP NastyB.

Nasty C

NastyC:

DECLARATIONS
    a,b,c OF TYPE INTEGER;
    d,e,f,g OF TYPE REAL;
CODE
    5 + 8 -> a;
    4.5 + 6.8 -> d;
    WRITE(a);
    NEWLINE;
    WRITE(d);
    NEWLINE;
    WRITE((8 - 4));
    NEWLINE
ENDP NastyC.

Compilers 101: Build your own flex + bison compiler in a few easy(?) steps

So you want to build your own compiler? Great! Don't know where to start? This guide should help! At University, we're building our own compiler for a custom programming language invented by our lecturer with a pair of GNU tools by the name of flex and bison - which I've blogged about before. Since that post, I've learnt a ton about how the whole process works, so I thought I'd write up a more coherent blog post on the subject :P

A diagram explaining the process of building a compiler. Explained below.

Stage 1: Planning

The whole process starts with railroad diagrams (also known as syntax diagrams) of the language you want to write a compiler for. Having an accurate set of railroad diagrams is essential to understanding precisely how the language is put together, which is rather useful for the next step.

That next step is converting the railroad diagrams into plain BNF (Backus-Naur Form). Unfortunately, bison doesn't support EBNF-like notation at the current time, so only plain-old BNF will do.

Stage 2: Lexing

With your railroad diagrams converted into BNF, you can start writing code! The first chunk of code that needs writing is the lexer. Lexing is what flex is good at - it involves converting the input source code into lexemes: discrete sequences of characters that each match a particular pattern and can be assigned a particular category name, turning them into tokens. Perhaps an example would help. Consider the following:

void do_awesome_stuff(int a, string b) {
    /* Code here */
}

The above can be turned into a sequence of tokens, not unlike the following (ignoring whitespace tokens, of course):

TYPE: void
IDENTIFIER: do_awesome_stuff
OPEN_BRACKET: (
TYPE: int
IDENTIFIER: a
COMMA: ,
TYPE: string
IDENTIFIER: b
CLOSE_BRACKET: )
OPEN_BRACE: {
COMMENT: /* Code here */
CLOSE_BRACE: }

See? We can identify 8 token types in the source string: TYPE, IDENTIFIER, COMMA, OPEN_BRACKET, CLOSE_BRACKET, OPEN_BRACE, COMMENT, and CLOSE_BRACE. These types and the rules to match them can be found by analysing a combination of the railroad diagrams and the BNF you created earlier.

Stage 3: Parser the first

With a lexer in hand, we can now look at writing the parser. This is done in two stages: writing the parser itself, and then upgrading said parser to generate a parse tree.

Let's talk about the parser first. The parser can be largely created simply by running a few regular-expression find and replace rules on your BNF, actually. From there, it's just a case of adding the header and the footer to complete the document.

Let's take a look at some example BNF:

<instructions> ::= START <lines> END

<lines> ::= <lines> <line>
    | <line>

<line> ::= <command>

<command> ::= <cmd_name> <number>

<cmd_name> ::= FD
    | BK
    | LT
    | RT

<number> ::= <number> <digit>
    | <digit>

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The above matches something like the following:

START
FD 100
RT 180
FD 125
LT 90
BK 50
END

Very interesting (a virtual cookie is available for anyone who gets the reference as to what this grammar represents!). Let's look at converting that into something bison can understand:

instructions : START lines END
    ;

lines : lines line
    | line
    ;

line : command
    ;

command : cmd_name number
    ;

cmd_name : FD
    | BK
    | LT
    | RT
    ;

number : number digit
    | digit
    ;

digit : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

That's looking much better already! Simply by using the regular expression substitutions:

  1. <([a-z_]+)> -> $1
  2. ::= -> :
  3. \n\n -> \n\t;\n\n

....we can get most of the way there to something that bison can understand. Next, we need to refactor it a bit to tell it which tokens are coming from the lexer (which I'll leave up to you to write as an exercise), so it doesn't get them confused with the rules - which are defined in the BNF(-like) rules.

Let's get rid of the rules for number and digit first, since we can do those in the lexer quite easily. Next, let's add a %token definition to the top to tell bison which tokens are coming from the lexer. It's good practice to define everything that comes from the lexer in uppercase, and everything that's a rule that exists only in bison in lowercase:

%start instructions
%token FD BK LT RT NUMBER START END

We've also defined a start symbol - the one that, when bison reaches it, tells it that the parsing process is complete (bison is a bottom-up parser).

Lastly, we need to reference the lexer itself. Thankfully that's easy too by appending to your new bison file:

%%

#include "lex.yy.c"

Very nice. Don't forget about the new line at the end of the file - flex and bison will complain if it isn't present! Here's the completed bison file:

%start instructions
%token FD BK LT RT NUMBER START END

instructions : START lines END
    ;

lines : lines line
    | line
    ;

line : command
    ;

command : cmd_name NUMBER
    ;

cmd_name : FD
    | BK
    | LT
    | RT
    ;

%%

#include "lex.yy.c"

With a brand-new bison file completed, there's just one component of the parser left - a plain-old C file that calls it. Let's create one of those quickly:


#include <stdio.h>

int yyparse(void);

int main(void)
{       
    # if YYDEBUG == 1
    extern int yydebug;
    yydebug = 1;
    #endif

    return yyparse();
}

void yyerror(char *error_message)
{
    fprintf(stderr, "Error: %s\nExiting\n", error_message);
}

The lines referring to YYDEBUG enable a special debugging mode built into bison, if the standard compile-time symbol YYDEBUG is specified - and bison is run with a few special parameters. Here's the sequence of commands needed to compile this:

flex lexer.l
bison -tv parser.y
gcc -Wall -Wextra -g parser.tab.c main.c -lfl -ly -DYYDEBUG -D_XOPEN_SOURCE=700

The gcc command is probably a bit long-winded, but it does several useful things for us:

  • Shows additional warnings just in case we've made a mistake that might be an issue later (-Wall -Wextra)
  • Includes additional debugging information in the output file to allow debugging with gdb (the GNU Debugger) if necessary (-g)
  • Fixes strange errors on some systems (-D_XOPEN_SOURCE=700)

If you're on a Windows system, you may need to remove the -ly - which appears to be required on the Linux machines I use - it tells gcc that we'll be referencing the bison library.

Stage 4: Parser again

Congratulations on getting this far! You've now got a lexer and a parser - so it's time to put them to use. This is done by utilising the parser to build a parse tree - a tree of nodes that represent the input. Here's an example tree:

An example parse tree.

As you can see, each high-level node references one or more lower-level nodes, and the structure of the tree represents the first 2 lines of the example input above. The nodes in yellow are lexical tokens that come directly from flex - these are called terminals, or leaf nodes. The ones in purple come from the bison rules (which we derived from the BNF we wrote at the beginning of this post) - and are non-terminals, or tree nodes.

With this in mind, let's introduce another feature or two of bison. Firstly, let's take a look at revising that %token declaration we created above:

%token FD BK LT RT START END
%token<val_num> NUMBER

The important bit here is the <val_num>. Here we tell bison that a value should be attached to the token NUMBER - and that it will be of type int. After telling bison that it should expect a value, we need to give it a place to put it. Let's write some more code to go just below the %token declarations:

%union {
    int val_num;
}

There we go! Excellent - we've got a place to put it. Now we just need to alter the lexer to convert the token value to an int and put it there. That's not too tough, thankfully - but if you're having trouble with it, here's a hint:

{number}        { yylval.val_num = atoi(yytext); return(NUMBER); }

Now we have it passing numbers correctly, let's look briefly at generating that parse tree. I'm not going to give the game away - just a few helpful hints as to what you need to do here - otherwise it's not as fun :P

Generating the parse tree can be considered both the most challenging part of the experience (especially if you don't know what you're doing) and the easiest to deal with at the same time. Knowing your stuff and your end goal before you start makes the whole process a lot easier.

The first major step is to create a struct that can represent a type of node in your parse tree. It might be useful to store several properties here - such as the node type (an enum will come in handy here), a spot for the value of a lexical token (or a reference to it in a symbol table if you have one), and references to child nodes in the parse tree.

The second major step of the process is to create a utility method that creates a new node of the tree on the heap, and then to revise the bison file to get each rule to create new nodes on the tree, in such a way that it builds a complete parse tree by the time it reaches the start node (or top node of the tree - which, in the case of the above, is instructions). For the purposes of this post, I'll be using a method with this prototype:

TreeNode create_node(int item, int node_type, TreeNode left, TreeNode right);

Your tree node struct and subsequent creation method may vary. With this in hand, we can revise the bison rules we created above to create these nodes we've been talking about. Here's a quick pointer on how to revise the rule for command above:

command : cmd_name NUMBER   { $$ = create_node($2, NODE_COMMAND, $1, NULL); }
    ;

This might look a bit strange, but let's break it down. The bit in curly braces is some (almost) plain C code that creates the node and returns a pointer to it to bison. The $$ is the return value for that node - which, I might add, reminds me of something I forgot above. We need to tell bison about our new tree node data type and which rules should return it:

%type<val_tnode> instructions lines line command cmd_name

/* And in %union { ... } ..... */
TreeNode val_tnode;

This is almost the same as the %token<val_num> we did before, but here we're defining the return type of a rule - not a token. With that little interlude out of the way, let's return to the code above. $1 and $2 refer to the first and second items in the rule definition respectively - and hold the type that we defined above in the %token and %type directives. Since bison is a bottom-up parser, this means that by the time this code executes, all its child nodes have (hopefully) been created - and we just have to tie them all up together with a new node. In the case of my little example above, $1 is of type TreeNode, and $2 is of type int (that is, if I didn't make any mistakes further up!).

Stage 5: Blasting off to code generation and beyond

Phew! That's a lot of work. If you've read this far, thank you and well done! It's been a long journey for both you the reader and me the writer, but you're almost done.

While conceptually simple when broken down, the whole process actually gets rather involved - especially when writing the BNF and the parser (the latter of which can be a particular pain due to shift/reduce and reduce/reduce errors), and the amount of code and head scratching you've got to do to get to this point is enormous. My best advice is to take it slow and don't rush - you'll only cause more problems for yourself if you try and jump the gun. Make sure that each stage works as you intend before you continue - back-pedalling to fix an issue can be particularly annoying, as it can be bothersome to work out which stage the bug is actually in.

The last step of the whole process is to actually do something with the parse tree we've worked so hard to create. Thankfully, that's not too difficult - as we can put some additional code in the { } block of the starting symbol to call methods that will do things like perform some optimisation, print the tree to the console - or generate some sweet code. While the actual generation of code is beyond the scope of this article, I may end up posting about some optimisation techniques you can use on a parse tree after I've finished fiddling with float handling, symbol tables, and initial code generation in my ACW (Assessed Course Work).

Found this useful? Found a bug in this post? Got a suggestion? Comment below! Since I don't have any real analytics on this blog besides the server logs, I've no idea if you've read it really unless you comment :P


Flexible Bison: Compiler Theory

One of the modules I've picked to do in my first semester of my third year at university is Languages and their Compilers. Naturally, this entails building a compiler to compile a program that's written in a source language (spec provided, thankfully! :D) into plain old ANSI C.

The tools we're going to be using for this and the steps involved in actually compiling something into another language are somewhat complicated, and I'm having a bit of difficulty getting my head around the different steps a compiler goes through and how these steps relate to the tools we're going to be using. This blog post is my attempt to make sense of what I've learnt so far.

Firstly, let me introduce the tools I'll be using: GNU flex and GNU bison. Apparently they have a much shallower learning curve than other tools out there. At first, this doesn't appear to be the case - but the more I think about it the more I realise that this is true.

Flex, as far as I can tell, is a regular-expression based scanning tokeniser. In other words, it breaks down an input string into a series of tokens. It has a method that, when called, finds and returns the next token from the source string.

Bison uses tokenised output from flex to construct a parse tree. This parse tree is then optimised with redundant nodes removed, loops optimised, and other such tweaks. Finally, this optimised tree is then used to generate the output code.

With the cast introduced, I can get to the stages of a compiler:

  1. Lexical Analysis - Tokenisation
  2. Syntactical Analysis - Conversion of the token stream into a parse tree
  3. Semantic Analysis - Correction of the tree - e.g. automatic type conversion
  4. Intermediate Code Generation - Sometimes the compiler outputs an intermediate representation, such as lists of tuples of 3 values (three-address code). This was needed in older computers that couldn't hold all the steps of a compiler in memory at once! In my case, I'll be outputting the parse tree generated in step 3 I guess - but not to disk, as today we can have all the passes of the compiler in memory at the same time :D
  5. Optimisation - Redundant parts of the parse tree are removed etc. - loops are focused in particular
  6. Code Generation - The output code in the target language is generated here - whether that be in C (very common), Assembly, or another language.

This seems somewhat familiar. The Lexical Analysis phase seems to be rather similar to what flex is designed for, and the Syntactical Analysis stage appears to be what bison does. As for the other stages, I'm not really sure. I'm guessing that it'll become clear later as we build this compiler in stages - but I'm suspecting that we'll be writing them in plain C - unless I've missed something about bison :P

If you've made it this far, thanks for reading! If this feels somewhat disorganised - then it probably is - after all, this is mainly to get it all straight in my own head :P

If you've got any questions, please ask away in the comments below :-)

Understanding your compiler: C#

Sorry for the (very) late post! I fell rather ill on the day before I was going to write the next post, and haven't been well enough to write it until now! Hopefully more cool posts will be on their way soon :-)

(Above: A nice picture of Durham Cathedral. Taken by @euruicimages.)

How many times have you just finished adding a new feature to your latest and greatest program, and hit the "Run" button in Visual Studio or Monodevelop? Have you ever wondered what happens under the hood? Have you ever encountered a strange build error, and just googled the error message in the hope of finding a solution?

In this post, I'm going to take you on a journey to understand the build process for a typical C♯ program that happens every time you press Ctrl + Shift + B (or F8 in Monodevelop). I also hope to show you why I think it's important to understand what happens behind the scenes.

The hierarchy of the c♯ build system.

To start any journey, we need a map. I've created a simple diagram for our journey. We'll dive into the world of the project file further down. Lastly, we'll end our journey at the individual source code files that actually make up your project.

Before that though, we need to make a quick stop in your sln file. The sln file (or Solution File) is the place where everything starts. Normally, you only get one solution file for each project you create. It's the file that keeps track of the name of your project, its id, and where all the projects are that your solution references (they're usually in appropriately named subfolders). Note that it doesn't keep references to the other projects that you reference (that's done at the project file level).

Solution files are sometimes automatically detected too (xbuild does this for sure). If you're in a command line or terminal, you can build an entire solution with just one command, without having to open Visual Studio or Monodevelop:

cd /path/to/awesome/project
# Windows
msbuild AwesomeSpaceProject.sln
# Linux / Mono
xbuild

With that out of the way, we can talk about project files. Project files define which source code files should be built, and how. You can even add custom triggers that run at any time before, during, or after the build process here (this is how I embedded commit hashes in a C♯ binary)!

Though the syntax is quite simple, much of it is, sadly, undocumented or under-documented. Thankfully most of it is quite intuitive, and a little experimentation goes a long way:

<Project>
    ....
    <ItemGroup>
        <Reference Include="System" />
        <Reference Include="System.Drawing" />
        ....
    </ItemGroup>
    <ItemGroup>
        <Compile Include="Program.cs" />
        <Compile Include="GameObjects/Spaceship.cs" />
        <Compile Include="GameObjects/Enemy.cs" />
        ....
    </ItemGroup>
    <ItemGroup>
        <EmbeddedResource Include="Resources\Spritesheet.png" />
        <EmbeddedResource Include="Resources\CoolSpace.ttf" />
    </ItemGroup>
    ....
</Project>

The above is a (simplified) example of a project file. It might be called something like CoolSpace.csproj. It references a few core assemblies, specifies a few C♯ files for compiling, and embeds a resource or two.

Of course, these files are generated for you automatically, but it's always helpful to know how it works, so that not only can you fix it more easily when it goes wrong, but you can also extend it to do extra things that you can't do through the user interface - like using wildcards (careful! Too many wildcards can slow down your build) to specify a range of files that you want to embed without having to go through them all one by one.

Next, let's talk about references. References allow you to pull in code from elsewhere, like a core system library, another project (that's how you link 2 projects in (or even out of!) a solution together), a library of sorts (like NuGet packages), or another random DLL or EXE (yes, you can reference other executable files) lying around. With references, you can spread your code around loads of files and pull in libraries from all over the galaxy, and have the build process follow all of your references around and tie everything up into a nice neat package for you.

The next piece of the puzzle is the builder. The software that actually builds your code. On Windows with Visual Studio, it's called msbuild, and is the original build tool, created by Microsoft, that sets the standard. On Linux, there's a different (but very similar) tool called xbuild. xbuild implements the standard that msbuild sets, allowing solutions written on Windows with Visual Studio to be compiled 9 times out of 10 on Linux (and probably Mac too, though I don't have one to check - let me know in the comments if you have one!) without any changes.

The final stone in the bridge, so to speak, is your code itself. The builder (whether it be xbuild or msbuild) compiles each included file and links the results together into the final executable binary, which itself contains common intermediate language (or CIL) that is executed by the .NET runtime (or mono on Linux / Mac). While CIL is a different topic for a separate time, it's still important, so I'm mentioning it here.

If you've made it this far, congratulations! I hope that it made sense. If not (or even if it did!), please leave a comment down below and I'll try to help you out :-)

Transform your javascript with Browserify

Tired of battling endless <script> tags in your html files? Fed up with messing with a dozen libraries cluttering up the place? Can't see the wood for the trees? Try browserify (+ rollupify + wzrd)! It's amazing! It's awesome! It tidies up your code for you, so you don't have to (perhaps not :P)!

Seriously though, I've just been playing around with browserify, and it's awesome. It's that missing thing I've been trying to find for a long time. But what does it actually do, you ask?

Well, perhaps it's best to use an example. Consider these (relatively) harmless javascript files:

// SillySay.js
"use strict";

function sillySay(sentence) {
    // Split the sentence up into words
    var words = splitWords(sentence);

    // Loop over all the words in the above array and display them one by one
    for(let i in words) {
        alert(words[i]);
    }
}
// WordSplitter.js
"use strict";
function splitWords(sentence) {
    // Split the sentence on whitespace and return the resulting array
    return sentence.split(/\s+/g);
}

To use our (perfectly ridiculous) example code, we not only have to include SillySay.js, but WordSplitter.js (this could be a library you use for example) as well:

<!DOCTYPE html>
<html>
    <head>
        <meta charset='utf-8' />
        <title>Silly Say Demo</title>
    </head>
    <body>
        <p>Silly Say Demo</p>
        <p>By <a href="https://starbeamrainbowlabs.com/">Starbeamrainbowlabs</a></p>

        <!---------------->
        <script src="WordSplitter.js"></script>
        <script src="SillySay.js" charset="utf-8"></script>
        <script>
            window.addEventListener("load", function(event) {
                sillySay("This is a test");
            });
        </script>

        <style>
            html, body { font-size: 100%; }
            body
            {
                font-family: sans-serif;
            }
        </style>
    </body>
</html>

That's looking a bit messy, but imagine what it'd be like if you added another few libraries? Or a new feature in a separate file? See the problem? Browserify solves just this issue. It analyses the dependencies of the entry point to your app, and bundles up all your code into a single file, nice and neat. You can add extra transforms (like plugins), too, to do extra things like automatically insert your app's version, or include other data files automatically, or transpile other languages to javascript automagically (full list here).

Sounds cool yet? Let me give you a quick tutorial on how I set up Browserify, with Rollupify and Wzrd.

Firstly, we need to set things up. If you don't have Node.js installed, do that now. You'll also get npm - Node's (perfectly awesome!) package manager. Next, let's create a quick project and paste in the code above. I've recorded an asciicast (as you may have seen a few times before here) of me going through the process:

(Can't see the asciicast above? Try viewing it here)

If you'd like to take a look at the final result, as written in the asciicast above, you can find it over here. Questions and comments are welcome below :-)

The lost post: Embedding commit hashes in C♯ binaries

I seem to remember that I've started to write this post no less than 3 times, and I've managed to lose the source each and every time. If you're reading this it means that I've managed to complete it this time :D

Imagine you've got a confused client on the phone, asking why the newest feature X in your program doesn't work. You ask them whether they have the latest version of your program installed... and they say that they don't know. Version numbers and changelogs to the rescue! Except.... that last release was rather rushed and you forgot to finish updating the changelog. This is just one scenario in which embedding the latest commit hash into your program is useful.

You could embed the short hash (the first 7 characters) into the version string, for example v3.6.1-375ae31. Then you could compare it against the revision history to see exactly what your codebase looked like when your client's version was built.

For a while now I've wanted to do just this, and now I've finally figured out how to do it cross-platform. This post documents how I went about doing it, and how you can do it too.

The basic principle of the idea is to run a command that will output the latest commit hash to a file before the build starts, and then embed that file into the resulting binary. To achieve this, we need to go about it in 2 parts. Firstly, we need to fiddle with the project file to add an optional pre-build event. Open the project file (MyProject.csproj) in your favourite text editor (but preferably not your favourite IDE such as Visual Studio or MonoDevelop) and add this to the bottom, just before the closing </Project>:

<Target Name="BeforeBuild" BeforeTargets="Build">
    <Exec Command="git rev-parse HEAD &gt;git-hash.txt" WorkingDirectory="$(ProjectDir)" IgnoreExitCode="true" />
</Target>

If you don't use Git, then change git rev-parse HEAD &gt;git-hash.txt in the above to the equivalent command for your version control system. For SVN, this stackoverflow question looks like it'll do the job for Windows - for Linux you should go here.

Once done, the next step is to add the generated file as an embedded resource. We can't do this with the GUI easily here since the file in question hasn't been generated yet! Add the following to the bottom of the csproj file, again just before the </Project>:

<ItemGroup>
    <EmbeddedResource Include="git-hash.txt" />
</ItemGroup>

Remember to change the git-hash.txt to whatever you changed it to above.

Next, save it and reopen the solution in your IDE. The final step is to actually utilise the commit hash (or revision number in SVN) in your program. Since it's just an embedded file, you can simply find it with a bit of reflection, and read it in with a StreamReader. I've written a good tutorial on how to do that over on my Embedding files in C♯ binaries post.
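
As a rough, hedged sketch of what that reading code might look like - note that the exact resource name depends on your project's default namespace, so "MyProject.git-hash.txt" below is just a placeholder (Assembly.GetManifestResourceNames() will tell you the real one):

// Hedged sketch: reading the embedded git-hash.txt back out at runtime.
// "MyProject.git-hash.txt" is a placeholder resource name - adjust it to match
// your default namespace and the filename you chose above.
using System.IO;
using System.Reflection;

public static class CommitHash
{
    public static string Get()
    {
        Assembly assembly = Assembly.GetExecutingAssembly();
        using (Stream resource = assembly.GetManifestResourceStream("MyProject.git-hash.txt"))
        {
            if (resource == null)
                return "unknown"; // the resource wasn't embedded for some reason
            using (StreamReader reader = new StreamReader(resource))
                return reader.ReadToEnd().Trim();
        }
    }
}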

Make sure that your program is prepared to handle junk instead of a commit hash - you can't predict the contents of the embedded file if Git (or SVN) isn't installed on the machine used to build your project. If you want to require that the commit hash (or revision number) is actually present, just remove the IgnoreExitCode="true" from the first snippet above.
