Starbeamrainbowlabs


Disassembling .NET Assemblies with Mono

As part of the Component-Based Architectures module on my University course, I've been looking at what makes the .NET ecosystem tick, and how .NET assemblies (i.e. .NET .exe / .dll files) are put together. In the process, we looked at disassembling .NET assemblies into the text-form of the Common Intermediate Language (CIL) that they contain. The instructions on how to do this were Windows-specific though - so I thought I'd post about the process on Linux and other platforms here.

Our tool of choice will be Mono - but before we get to that we'll need something to disassemble. Here's a good candidate for the role:

using System;

namespace SBRL.Demo.Disassembly {
    static class Program {
        public static void Main(string[] args) {
            int a = int.Parse(Console.ReadLine()), b = 10;
            Console.WriteLine(
                "{0} + {1} = {2}",
                a, b,
                a + b
            );
        }
    }
}

Excellent. Let's compile it:

csc Program.cs

This should create a new Program.exe file in the current directory. Before we get to disassembling it, it's worth mentioning how the compilation and execution process works in .NET. It's best explained with the aid of a diagram:

Left-to-right flowchart: Multiple source languages get compiled into Common Intermediate Language, which is then executed by an execution environment.

As depicted in the diagram above, source code in multiple languages gets compiled (maybe not with the same compiler, of course) into Common Intermediate Language, or CIL. This CIL is then executed in an Execution Environment - which is usually a virtual machine (Nope! Not as in VirtualBox and KVM. It's not a separate operating system as such, but rather a layer of abstraction), which may (or may not) decide to compile the CIL down into native code through a process called JIT (Just-In-Time) compilation.

It's also worth mentioning here that the CIL code generated by the compiler is in binary form, as this takes up less space and is (much) faster for the computer to operate on. After all, CIL is designed to be efficient for a computer to understand - not people!
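To see that binary form for yourself, here's a minimal sketch that uses reflection to dump the raw CIL bytes of a method. The IlBytesDemo class and its Add method are made up for illustration. The exact bytes will vary with the compiler and its settings, but the add opcode (0x58) should be in there somewhere.

```csharp
using System;
using System.Reflection;

public static class IlBytesDemo
{
    // A tiny method whose compiled body we can inspect.
    public static int Add(int a, int b)
    {
        return a + b;
    }

    // Returns the raw (binary) CIL of the Add method above.
    public static byte[] GetAddIl()
    {
        MethodInfo method = typeof(IlBytesDemo).GetMethod("Add");
        return method.GetMethodBody().GetILAsByteArray();
    }

    public static void Main()
    {
        // Prints the method body as hex; the exact bytes depend on the
        // compiler and its settings.
        Console.WriteLine(BitConverter.ToString(GetAddIl()));
    }
}
```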

We can make it more readable by disassembling it into its textual equivalent. Doing so with Mono is actually quite simple:

monodis Program.exe >Program.il

Here I redirect the output to a file called Program.il for convenience, as my editor has a plugin for syntax-highlighting CIL. For those reading without access to Mono, here's what I got when disassembling the above program:

.assembly extern mscorlib
{
  .ver 4:0:0:0
  .publickeytoken = (B7 7A 5C 56 19 34 E0 89 ) // .z\V.4..
}
.assembly 'Program'
{
  .custom instance void class [mscorlib]System.Runtime.CompilerServices.CompilationRelaxationsAttribute::'.ctor'(int32) =  (01 00 08 00 00 00 00 00 ) // ........

  .custom instance void class [mscorlib]System.Runtime.CompilerServices.RuntimeCompatibilityAttribute::'.ctor'() =  (
        01 00 01 00 54 02 16 57 72 61 70 4E 6F 6E 45 78   // ....T..WrapNonEx
        63 65 70 74 69 6F 6E 54 68 72 6F 77 73 01       ) // ceptionThrows.

  .custom instance void class [mscorlib]System.Diagnostics.DebuggableAttribute::'.ctor'(valuetype [mscorlib]System.Diagnostics.DebuggableAttribute/DebuggingModes) =  (01 00 07 01 00 00 00 00 ) // ........

  .hash algorithm 0x00008004
  .ver  0:0:0:0
}
.module Program.exe // GUID = {D6162DAD-AD98-45B3-814F-C646C6DD7998}

.namespace SBRL.Demo.Disassembly
{
  .class private auto ansi beforefieldinit Program
    extends [mscorlib]System.Object
  {

    // method line 1
    .method public static hidebysig 
           default void Main (string[] args)  cil managed 
    {
        // Method begins at RVA 0x2050
    .entrypoint
    // Code size 47 (0x2f)
    .maxstack 5
    .locals init (
        int32   V_0,
        int32   V_1)
    IL_0000:  nop 
    IL_0001:  call string class [mscorlib]System.Console::ReadLine()
    IL_0006:  call int32 int32::Parse(string)
    IL_000b:  stloc.0 
    IL_000c:  ldc.i4.s 0x0a
    IL_000e:  stloc.1 
    IL_000f:  ldstr "{0} + {1} = {2}"
    IL_0014:  ldloc.0 
    IL_0015:  box [mscorlib]System.Int32
    IL_001a:  ldloc.1 
    IL_001b:  box [mscorlib]System.Int32
    IL_0020:  ldloc.0 
    IL_0021:  ldloc.1 
    IL_0022:  add 
    IL_0023:  box [mscorlib]System.Int32
    IL_0028:  call void class [mscorlib]System.Console::WriteLine(string, object, object, object)
    IL_002d:  nop 
    IL_002e:  ret 
    } // end of method Program::Main

    // method line 2
    .method public hidebysig specialname rtspecialname 
           instance default void '.ctor' ()  cil managed 
    {
        // Method begins at RVA 0x208b
    // Code size 8 (0x8)
    .maxstack 8
    IL_0000:  ldarg.0 
    IL_0001:  call instance void object::'.ctor'()
    IL_0006:  nop 
    IL_0007:  ret 
    } // end of method Program::.ctor

  } // end of class SBRL.Demo.Disassembly.Program
}

Very interesting. There are a few things of note here:

  • The metadata at the top of the CIL tells the execution environment a bunch of useful things about the assembly, such as the version number, the classes contained within (and their signatures), and a bunch of other random attributes.
  • An extra .ctor method has been generated for us automatically. It's the class' constructor, and it automagically calls the base constructor of the object class, since all classes are descended from object.
  • The ints a and b are boxed before being passed to Console.WriteLine. Exactly what this does and why is quite complicated, and best explained by this Stack Overflow answer.
  • We can deduce that CIL is a stack-based language from the add instruction, as it has no arguments.
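If the stack-based idea is new to you, here's a toy sketch of it. This is not how Mono executes CIL - the push instruction and the StackMachineDemo class are invented for illustration - but it shows why add needs no arguments: it simply pops its two operands off the stack and pushes the result.

```csharp
using System;
using System.Collections.Generic;

public static class StackMachineDemo
{
    // Runs a tiny made-up instruction set and returns the value left on top of the stack.
    public static int Run(string[] program)
    {
        Stack<int> stack = new Stack<int>();
        foreach (string instruction in program)
        {
            string[] parts = instruction.Split(' ');
            switch (parts[0])
            {
                case "push":
                    stack.Push(int.Parse(parts[1]));
                    break;
                case "add":
                    // No operands in the instruction itself: both come from the stack
                    int right = stack.Pop();
                    int left = stack.Pop();
                    stack.Push(left + right);
                    break;
                default:
                    throw new ArgumentException("Unknown instruction: " + instruction);
            }
        }
        return stack.Pop();
    }

    public static void Main()
    {
        Console.WriteLine(Run(new[] { "push 5", "push 10", "add" })); // 15
    }
}
```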

I'd recommend that you explore this on your own with your own test programs. Try changing things and see what happens!

  • Try making the Program class static
  • Try refactoring the int.Parse(Console.ReadLine()) into its own method. How is the variable returned?

This isn't all, though. We can also recompile the CIL back into an assembly with the ilasm command:

ilasm Program.il

This makes for some additional fun experiments:

  • See if you can find where b's value is defined, and change it
  • What happens if you alter the Console.WriteLine() format string so that it becomes invalid?
  • Can you get ilasm to reassemble an executable into a .dll library file?

Found this interesting? Discovered something cool? Comment below!


Building a line-by-line lexer in C#

So there I was. It was a lazy afternoon before my final exam of the semester, and I was idly looking through some old code. One thing led to another, and I ended up writing a line-based scanning lexer in C# - and I thought I'd share it here, pulling it apart and putting it back together again.

The aim was to build something regular expression based, that would be flexible enough that it could be used in a wide range of applications - but not too difficult or confusing to pick up, use and add to another project. The final code isn't currently available in a public repository (it's actually for a personal project - maybe I'll get around to refactoring it into a library if there's the demand), but I'll still post the finished code at the end of this post.

To start with, let's define some data classes to hold some input and output information. Hrm. Let's see - we'll need a class to represent a rule that the lexer will utilise, and one to represent the tokens that our lexer will be emitting. Let's deal with the one for the rules first:

public class LexerRule<TokenType>
{
    public readonly TokenType Type;
    public readonly Regex RegEx;
    public bool Enabled { get; set; } = true;

    public LexerRule(TokenType inName, string inRegEx)
    {
        if (!typeof(TokenType).IsEnum)
            throw new ArgumentException($"Error: inName must be an enum - {typeof(TokenType)} passed");

        Type = inName;
        RegEx = new Regex(inRegEx);
    }

    public bool Toggle()
    {
        Enabled = !Enabled;
        return Enabled;
    }
}

Here I define a template (or generic) class that holds a regular expression, and associates it with a value from an enum. There's probably a better / cleaner way to make sure that TokenType is an enum, but for now this should serve its purpose just fine. I also add a simple Enabled boolean property - as we'll be adding support for dynamically enabling and disabling rules later on.
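One possibly cleaner way, assuming a compiler that supports C♯ 7.3 or later, is a generic constraint. The sketch below uses a hypothetical ConstrainedLexerRule class (not part of the lexer in this post) to show the idea: the runtime check disappears, because a non-enum type argument is rejected at compile time instead.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical variant of LexerRule. Requires C# 7.3+ for the Enum constraint.
public class ConstrainedLexerRule<TokenType> where TokenType : struct, Enum
{
    public readonly TokenType Type;
    public readonly Regex RegEx;
    public bool Enabled { get; set; } = true;

    public ConstrainedLexerRule(TokenType inName, string inRegEx)
    {
        // No typeof(TokenType).IsEnum check needed: the constraint enforces it
        Type = inName;
        RegEx = new Regex(inRegEx);
    }
}

// Example enum, made up for this sketch
public enum DemoToken { Unknown = 0, Number }

public static class ConstrainedRuleDemo
{
    public static void Main()
    {
        var rule = new ConstrainedLexerRule<DemoToken>(DemoToken.Number, @"\d+");
        Console.WriteLine(rule.RegEx.IsMatch("42")); // True

        // The following line would fail to compile, as int is not an enum:
        // var bad = new ConstrainedLexerRule<int>(0, @"\d+");
    }
}
```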

Next up, let's tackle the class for the tokens that we're going to be emitting:

public class LexerToken<TokenType>
{
    public readonly bool IsNullMatch = false;
    public readonly LexerRule<TokenType> Rule = null;
    public readonly Match RegexMatch;

    public TokenType Type {
        get {
            try {
                return Rule.Type;
            }
            catch (NullReferenceException) {
                return default(TokenType);
            }
        }
    }
    private string nullValueData;
    public string Value {
        get {
            return IsNullMatch ? nullValueData : RegexMatch.Value;
        }
    }

    public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
    {
        Rule = inRule;
        RegexMatch = inMatch;
    }
    public LexerToken(string unknownData)
    {
        IsNullMatch = true;
        nullValueData = unknownData;
    }

    public override string ToString()
    {
        return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
    }
}

A little more complex, but still manageable. It, like its LexerRule cousin, is also a template (or generic) class. It holds the type of token it is and the regular expression Match object generated during the scanning process. It also has something strange going on with Value and nullValueData - this is so that we can emit tokens with an 'unknown' type for the text in between that doesn't match any known rule. We'll be covering this later.

With our data classes in place, it's time to turn our attention to the lexer itself. Let's put together some scaffolding to get an idea as to how it's going to work:

public class Lexer<TokenType>
{
    public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

    public int CurrentLineNumber { get; private set; } = 0;
    public int CurrentLinePos { get; private set; } = 0;
    public int TotalCharsScanned { get; private set; } = 0;

    private StreamReader textStream;

    public Lexer()
    {

    }

    public void AddRule(LexerRule<TokenType> newRule);
    public void AddRules(IEnumerable<LexerRule<TokenType>> newRules);

    public void Initialise(StreamReader reader);

    public IEnumerable<LexerToken<TokenType>> TokenStream();

    public void EnableRule(TokenType type);
    public void DisableRule(TokenType type);
    public void SetRule(TokenType type, bool state);
}

There - that should do the trick! CurrentLineNumber, CurrentLinePos, and TotalCharsScanned are properties to keep track of where we've got to, and textStream is the StreamReader we'll be reading data from. Then, we've got some methods that will add new lexer rules to Rules, methods to enable and disable rules by token type, a method to initialise the lexer with the correct textStream, and finally a generator method that will emit the tokens.

With our skeleton complete, let's fill out a few of those methods:

public void AddRule(LexerRule<TokenType> newRule)
{
    Rules.Add(newRule);
}
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
{
    Rules.AddRange(newRules);
}

public void Initialise(StreamReader reader)
{
    textStream = reader;
}

public void EnableRule(TokenType type)
{
    SetRule(type, true);
}
public void DisableRule(TokenType type)
{
    SetRule(type, false);
}
public void SetRule(TokenType type, bool state)
{
    foreach (LexerRule<TokenType> rule in Rules)
    {
        // We have to do a string comparison here because of the generic type we're using in multiple nested
        // classes
        if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
            rule.Enabled = state;
            return;
        }
    }
}

Very cool. None of this is particularly exciting - apart from SetRule. In SetRule we have to convert the type argument to a string in order to compare it to the rules in the Rules list, as C♯ doesn't seem to understand that the TokenType on the LexerRule class is the same as the TokenType on the Lexer class - even though they have the same name! This did give me an idea for a trio of additional methods to make manipulating rules easier though:
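A likely explanation for the comparison trouble is that the == operator isn't defined for an unconstrained generic type parameter, so the compiler refuses rule.Type == type. One alternative to comparing names, sketched below with a made-up SameValue helper and Colour enum, is to go through EqualityComparer<T>.Default instead. Treat it as a suggestion rather than a drop-in replacement for SetRule.

```csharp
using System;
using System.Collections.Generic;

public static class GenericCompareDemo
{
    // Example enum, made up for this sketch
    public enum Colour { Red, Green }

    // Compares two values of the same generic type without the == operator.
    public static bool SameValue<T>(T a, T b)
    {
        // "return a == b;" would not compile here for an unconstrained T
        return EqualityComparer<T>.Default.Equals(a, b);
    }

    public static void Main()
    {
        Console.WriteLine(SameValue(Colour.Red, Colour.Red));   // True
        Console.WriteLine(SameValue(Colour.Red, Colour.Green)); // False
    }
}
```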

public void EnableRulesByPrefix(string tokenTypePrefix)
{
    SetRulesByPrefix(tokenTypePrefix, true);
}
public void DisableRulesByPrefix(string tokenTypePrefix)
{
    SetRulesByPrefix(tokenTypePrefix, false);
}
public void SetRulesByPrefix(string tokenTypePrefix, bool state)
{
    foreach (LexerRule<TokenType> rule in Rules)
    {
        // We have to do a string comparison here because of the generic type we're using in multiple nested
        // classes
        if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
        {
            rule.Enabled = state;
        }
    }
}

This set of methods lets us enable or disable rules based on what their names start with. For example, if I have the 3 rules CommentStart, CommentEnd, and FunctionStart, then calling EnableRulesByPrefix("Comment") will enable CommentStart and CommentEnd, but not FunctionStart.
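The name matching on its own can be tried out without the rest of the lexer. Here's a small sketch with a made-up PrefixDemoToken enum and NamesWithPrefix helper. Note that it uses an ordinal comparison rather than CurrentCulture, on the assumption that enum names are plain identifiers and shouldn't depend on the user's locale.

```csharp
using System;
using System.Collections.Generic;

// Example enum, mirroring the 3 rules mentioned above
public enum PrefixDemoToken { CommentStart, CommentEnd, FunctionStart }

public static class PrefixDemo
{
    // Returns the names of all enum members that start with the given prefix.
    public static List<string> NamesWithPrefix(string prefix)
    {
        List<string> result = new List<string>();
        foreach (PrefixDemoToken token in Enum.GetValues(typeof(PrefixDemoToken)))
        {
            string name = Enum.GetName(typeof(PrefixDemoToken), token);
            if (name.StartsWith(prefix, StringComparison.Ordinal))
                result.Add(name);
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", NamesWithPrefix("Comment"))); // CommentStart, CommentEnd
    }
}
```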

With all the interfacing code out of the way, let's turn to the real meat of the subject: The TokenStream method. This is the method behind the magic. It's a bit complicated, so let's take it step-by-step. Firstly, we need to iterate over the lines in the StreamReader:

string nextLine;
List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
while ((nextLine = textStream.ReadLine()) != null)
{
    CurrentLinePos = 0;

    // .....

    CurrentLineNumber++;
    TotalCharsScanned += CurrentLinePos;
}

Fairly simple, right? I've used this construct a few times in the past. Before you ask, we'll get to matches in just a moment :-) Next, we need another while loop that iterates until we reach the end of the line:

while (CurrentLinePos < nextLine.Length)
{
    matches.Clear();
    foreach (LexerRule<TokenType> rule in Rules) {
        if (!rule.Enabled) continue;

        Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
        if (!nextMatch.Success) continue;

        matches.Add(new LexerToken<TokenType>(rule, nextMatch));
    }

    // .....
}

Also fairly easy to follow. Basically, we clear the matches list, and then attempt to find the next match from the current position on the line that we've reached (CurrentLinePos) for every rule - and we store all the successful matches for further inspection and processing. We also make sure we skip any disabled rules here, too.
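The key piece here is the Regex.Match(string, int) overload, which starts searching from a given position rather than from the beginning of the string. Here's a small standalone sketch (the MatchFromDemo class and NextNumber helper are just for illustration) that walks along a line in the same way:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchFromDemo
{
    // Finds the next run of digits at or after the given position.
    public static Match NextNumber(string line, int startAt)
    {
        return new Regex(@"\d+").Match(line, startAt);
    }

    public static void Main()
    {
        string line = "FD 100 RT 90";

        Match first = NextNumber(line, 0);
        Console.WriteLine("{0} at {1}", first.Value, first.Index); // 100 at 3

        // Continue from just after the previous match
        Match second = NextNumber(line, first.Index + first.Length);
        Console.WriteLine("{0} at {1}", second.Value, second.Index); // 90 at 10

        // Nothing left to find once we reach the end of the line
        Console.WriteLine(NextNumber(line, second.Index + second.Length).Success); // False
    }
}
```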

If we don't find any matching rules, then that must mean that we can't match the rest of this line to any known token. In this case, we want to emit an unknown token:

if (matches.Count == 0) {
    yield return new LexerToken<TokenType>(nextLine.Substring(CurrentLinePos));
    break;
}

This is what that extra LexerToken constructor is for that we created above. Note that we yield return here, instead of simply returning - this is very similar in construct to the yield statement in JavaScript that I blogged about before (and again here), in that they allow you to maintain state inside a method and return multiple values in sequence.
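If you haven't come across yield return before, here's about the smallest example I can think of (the YieldDemo class is made up for illustration). The method's state, in this case the loop counter, is preserved between values:

```csharp
using System;
using System.Collections.Generic;

public static class YieldDemo
{
    // Emits the numbers 1 to max, one at a time, as the caller asks for them.
    public static IEnumerable<int> CountTo(int max)
    {
        for (int i = 1; i <= max; i++)
        {
            // Execution pauses here until the next value is requested
            yield return i;
        }
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CountTo(3))); // 1, 2, 3
    }
}
```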

By an 'unknown token', I am referring to the default value of the TokenType enum. Here's an example enum you might use with this lexer class:

public enum SimpleToken {
    Unknown = 0,

    Whitespace,

    Comment,
    BlockOpen,
    BlockClose,
}

Since the value Unknown is explicitly assigned the index 0, we can be absolutely certain that it's the default value of this enum.
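This is easy enough to check with a quick sketch. The DefaultDemoToken enum and DefaultOf helper below are invented for the purpose, but default(T) is the same expression the LexerToken class uses:

```csharp
using System;

// Example enum with an explicit zero member
public enum DefaultDemoToken { Unknown = 0, Whitespace, Comment }

public static class DefaultEnumDemo
{
    // Returns the default value of any type, as default(TokenType) does in LexerToken.
    public static T DefaultOf<T>()
    {
        return default(T);
    }

    public static void Main()
    {
        Console.WriteLine(DefaultOf<DefaultDemoToken>()); // Unknown
    }
}
```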

With our list of potential matches in hand, we next need to sort it in order to work out which one we should prioritise. After deliberating and experimenting a bit, I came up with this:

matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
    // Sort by how far each match starts from the current position
    int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
                nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
    // If they both start at the same position, then go with the longest one
    if (result == 0)
        result = b.RegexMatch.Length - a.RegexMatch.Length;

    return result;
});

Basically, this sorts them so that the matches closest to the current scanning position on the line (CurrentLinePos) are prioritised. If 2 or more matches tie based on this criterion, we prioritise the longest match. This seems to work for now - I can always change it later if it becomes a problem :P
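Here's the same ordering in a standalone sketch that you can play with. One difference to be aware of: it compares Match.Index directly rather than searching for the matched text with IndexOf, which is a swapped-in alternative and not what the lexer in this post does. The MatchSortDemo class and PickNext helper are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchSortDemo
{
    // Tries every pattern from startAt and returns the match that should be
    // emitted next: nearest first, longest on a tie. Returns null if nothing matches.
    public static Match PickNext(string line, int startAt, params string[] patterns)
    {
        List<Match> matches = new List<Match>();
        foreach (string pattern in patterns)
        {
            Match match = new Regex(pattern).Match(line, startAt);
            if (match.Success) matches.Add(match);
        }
        if (matches.Count == 0) return null;

        matches.Sort((Match a, Match b) => {
            // Nearest to the start position first
            int result = a.Index - b.Index;
            // If they both start at the same position, then go with the longest one
            if (result == 0)
                result = b.Length - a.Length;
            return result;
        });
        return matches[0];
    }

    public static void Main()
    {
        // "F" and "FD" both match at position 0, so the longer one wins
        Console.WriteLine(PickNext("FD 100", 0, "F", "FD", @"\d+").Value); // FD
    }
}
```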

With our matches sorted, we can now pick out the one we're going to emit next. Before we do so though, we need to take care of any characters between our current scanning position and the start of the next token:

LexerToken<TokenType> selectedToken = matches[0];
int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;

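if (selectedTokenOffset > 0) {
    // Grab the unmatched text *before* moving the scanning position past it
    string extraTokenContent = nextLine.Substring(CurrentLinePos, selectedTokenOffset);
    CurrentLinePos += selectedTokenOffset;
    yield return new LexerToken<TokenType>(extraTokenContent);
}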

This emits these additional characters as an unknown token as we did before. Finally, we can emit the token and continue onto the next iteration of the loop:

CurrentLinePos += selectedToken.RegexMatch.Length;
yield return selectedToken;

That concludes our TokenStream method - and with it this lexer! Here's the code in full:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace SBRL.Tools
{
    public class LexerRule<TokenType>
    {
        public readonly TokenType Type;
        public readonly Regex RegEx;
        public bool Enabled { get; set; } = true;

        public LexerRule(TokenType inName, string inRegEx)
        {
            if (!typeof(TokenType).IsEnum)
                throw new ArgumentException($"Error: inName must be an enum - {typeof(TokenType)} passed");

            Type = inName;
            RegEx = new Regex(inRegEx);
        }

        public bool Toggle()
        {
            Enabled = !Enabled;
            return Enabled;
        }
    }

    public class LexerToken<TokenType>
    {
        public readonly bool IsNullMatch = false;
        public readonly LexerRule<TokenType> Rule = null;
        public readonly Match RegexMatch;

        public TokenType Type {
            get {
                try {
                    return Rule.Type;
                }
                catch (NullReferenceException) {
                    return default(TokenType);
                }
            }
        }
        private string nullValueData;
        public string Value {
            get {
                return IsNullMatch ? nullValueData : RegexMatch.Value;
            }
        }

        public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
        {
            Rule = inRule;
            RegexMatch = inMatch;
        }
        public LexerToken(string unknownData)
        {
            IsNullMatch = true;
            nullValueData = unknownData;
        }

        public override string ToString()
        {
            return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
        }
    }

    public class Lexer<TokenType>
    {
        public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

        public bool Verbose { get; set; } = false;

        /// <summary>
        /// The number of the line that is currently being scanned.
        /// </summary>
        public int CurrentLineNumber { get; private set; } = 0;
        /// <summary>
        /// The number of characters on the current line that have been scanned.
        /// </summary>
        /// <value>The current line position.</value>
        public int CurrentLinePos { get; private set; } = 0;
        /// <summary>
        /// The total number of characters currently scanned by this lexer instance.
        /// Only updated every newline!
        /// </summary>
        public int TotalCharsScanned { get; private set; } = 0;

        private StreamReader textStream;

        public Lexer()
        {

        }

        public void AddRule(LexerRule<TokenType> newRule)
        {
            Rules.Add(newRule);
        }
        public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
        {
            Rules.AddRange(newRules);
        }

        public void Initialise(StreamReader reader)
        {
            textStream = reader;
        }

        public IEnumerable<LexerToken<TokenType>> TokenStream()
        {
            string nextLine;
            List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
            while ((nextLine = textStream.ReadLine()) != null)
            {
                CurrentLinePos = 0;

                while (CurrentLinePos < nextLine.Length)
                {
                    matches.Clear();
                    foreach (LexerRule<TokenType> rule in Rules) {
                        if (!rule.Enabled) continue;

                        Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
                        if (!nextMatch.Success) continue;

                        matches.Add(new LexerToken<TokenType>(rule, nextMatch));
                    }

                    if (matches.Count == 0) {
                        string unknownTokenContent = nextLine.Substring(CurrentLinePos);
                        if(Verbose) Console.WriteLine("[Unknown Token: No matches found for this line] {0}", unknownTokenContent);
                        yield return new LexerToken<TokenType>(unknownTokenContent);
                        break;
                    }

                    matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
                        // Sort by how far each match starts from the current position
                        int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
                                    nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
                        // If they both start at the same position, then go with the longest one
                        if (result == 0)
                            result = b.RegexMatch.Length - a.RegexMatch.Length;

                        return result;
                    });
                    LexerToken<TokenType> selectedToken = matches[0];
                    int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;

                    if (selectedTokenOffset > 0) {
                        string extraTokenContent = nextLine.Substring(CurrentLinePos, selectedTokenOffset);
                        CurrentLinePos += selectedTokenOffset;
                        if(Verbose) Console.WriteLine("[Unmatched content] '{0}'", extraTokenContent);
                        yield return new LexerToken<TokenType>(extraTokenContent);
                    }

                    CurrentLinePos += selectedToken.RegexMatch.Length;
                    if(Verbose) Console.WriteLine(selectedToken);
                    yield return selectedToken;
                }

                if(Verbose) Console.WriteLine("[Lexer] Next line");
                CurrentLineNumber++;
                TotalCharsScanned += CurrentLinePos;
            }
        }

        public void EnableRule(TokenType type)
        {
            SetRule(type, true);
        }
        public void DisableRule(TokenType type)
        {
            SetRule(type, false);
        }
        public void SetRule(TokenType type, bool state)
        {
            foreach (LexerRule<TokenType> rule in Rules)
            {
                // We have to do a string comparison here because of the generic type we're using in multiple nested
                // classes
                if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
                    rule.Enabled = state;
                    return;
                }
            }
        }

        public void EnableRulesByPrefix(string tokenTypePrefix)
        {
            SetRulesByPrefix(tokenTypePrefix, true);
        }
        public void DisableRulesByPrefix(string tokenTypePrefix)
        {
            SetRulesByPrefix(tokenTypePrefix, false);
        }
        public void SetRulesByPrefix(string tokenTypePrefix, bool state)
        {
            foreach (LexerRule<TokenType> rule in Rules)
            {
                // We have to do a string comparison here because of the generic type we're using in multiple nested
                // classes
                if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
                {
                    rule.Enabled = state;
                }
            }
        }
    }
}

It's a bit much to take in all at once, but hopefully by breaking it down into steps I've made it easier to understand how I built it, and how all the different pieces fit together. The only real difference between the above code and the code I walked through in this post is the Verbose parameter I added for testing purposes, and the associated Console.WriteLine calls. For fun, here's a very basic LOGO (also here) lexer. I've based it on what I remember from using MSWLogo / FMSLogo a long time ago (there seem to be many dialects around these days):

public enum LogoToken
{
    Unknown = 0,

    Whitespace,

    FunctionForwards,
    FunctionBackwards,
    FunctionLeft,
    FunctionRight,
    FunctionPenUp,
    FunctionPenDown,

    Number
}

public class LogoLexer : Lexer<LogoToken>
{
    public LogoLexer()
    {
        AddRules(new List<LexerRule<LogoToken>>() {
            new LexerRule<LogoToken>(LogoToken.Whitespace,    @"\s+"),

            new LexerRule<LogoToken>(LogoToken.FunctionForwards,    @"FD"),
            new LexerRule<LogoToken>(LogoToken.FunctionBackwards,    @"BK"),
            new LexerRule<LogoToken>(LogoToken.FunctionLeft,    @"LT"),
            new LexerRule<LogoToken>(LogoToken.FunctionRight,    @"RT"),

            new LexerRule<LogoToken>(LogoToken.FunctionPenUp,    @"PU"),
            new LexerRule<LogoToken>(LogoToken.FunctionPenDown,    @"PD"),

            new LexerRule<LogoToken>(LogoToken.Number,    @"\d+"),
        });
    }
}

Here's an example LOGO program that it parses:

FD 100 RT 90
FD 50 PU RT 180
BK 40 LT 45 FD 250

...and here's the output from lexing that example program:

[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=100]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=90]
[Lexer] Next line
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=50]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionPenUp, Value=PU]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=180]
[Lexer] Next line
[LexerToken: Type=FunctionBackwards, Value=BK]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=40]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionLeft, Value=LT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=45]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=250]
[Lexer] Next line
[Lexer] Next line

Very cool! This could easily be extended to support more of the LOGO syntax. As an exercise, can you extend it to support the REPEAT statement? At some point in the future, I might go even further and build a bottom-up left-to-right shift-reduce parser, and combine it with this lexer and some BNF to create a parse tree.

Enjoyed this post? Don't quite understand something? Think you could do better? Post a comment below!

Untangling MSBuild: MSBuild for normal people

I don't know about you, but I find the documentation on MSBuild to be rather confusing. Even their definition of what MSBuild is is a bit misleading:

MSBuild is the build system for Visual Studio.

Whilst having to pull apart the .csproj file of a project of mine and put it back together again to get it to do what I wanted, I spent a considerable amount of time reading Microsoft's (bad) documentation and various web tutorials on what MSBuild is and what it does. I'm bound to forget what I've learnt, so I'm detailing it here both to save myself the bother of looking everything up again and to make sense of everything I've read and experimented with myself.

Before continuing, you might find my previous post, Understanding your compiler: C#, an interesting read if you aren't already aware of some of the basics of Visual Studio solutions, project files, msbuild, and the C♯ compiler.

Let's start with a real definition. MSBuild is Microsoft's build framework that ties into Visual Studio, Mono, and basically anything in the .NET world. It has an XML-based syntax that allows one to describe what MSBuild has to do to build a project - optionally depending on other projects elsewhere in the file system. It's most commonly used to build C♯ projects - but it can be used to build other things, too.

The structure of your typical Visual Studio solution might look a bit like this:

The basic structure of a Visual Studio solution. Explained below.

As you can see, the .sln file references one or more .csproj files, each of which may reference other .csproj files - not all of which have to be tied to the same .sln file, although they usually are (this can be handy if you're using Git Submodules). The .sln file will also specify a default project to build.

The file extension .csproj isn't the only one recognised by MSBuild, either - others include .pssproj (PowerShell project), .vcxproj (Visual C++ Project), and .targets (Shared tasks + targets - we'll get to these later) - though the generic extension is simply .proj.

So far, I've observed that MSBuild is pretty intelligent about automatically detecting project / solution files in its working directory - you can call it with msbuild in a directory and most of the time it'll find and build the right project - even if it finds a .sln file that it has to parse first.

Let's get to the real meat of the subject: targets. Consider the following:

<?xml version="1.0" encoding="utf-8"?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" DefaultTargets="Build" ToolsVersion="4.0">
    <Import Project="$(MSBuildBinPath)\Microsoft.CSharp.targets" />

    <Target Name="BeforeBuild">
        <Message Importance="high" Text="Before build executing" />
    </Target>
</Project>

I've simplified it a bit to make it clearer, but in essence the above imports the predefined C♯ set of targets, which includes (amongst others) BeforeBuild, Build itself, and AfterBuild - the first of which is overridden by the local MSBuild project file. Sound complicated? It is a bit!

MSBuild uses a system of targets. When you ask it to do something, you're actually asking it to reach a particular target. By default, this is the Build target. Targets can have dependencies, too - in the case of the above, the Build target depends on the (blank) BeforeBuild target in the default C♯ target set, which is then overridden by the MSBuild project file above.
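To see target dependencies in isolation, here's a minimal sketch of a standalone project file that doesn't import the C♯ targets at all. The target names Prepare and Package are hypothetical - I've picked them purely for illustration:

```xml
<?xml version="1.0" encoding="utf-8"?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" DefaultTargets="Package" ToolsVersion="4.0">
    <!-- Hypothetical target names, purely for illustration -->
    <Target Name="Prepare">
        <Message Importance="high" Text="Preparing" />
    </Target>

    <!-- Asking for Package makes MSBuild reach Prepare first -->
    <Target Name="Package" DependsOnTargets="Prepare">
        <Message Importance="high" Text="Packaging" />
    </Target>
</Project>
```

Saved as, say, example.proj, calling msbuild on it should run Prepare and then Package, since Package is the default target and it depends on Prepare. You can also ask for a specific target with the /t: switch (e.g. /t:Prepare).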

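The second key component of MSBuild project files is the task. I've used one in the example above - the Message task which, obviously enough, outputs a message to the build output. There are lots of other types of task and directive, too (strictly speaking, the first three below are item types that live in an ItemGroup rather than tasks):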

  • Reference - reference another assembly or core namespace
  • Compile - specify a C♯ file to compile
  • EmbeddedResource - specify a file to include as an embedded resource
  • MakeDir - create a directory
  • MSBuild - recursively build another MSBuild project
  • Exec - Execute a shell command (on Windows this is with cmd.exe)
  • Copy - Copy file(s) from one place to another (to copy a whole directory, glob its contents with a wildcard)

This is just a small sample of what's available - check out the MSBuild task reference for a full list and more information about how each is used. It's worth noting that MSBuild targets can include multiple tasks one after another - and they'll all be executed in sequence.
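As a rough sketch of how these fit together: Reference, Compile, and EmbeddedResource entries sit in an ItemGroup, whereas tasks such as MakeDir, Copy, and Exec go inside a Target. All the file and directory names below (Program.cs, logo.png, README.md, dist) are made up for the example, and it assumes the C♯ targets have been imported as in the earlier project file:

```xml
<ItemGroup>
    <!-- Hypothetical files - these describe what goes into the build -->
    <Reference Include="System" />
    <Compile Include="Program.cs" />
    <EmbeddedResource Include="Resources\logo.png" />
</ItemGroup>

<Target Name="AfterBuild">
    <!-- Tasks run top to bottom, one after another -->
    <MakeDir Directories="$(ProjectDir)/dist" />
    <Copy SourceFiles="$(ProjectDir)/README.md" DestinationFolder="$(ProjectDir)/dist" />
    <Exec Command="echo Build finished" />
</Target>
```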

Unfortunately, if you try to specify a wildcard in an EmbeddedResource directive, both Visual Studio and MonoDevelop 'helpfully' try to auto-expand these wildcards to reference the actual files themselves. To avoid this, a clever hack can be instituted that uses the CreateItem task:

<CreateItem Include="$(ProjectDir)/obj/client_dist/**">
    <Output ItemName="EmbeddedResource" TaskParameter="Include" />
</CreateItem>

The above loops over all the files that are in $(ProjectDir)/obj/client_dist, and dynamically creates an EmbeddedResource directive for each - thereby preventing annoying auto-expansion.
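Since CreateItem is a task, it has to live inside a target. One way to wire it up - assuming the BeforeBuild target from earlier is a suitable hook for your project, which is a choice made here for illustration - might look like this:

```xml
<Target Name="BeforeBuild">
    <!-- Assumption: BeforeBuild runs early enough for the new items to be picked up by the build -->
    <CreateItem Include="$(ProjectDir)/obj/client_dist/**">
        <Output ItemName="EmbeddedResource" TaskParameter="Include" />
    </CreateItem>
</Target>
```

As with the earlier override, this would need to come after the Import of the C♯ targets in the project file.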

The $(ProjectDir) bit is a variable (or a property, in MSBuild's own terminology) - the final key component of MSBuild project files. There are a number of built-in variables for different things, but you can also define your own.

  • $(ProjectDir) - The current project's root directory
  • $(SolutionDir) - The root directory of the current solution. Undefined if a project is being built directly.
  • $(TargetDir) - The output directory that the result of the build will be written to.
  • $(Configuration) - The selected configuration to build. Most commonly Debug or Release.

As with the tasks, there's a full list available that you can check out for more information.
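Defining your own variable is done with a PropertyGroup. Here's a small sketch - the ClientDistDir name is hypothetical, and it assumes the C♯ targets are imported so that BeforeBuild gets called:

```xml
<PropertyGroup>
    <!-- Hypothetical custom variable, built from a built-in one -->
    <ClientDistDir>$(ProjectDir)/obj/client_dist</ClientDistDir>
</PropertyGroup>

<Target Name="BeforeBuild">
    <!-- Custom and built-in variables are read with the same $(...) syntax -->
    <Message Importance="high" Text="Building $(Configuration), client files in $(ClientDistDir)" />
</Target>
```

Variables can also be set when calling msbuild with the /p: switch (e.g. /p:Configuration=Release).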

That's the basics of how MSBuild works, as far as I'm aware at the moment. If I discover anything else of note, I'll post again in the future about what I've learnt. If this has helped, please comment below! If you're still confused, let me know in the comments too, and I'll try my best to help you out :-)

Want a tutorial on Git Submodules? Let me know by posting a comment below!

Art by Mythdael