Building a line-by-line lexer in C#
So there I was. It was a lazy afternoon before my final exam of the semester, and I was idly looking through some old code. One thing led to another, and I ended up writing a line-based scanning lexer in C# - and I thought I'd share it here, pulling it apart and putting it back together again.
The aim was to build something regular-expression based that would be flexible enough to be used in a wide range of applications - but not too difficult or confusing to pick up, use, and add to another project. The final code isn't currently available in a public repository (it's actually for a personal project - maybe I'll get around to refactoring it into a library if there's the demand), but I'll still post the finished code at the end of this post.
To start with, let's define some data classes to hold some input and output information. Hrm. Let's see - we'll need a class to represent a rule that the lexer will utilise, and one to represent the tokens that our lexer will be emitting. Let's deal with the one for the rules first:
public class LexerRule<TokenType>
{
public readonly TokenType Type;
public readonly Regex RegEx;
public bool Enabled { get; set; } = true;
public LexerRule(TokenType inName, string inRegEx)
{
if (!typeof(TokenType).IsEnum)
throw new ArgumentException($"Error: TokenType must be an enum - {typeof(TokenType)} passed");
Type = inName;
RegEx = new Regex(inRegEx);
}
public bool Toggle()
{
Enabled = !Enabled;
return Enabled;
}
}
Here I define a template (or generic) class that holds a regular expression, and associates it with a value from an enum. There's probably a better / cleaner way to make sure that TokenType is an enum (see the sketch below), but for now this should serve its purpose just fine. I also add a simple Enabled boolean property - as we'll be adding support for dynamically enabling and disabling rules later on.
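As an aside: if you're compiling with a new enough compiler (C# 7.3 or later), a cleaner way to enforce the enum requirement would be a generic constraint, which moves the check from runtime to compile time. A quick sketch of what that might look like:
// Sketch: with C# 7.3+, the compiler enforces the enum requirement for us,
// so the runtime IsEnum check in the constructor is no longer needed
public class LexerRule<TokenType> where TokenType : struct, Enum
{
	public readonly TokenType Type;
	public readonly Regex RegEx;
	public bool Enabled { get; set; } = true;

	public LexerRule(TokenType inName, string inRegEx)
	{
		Type = inName;
		RegEx = new Regex(inRegEx);
	}
}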
Next up, let's tackle the class for the tokens that we're going to be emitting:
public class LexerToken<TokenType>
{
public readonly bool IsNullMatch = false;
public readonly LexerRule<TokenType> Rule = null;
public readonly Match RegexMatch;
public TokenType Type {
get {
try {
return Rule.Type;
}
catch (NullReferenceException) {
return default(TokenType);
}
}
}
private string nullValueData;
public string Value {
get {
return IsNullMatch ? nullValueData : RegexMatch.Value;
}
}
public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
{
Rule = inRule;
RegexMatch = inMatch;
}
public LexerToken(string unknownData)
{
IsNullMatch = true;
nullValueData = unknownData;
}
public override string ToString()
{
return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
}
}
A little more complex, but still manageable. It, like its LexerRule cousin, is also a template (or generic) class. It holds the type of token it is, and the regular expression Match object generated during the scanning process. It also has something strange going on with Value and nullValueData - this is so that we can emit tokens with an 'unknown' type for the text in between that doesn't match any known rule. We'll be covering this later on.
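To make that a bit more concrete before we move on, here's a quick sketch of the two constructors in action. The Demo enum here is just something I've made up for illustration:
// A matched token wraps a rule and its Match; a null-match token wraps raw text
public enum Demo { Unknown = 0, Word }

LexerRule<Demo> rule = new LexerRule<Demo>(Demo.Word, @"\w+");
LexerToken<Demo> matched = new LexerToken<Demo>(rule, rule.RegEx.Match("hello"));
LexerToken<Demo> unknown = new LexerToken<Demo>("???");

Console.WriteLine(matched); // [LexerToken: Type=Word, Value=hello]
Console.WriteLine(unknown); // [LexerToken: Type=Unknown, Value=???]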
With our data classes in place, it's time to turn our attention to the lexer itself. Let's put together some scaffolding to get an idea as to how it's going to work:
public class Lexer<TokenType>
{
public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();
public int CurrentLineNumber { get; private set; } = 0;
public int CurrentLinePos { get; private set; } = 0;
public int TotalCharsScanned { get; private set; } = 0;
private StreamReader textStream;
public Lexer()
{
}
public void AddRule(LexerRule<TokenType> newRule);
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules);
public void Initialise(StreamReader reader);
public IEnumerable<LexerToken<TokenType>> TokenStream();
public void EnableRule(TokenType type);
public void DisableRule(TokenType type);
public void SetRule(TokenType type, bool state);
}
There - that should do the trick! CurrentLineNumber, CurrentLinePos, and TotalCharsScanned are properties to keep track of where we've got to, and textStream is the StreamReader we'll be reading data from. Then we've got some methods that will add new lexer rules to Rules, enable and disable rules by token type, a method to initialise the lexer with the correct textStream, and finally a generator method that will emit the tokens.
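Before we fill those methods out, here's a rough sketch of how I intend the finished class to be driven. The SimpleToken enum is one we'll define a bit further down, and the filename is made up:
// Hypothetical usage: add some rules, point the lexer at a stream, enumerate tokens
Lexer<SimpleToken> lexer = new Lexer<SimpleToken>();
lexer.AddRules(new List<LexerRule<SimpleToken>>() {
	new LexerRule<SimpleToken>(SimpleToken.Whitespace, @"\s+"),
	new LexerRule<SimpleToken>(SimpleToken.Comment, @"//[^\n]*")
});
lexer.Initialise(new StreamReader("input.txt"));

foreach (LexerToken<SimpleToken> token in lexer.TokenStream())
	Console.WriteLine(token);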
With our skeleton complete, let's fill out a few of those methods:
public void AddRule(LexerRule<TokenType> newRule)
{
Rules.Add(newRule);
}
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
{
Rules.AddRange(newRules);
}
public void Initialise(StreamReader reader)
{
textStream = reader;
}
public void EnableRule(TokenType type)
{
SetRule(type, true);
}
public void DisableRule(TokenType type)
{
SetRule(type, false);
}
public void SetRule(TokenType type, bool state)
{
foreach (LexerRule<TokenType> rule in Rules)
{
// We have to do a string comparison here because of the generic type we're using in multiple nested
// classes
if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
rule.Enabled = state;
return;
}
}
}
Very cool. None of this is particularly exciting - apart from SetRule. In SetRule we have to convert the type argument to a string in order to compare it to the rules in the Rules list, as C# won't let us compare two values of an unconstrained generic type parameter with == - even though rule.Type and type are both declared as TokenType! (There's a tidier way around this - see the sketch after the next few methods.) This did give me an idea for a trio of additional methods to make manipulating rules easier though:
public void EnableRulesByPrefix(string tokenTypePrefix)
{
SetRulesByPrefix(tokenTypePrefix, true);
}
public void DisableRulesByPrefix(string tokenTypePrefix)
{
SetRulesByPrefix(tokenTypePrefix, false);
}
public void SetRulesByPrefix(string tokenTypePrefix, bool state)
{
foreach (LexerRule<TokenType> rule in Rules)
{
// We have to do a string comparison here because of the generic type we're using in multiple nested
// classes
if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
{
rule.Enabled = state;
}
}
}
This set of methods lets us enable or disable rules based on what their names start with. For example, if I have the 3 rules CommentStart, CommentEnd, and FunctionStart, then calling EnableRulesByPrefix("Comment") will enable CommentStart and CommentEnd, but not FunctionStart.
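As promised, here's the tidier alternative to that string comparison. EqualityComparer<TokenType>.Default knows how to compare two values of a generic type parameter - enums included - so SetRule could be rewritten along these lines (the behaviour should be identical):
public void SetRule(TokenType type, bool state)
{
	foreach (LexerRule<TokenType> rule in Rules)
	{
		// EqualityComparer<T>.Default compares the enum values directly,
		// with no conversion to string required
		if (EqualityComparer<TokenType>.Default.Equals(rule.Type, type)) {
			rule.Enabled = state;
			return;
		}
	}
}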
With all the interfacing code out of the way, let's turn to the real meat of the subject: the TokenStream method. This is the method behind the magic. It's a bit complicated, so let's take it step-by-step. Firstly, we need to iterate over the lines in the StreamReader:
string nextLine;
List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
while ((nextLine = textStream.ReadLine()) != null)
{
CurrentLinePos = 0;
// .....
CurrentLineNumber++;
TotalCharsScanned += CurrentLinePos;
}
Fairly simple, right? I've used this construct a few times in the past. Before you ask, we'll get to matches in just a moment :-) Next, we need another while loop that iterates until we reach the end of the line:
while (CurrentLinePos < nextLine.Length)
{
matches.Clear();
foreach (LexerRule<TokenType> rule in Rules) {
if (!rule.Enabled) continue;
Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
if (!nextMatch.Success) continue;
matches.Add(new LexerToken<TokenType>(rule, nextMatch));
}
// .....
}
Also fairly easy to follow. Basically, we clear the matches list, and then attempt to find the next match from the current position on the line that we've reached (CurrentLinePos) for every rule - and we store all the successful matches for further inspection and processing. We also make sure we skip any disabled rules here, too.
If we don't find any matching rules, then that must mean that we can't match the rest of this line to any known token. In this case, we want to emit an unknown token:
if (matches.Count == 0) {
yield return new LexerToken<TokenType>(nextLine.Substring(CurrentLinePos));
break;
}
This is what that extra LexerToken constructor we created above is for. Note that we yield return here, instead of simply returning - this is very similar in construct to the yield statement in JavaScript that I've blogged about before, in that it allows you to maintain state inside a method and return multiple values in sequence.
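If you haven't met iterator methods in C# before, here's a tiny self-contained sketch of what yield return does:
// Each foreach iteration resumes this method right where the previous
// yield return left off - local state is preserved in between
public static IEnumerable<int> CountToThree()
{
	for (int i = 1; i <= 3; i++)
		yield return i;
}

// foreach (int number in CountToThree())
//     Console.WriteLine(number); // prints 1, then 2, then 3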
By an 'unknown token', I am referring to the default value of the TokenType enum. Here's an example enum you might use with this lexer class:
public enum SimpleToken {
Unknown = 0,
Whitespace,
Comment,
BlockOpen,
BlockClose,
}
Since the value Unknown is explicitly assigned the underlying value 0, we can be absolutely certain that it's the default value of this enum.
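You can check this for yourself quite easily:
// default(T) for an enum is always the member with the underlying value 0
Console.WriteLine(default(SimpleToken)); // prints: Unknown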
With our list of potential matches in hand, we next need to sort it in order to work out which one we should prioritise. After deliberating and experimenting a bit, I came up with this:
matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
// Sort by the offset of each match in the line
int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
// If they both start at the same position, then go with the longest one
if (result == 0)
result = b.RegexMatch.Length - a.RegexMatch.Length;
return result;
});
Basically, this sorts them so that the matches closest to the current scanning position on the line (CurrentLinePos) are prioritised. If 2 or more matches tie based on this criterion, we prioritise the longest match. This seems to work for now - I can always change it later if it becomes a problem :P
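As an aside, the Match object already records the position it matched at, so I suspect the comparison could be simplified to use Match.Index rather than searching the line all over again. Something along these lines ought to be equivalent:
matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
	// Match.Index is the position the regular expression actually
	// matched at, so no second IndexOf scan is needed
	int result = a.RegexMatch.Index - b.RegexMatch.Index;
	// If they both start at the same position, then go with the longest one
	if (result == 0)
		result = b.RegexMatch.Length - a.RegexMatch.Length;
	return result;
});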
With our matches sorted, we can now pick out the one we're going to emit next. Before we do so though, we need to take care of any characters between our current scanning position and the start of the next token:
LexerToken<TokenType> selectedToken = matches[0];
int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;
if (selectedTokenOffset > 0) {
	// Grab the skipped text *before* advancing the scanning position,
	// otherwise we'd cut the substring from the wrong place
	string extraTokenContent = nextLine.Substring(CurrentLinePos, selectedTokenOffset);
	CurrentLinePos += selectedTokenOffset;
	yield return new LexerToken<TokenType>(extraTokenContent);
}
This emits those additional characters as an unknown token, just as we did before. Finally, we can emit the token itself and continue on to the next iteration of the loop:
CurrentLinePos += selectedToken.RegexMatch.Length;
yield return selectedToken;
That concludes our TokenStream method - and with it this lexer! Here's the code in full:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
namespace SBRL.Tools
{
public class LexerRule<TokenType>
{
public readonly TokenType Type;
public readonly Regex RegEx;
public bool Enabled { get; set; } = true;
public LexerRule(TokenType inName, string inRegEx)
{
if (!typeof(TokenType).IsEnum)
throw new ArgumentException($"Error: TokenType must be an enum - {typeof(TokenType)} passed");
Type = inName;
RegEx = new Regex(inRegEx);
}
public bool Toggle()
{
Enabled = !Enabled;
return Enabled;
}
}
public class LexerToken<TokenType>
{
public readonly bool IsNullMatch = false;
public readonly LexerRule<TokenType> Rule = null;
public readonly Match RegexMatch;
public TokenType Type {
get {
try {
return Rule.Type;
}
catch (NullReferenceException) {
return default(TokenType);
}
}
}
private string nullValueData;
public string Value {
get {
return IsNullMatch ? nullValueData : RegexMatch.Value;
}
}
public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
{
Rule = inRule;
RegexMatch = inMatch;
}
public LexerToken(string unknownData)
{
IsNullMatch = true;
nullValueData = unknownData;
}
public override string ToString()
{
return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
}
}
public class Lexer<TokenType>
{
public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();
public bool Verbose { get; set; } = false;
/// <summary>
/// The number of the line that is currently being scanned.
/// </summary>
public int CurrentLineNumber { get; private set; } = 0;
/// <summary>
/// The number of characters on the current line that have been scanned.
/// </summary>
/// <value>The current line position.</value>
public int CurrentLinePos { get; private set; } = 0;
/// <summary>
/// The total number of characters currently scanned by this lexer instance.
/// Only updated every newline!
/// </summary>
public int TotalCharsScanned { get; private set; } = 0;
private StreamReader textStream;
public Lexer()
{
}
public void AddRule(LexerRule<TokenType> newRule)
{
Rules.Add(newRule);
}
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
{
Rules.AddRange(newRules);
}
public void Initialise(StreamReader reader)
{
textStream = reader;
}
public IEnumerable<LexerToken<TokenType>> TokenStream()
{
string nextLine;
List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
while ((nextLine = textStream.ReadLine()) != null)
{
CurrentLinePos = 0;
while (CurrentLinePos < nextLine.Length)
{
matches.Clear();
foreach (LexerRule<TokenType> rule in Rules) {
if (!rule.Enabled) continue;
Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
if (!nextMatch.Success) continue;
matches.Add(new LexerToken<TokenType>(rule, nextMatch));
}
if (matches.Count == 0) {
string unknownTokenContent = nextLine.Substring(CurrentLinePos);
if(Verbose) Console.WriteLine("[Unknown Token: No matches found for this line] {0}", unknownTokenContent);
yield return new LexerToken<TokenType>(unknownTokenContent);
break;
}
matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
// Sort by the offset of each match in the line
int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
// If they both start at the same position, then go with the longest one
if (result == 0)
result = b.RegexMatch.Length - a.RegexMatch.Length;
return result;
});
LexerToken<TokenType> selectedToken = matches[0];
int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;
if (selectedTokenOffset > 0) {
string extraTokenContent = nextLine.Substring(CurrentLinePos, selectedTokenOffset);
CurrentLinePos += selectedTokenOffset;
if(Verbose) Console.WriteLine("[Unmatched content] '{0}'", extraTokenContent);
yield return new LexerToken<TokenType>(extraTokenContent);
}
CurrentLinePos += selectedToken.RegexMatch.Length;
if(Verbose) Console.WriteLine(selectedToken);
yield return selectedToken;
}
if(Verbose) Console.WriteLine("[Lexer] Next line");
CurrentLineNumber++;
TotalCharsScanned += CurrentLinePos;
}
}
public void EnableRule(TokenType type)
{
SetRule(type, true);
}
public void DisableRule(TokenType type)
{
SetRule(type, false);
}
public void SetRule(TokenType type, bool state)
{
foreach (LexerRule<TokenType> rule in Rules)
{
// We have to do a string comparison here because of the generic type we're using in multiple nested
// classes
if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
rule.Enabled = state;
return;
}
}
}
public void EnableRulesByPrefix(string tokenTypePrefix)
{
SetRulesByPrefix(tokenTypePrefix, true);
}
public void DisableRulesByPrefix(string tokenTypePrefix)
{
SetRulesByPrefix(tokenTypePrefix, false);
}
public void SetRulesByPrefix(string tokenTypePrefix, bool state)
{
foreach (LexerRule<TokenType> rule in Rules)
{
// We have to do a string comparison here because of the generic type we're using in multiple nested
// classes
if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
{
rule.Enabled = state;
}
}
}
}
}
It's a bit much to take in all at once, but hopefully by breaking it down into steps I've made it easier to understand how I built it, and how all the different pieces fit together. The only real difference between the above code and the code I walked through in this post is the Verbose property I added for testing purposes, and the associated Console.WriteLine calls. For fun, here's a very basic LOGO lexer. I've based it on what I remember from using MSWLogo / FMSLogo a long time ago (there seem to be many dialects around these days):
public enum LogoToken
{
Unknown = 0,
Whitespace,
FunctionForwards,
FunctionBackwards,
FunctionLeft,
FunctionRight,
FunctionPenUp,
FunctionPenDown,
Number
}
public class LogoLexer : Lexer<LogoToken>
{
public LogoLexer()
{
AddRules(new List<LexerRule<LogoToken>>() {
new LexerRule<LogoToken>(LogoToken.Whitespace, @"\s+"),
new LexerRule<LogoToken>(LogoToken.FunctionForwards, @"FD"),
new LexerRule<LogoToken>(LogoToken.FunctionBackwards, @"BK"),
new LexerRule<LogoToken>(LogoToken.FunctionLeft, @"LT"),
new LexerRule<LogoToken>(LogoToken.FunctionRight, @"RT"),
new LexerRule<LogoToken>(LogoToken.FunctionPenUp, @"PU"),
new LexerRule<LogoToken>(LogoToken.FunctionPenDown, @"PD"),
new LexerRule<LogoToken>(LogoToken.Number, @"\d+"),
});
}
}
Here's an example LOGO program that it lexes:
FD 100 RT 90
FD 50 PU RT 180
BK 40 LT 45 FD 250
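For reference, a driver along these lines should reproduce the trace below - the filename is one I've made up, and Verbose mode takes care of all the printing:
LogoLexer lexer = new LogoLexer() { Verbose = true };
lexer.Initialise(new StreamReader("logo-example.txt"));

// Verbose mode prints every token as it's emitted,
// so all we have to do is drive the generator
foreach (LexerToken<LogoToken> token in lexer.TokenStream()) { }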
...and here's the output from lexing that example program:
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=100]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=90]
[Lexer] Next line
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=50]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionPenUp, Value=PU]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=180]
[Lexer] Next line
[LexerToken: Type=FunctionBackwards, Value=BK]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=40]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionLeft, Value=LT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=45]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=250]
[Lexer] Next line
[Lexer] Next line
Very cool! This could easily be extended to support more of the LOGO syntax. As an exercise, can you extend it to support the REPEAT statement? There's a sketch of a possible starting point just below. At some point in the future, I might go even further and build a bottom-up left-to-right shift-reduce parser, and combine it with this lexer and some BNF to create a parse tree.
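If you fancy having a go at the REPEAT exercise, here's a sketch of one possible starting point - just the extra lexing rules, with everything else left to you. Note that FunctionRepeat, BlockOpen, and BlockClose are new LogoToken values I've made up here, with syntax like REPEAT 4 [ FD 100 RT 90 ] in mind:
// Sketch: after adding FunctionRepeat, BlockOpen, and BlockClose to the
// LogoToken enum, the LogoLexer constructor needs 3 extra rules
AddRules(new List<LexerRule<LogoToken>>() {
	new LexerRule<LogoToken>(LogoToken.FunctionRepeat, @"REPEAT"),
	new LexerRule<LogoToken>(LogoToken.BlockOpen, @"\["),
	new LexerRule<LogoToken>(LogoToken.BlockClose, @"\]")
});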
Enjoyed this post? Don't quite understand something? Think you could do better? Post a comment below!