Summer Project Part 2: Random Number Analysis with Gnuplot

In my last post about my Masters Summer Project, I talked about my plans and what I'm doing. In this post, I want to talk about random number generator evaluation.

As part of the Arduino-based Internet of Things device that will be collecting the data, I need to generate high-quality random numbers in order to ensure that the unique ids I use in my project are both unpredictable and unique.

In order to generate such numbers, I've found a library that exploits the jitter in the inbuilt watchdog timer that's present in the Arduino Uno. It's got a manual which is worth a read, as it explains the concepts behind it quite well.

After some experimenting, I ended up with a program that would generate random numbers as fast as it could:

// Generate_Random_Numbers - This sketch makes use of the Entropy library
// to produce a sequence of random integers and floating point values.
// to demonstrate the use of the entropy library
//
// Copyright 2012 by Walter Anderson
//
// This file is part of Entropy, an Arduino library.
// Entropy is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// Entropy is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with Entropy.  If not, see <http://www.gnu.org/licenses/>.
// 
// Edited by Starbeamrainbowlabs 2019

#include "Entropy.h"

void setup() {
    Serial.begin(9600);

    // This routine sets up the watch dog timer with interrupt handler to maintain a
    // pool of real entropy for use in sketches. This mechanism is relatively slow
    // since it will only produce a little less than two 32-bit random values per
    // second.
    Entropy.initialize();
}

void loop() {
    uint32_t random_long;
    random_long = Entropy.random();
    Serial.println(random_long);
}

As you can tell, it's based on one of the examples. You may need to fiddle around with the imports in order to get it to work, because the Arduino IDE is terrible.

With this in place and uploaded to an Arduino, all I needed to do was log the serial console output to a file. Thankfully, this is actually really quite easy on Linux:

screen -S random-number-generator dd if=/dev/ttyACM0 of=random.txt bs=1

Since I connected the Arduino in question to a Raspberry Pi I have acting as a file server, I've included a screen call here that ensures that I can close the SSH session without it killing the command I'm executing - retaining the ability to 'reattach' to it later to check on it.
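
Checking on it later is then just a case of reattaching to the session (and detaching again with CTRL + A followed by D once you're done):

screen -r random-number-generator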

With it set off, I left it for a day or 2 until I had at least 1 MiB of random numbers. Once it was done, I ended up with a file looking a little like this:

216767155
986748290
455286059
1956258942
4245729381
3339111661
1821899502
3892736709
3658303796
2524261768
732282824
999812729
1312753534
2810553575
246363223
4106522438
260211625
1375011617
795481000
319056836

(Want more? Download the entire set here.)

In total, it generated 134318 numbers for me to play with, which should be plenty to graph their distribution.

Graphing such a large amount of numbers requires a special kind of program. Since I've used it before, I reached for Gnuplot.

A histogram is probably the best kind of graph for this purpose, so I looked up how to get gnuplot to draw one and tweaked it for my own purposes.

I didn't realise that you could do arbitrary calculations inside a Gnuplot graph definition file, but apparently you can. The important bit below is the bin_width variable:

set key off
set border 3

# Add a vertical dotted line at x=0 to show centre (mean) of distribution.
#set yzeroaxis

# Each bar is half the (visual) width of its x-range.
#set boxwidth 0.05 absolute
#set style fill solid 1.0 noborder

bin_width = 171797751.16;

bin_number(x) = floor(x / bin_width)

rounded(x) = bin_width * ( bin_number(x) + 0.5 )

set terminal png linewidth 3 size 1920,1080

plot 'random.txt' using (rounded($1)):(1) smooth frequency with boxes

It specifies the width of each bar on the graph. To work this out, we need to know the maximum number in the dataset. Then we can divide it by the target number of bins to get the width thereof. Time for some awk!

awk 'BEGIN { a = 0; b=999999999999999 } { if ($1>0+a) a=$1; if ($1 < 0+b) b=$1; } END { print(a, b); }' <random.txt

This looks like a bit of a mess, so let's unwind that awk script so that we can take a better look at it.

BEGIN {
    max = 0;
    min = 999999999999999
}
{
    if ($1 > max)
        max = $1;
    if ($1 < min)
        min = $1;
}
END {
    print("Max:", max, "Min:", min);
}

Much better. In short, it keeps a record of the maximum and minimum numbers it's seen so far, and updates them if it sees a better one. Let's run that over our random numbers:

awk -f minmax.awk <random.txt

Excellent. 4294943779 ÷ 25 = 171797751.16 - which is how I arrived at that value for the bin_width earlier.

Now we can render our graph:

gnuplot random.plt >histogram.png && optipng -o7 histogram.png

I always optimise images with either optipng or jpegoptim to save on storage space on the server, and bandwidth for readers - and in this case the difference was particularly noticeable. Here's the final graph:

The histogram generated by the above command.

As you can see, the number of numbers in each bin is pretty even, so we can reasonably conclude that the algorithm isn't too terrible.

What about uniqueness? Well, that's much easier to test than the distribution. If we count the numbers before and after removing duplicates, it should tell us how many duplicates there were. There's even a special command for it:

wc -l <random.txt 
134318
sort <random.txt | uniq | wc -l
134317
sort <random.txt | uniq --repeated
1349455381

Very interesting. Out of ~134K numbers, there's only a single duplicate! I'm not sure whether that's a good thing or not, as I haven't profiled any other random numbers generated in this manner before.

Since I'm planning on taking 1 reading a minute for at least a week (that's 10080 readings), I'm not sure that I'm going to get close to running into this issue - but it's always good to be prepared, I guess.

Found this interesting? Got a better way of doing it? Comment below!


Summer Project Part 1: LoRaWAN Signal Mapping!

What? A new series (hopefully)! My final project for my Masters in Science course at University is taking place this summer, and on the suggestion of Rob Miles I'll be blogging about it along the way.

In this first post, I'd like to talk a little bit about the project I've picked and my initial thoughts.

As you have probably guessed from the title of this post, the project I've picked is on mapping LoRaWAN signal coverage. I'm specifically going to look at that of The Things Network, a public LoRaWAN network. I've actually posted about LoRa before, so I'd recommend you go back and read that post first before continuing with this one.

The plan I've drawn up so far is to build an Internet of Things device with an Arduino and an RFM95 (a LoRa modem chip) to collect a bunch of data, which I'll then push through some sort of AI to fill in the gaps.

The University have been kind enough to fund some of the parts I'll need, so I've managed to obtain some of them already. This mainly includes:

  • Some of the power management circuitry
  • An Arduino Uno
  • A bunch of wires
  • A breadboard
  • A 9V battery holder (though I suspect I'll need a different kind of battery that can be recharged)
  • Some switches

(Above: The parts that I've collected already. I've still got a bunch of parts to go though.)

I've ordered the more specialised parts through my University, and they should be arriving soon.

I'll also need a project box to keep it all in if I can't gain access to the University's 3D printers, but I'll tackle that later.

For each message, I'll store a random id and the location it was sent from on a local microSD card. I'll transmit the location and the unique id via LoRaWAN, and the server will store it - along with the received signal strength from the different gateways that received the message.

Once a run is complete, I'll process the data and pair the local readings from the microSD card up with the ones the database has stored, so that we have readings from the 'black spots' where there isn't currently any coverage.

By using a unique random id instead of a timestamp, I can help preserve the privacy of the person carrying the device. Of course, I can't actually ask anyone to carry the device around until I've received ethical approval from the University to do so. I've already submitted the form, and I'm waiting to hear back on that.

While I'm waiting, I'm starting to set up the backend application server. I've decided to write it in Node.js using SQLite to store the data, so that if I want to do multiple separate runs to compare coverage before and after a gateway is installed, I can do so easily by just moving on to a new SQLite database file.

In the next post, I might talk a little bit about how I'm planning on generating the random ids. I'd like to do some research into the built-in random() function and how it compares to other unpredictable sources of randomness, such as comparing clocks.

shunction: Self-hosted Azure Functions!

It's not 1st April, but if it was - this post would be the perfect thing to release! It's about shunction, a parody project I've written. While it is a parody, hopefully it is of some use to someone :P

The other week, I discovered Azure Functions - thanks to Rob Miles. In the same moment, I thought that it would be fairly trivial to build a system by which you could self-host it - and shunction was born.

Built upon the lantern build engine, shunction lets you keep a folder of executable scripts and execute them at will. It supports 4 modes of operation:

  • adhoc - One-off runs
  • cron - Regularly scheduled execution
  • inotify - Execute when something changes on disk
  • http - Listen for HTTP requests

Of course, the system can be upgraded at a later date to support additional operating modes. The last 2 in the list start a persistent and long-running process, which will only exit if terminated.

Shunction is designed to run on Linux systems. Because of the shebang, any executable script (be it a shell script, a Python script, a Node.js script, or otherwise) is supported.

If you've got multiple tasks you want your server to run and want to keep them all in one place, then shunction would allow you to do so. For example, you could keep the task scripts in a git repository that you clone down onto the server, and then use shunction to execute them as needed.

Easy Node.JS Dependencies Updates

Once you've had a project around for a while, it's inevitable that dependency updates will become available. Unfortunately, npm (the Node Package Manager), while excellent at everything else, is completely terrible at notifying you about updates.

The solution to this is, of course, to use an external tool. Personally, I use npm-check, which is also installable via npm. It shows you a list of updates to your project's dependencies, like so:

(Can't see the above? View it directly on asciinema.org)

It even supports the packages that you've installed globally too, which no other tool appears to do as far as I can tell (although it does appear to miss some packages, such as npm and itself). To install it, simply do this:

sudo npm install --global npm-check

Once done, you can then use it like this:

# List updates, but don't install them
npm-check
# Interactively choose the updates to install
npm-check -u
# Interactively check the globally-installed packages
sudo npm-check -gu

The tool also checks to see which of the dependencies are actually used, and prompts you to check the dependencies it thinks you're not using (it doesn't always get it right, so check carefully yourself before removing!). There's an argument to disable this behaviour:

npm-check --skip-unused

Speaking of npm package dependencies, the other big issue is security vulnerabilities. GitHub have recently started giving maintainers of projects notifications about security vulnerabilities in their dependencies, which I think is a brilliant idea.

Actually fixing said vulnerabilities is a whole other issue though. If you don't want to update all the dependencies of a project to the latest version (perhaps you're just doing a one-off contribution to a project or aren't very familiar with the codebase yet), there's another tool to help out - this time built into npm: the npm audit subcommand.

# Check for reported security issues in your dependencies
npm audit
# Attempt to fix said issues by updating packages *just* enough
npm audit fix

(Can't see the above? View it directly on asciinema.org)

This helps out a ton with contributing to other projects. Issues arise when the vulnerabilities are not in packages you directly depend on, but instead in packages that you indirectly depend on via dependencies of the packages you've installed.

Thankfully the vulnerabilities in the above can all be traced back to development dependencies (and aren't essential for Pepperminty Wiki itself), but it's rather troubling that I can't install updated packages because the packages I depend on haven't updated their dependencies.

I guess I'll be sending some pull requests to update some dependencies! To help me in this, the output of npm audit even displays the dependency graph of why a package is installed. If this isn't enough though, there's always npm-why, which, given a package name, will figure out why it's installed.
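
For example, to find out why a package has ended up in your node_modules (lodash here is just an arbitrary example, and the exact npm-why invocation may vary slightly between versions):

# Built into npm: show the dependency paths that pull the package in
npm ls lodash
# With npm-why installed globally
npm-why lodash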

Found this interesting? Got a better solution? Comment below!

Generating Atom 1.0 Feeds with PHP (the proper way)

I've generated Atom feeds in PHP before, but recently I went on the hunt to discover if PHP has something like C♯'s XMLWriter class - and it turns out it does! Although poorly documented (probably why I didn't find it in the first place :P), it's actually quite logical and easy to pick up.

To this end, I thought I'd blog about how I used it to write the Atom 1.0 feed generator for recent changes on Pepperminty Wiki that I've recently implemented (coming soon in the next release!), as it's so much cleaner than atom.gen.php that I blogged about before! It's safer too - as all the escaping is handled automatically by PHP - so there's no risk of an injection attack because I forgot to escape a character in my library code.

It ends up being a bit verbose, but a few helper methods (or a wrapper class?) should alleviate this nicely - I might refactor it at a later date.
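
For example, a little helper like this (just a sketch of the kind of thing I mean - it isn't part of Pepperminty Wiki) would cut down on the repetition for elements that only need a few attributes and a bit of text:

function write_element(XMLWriter $xml, string $name, array $attributes = [], string $text = null) {
    $xml->startElement($name);
    foreach($attributes as $key => $value)
        $xml->writeAttribute($key, $value);
    if($text !== null)
        $xml->text($text);
    $xml->endElement();
}

// The <generator /> element we'll meet in a moment then becomes a single call:
write_element($xml, "generator", [
    "uri" => "https://github.com/sbrl/Pepperminty-Wiki/",
    "version" => $version
], "Pepperminty Wiki");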

To begin, we need to create an instance of the aptly-named XMLWriter class. It's probable that you'll need the php-xml package installed in order to use this.

$xml = new XMLWriter();
$xml->openMemory();
$xml->setIndent(true); $xml->setIndentString("\t");
$xml->startDocument("1.0", "utf-8");

In short, the above creates a new XMLWriter instance, instructs it to write the XML to memory, enables pretty-printing of the output, and writes the standard XML document header.

With the document created, we can begin to generate the Atom feed. To figure out the format (I couldn't remember from when I wrote atom.gen.php - that was ages ago :P), I ended up following this helpful guide on atomenabled.org. It seems familiar - I think I might have used it when I wrote atom.gen.php. To start, we need something like this:

<feed xmlns="http://www.w3.org/2005/Atom">
    ......
</feed>

In PHP, that translates to this:

$xml->startElement("feed");
$xml->writeAttribute("xmlns", "http://www.w3.org/2005/Atom");

$xml->endElement(); // </feed>

Next, we probably want to advertise how the Atom feed was generated. Useful for letting the curious know what a website is powered by, and for spotting usage of your code out in the wild!

Since I'm writing this for Pepperminty Wiki, I'm settling on something not unlike this:

<generator uri="https://github.com/sbrl/Pepperminty-Wiki/" version="v0.18-dev">Pepperminty Wiki</generator>

In PHP, this translates to this:

$xml->startElement("generator");
$xml->writeAttribute("uri", "https://github.com/sbrl/Pepperminty-Wiki/");
$xml->writeAttribute("version", $version); // A variable defined elsewhere in Pepperminty Wiki
$xml->text("Pepperminty Wiki");
$xml->endElement();

Next, we need to add a <link rel="self" /> tag. This informs clients as to where the feed was fetched from, and the canonical URL of the feed. I've done this:

$xml->startElement("link");
$xml->writeAttribute("rel", "self");
$xml->writeAttribute("type", "application/atom+xml");
$xml->writeAttribute("href", full_url());
$xml->endElement();

That full_url() function is from StackOverflow, and calculates the full URI that was used to make a request. As Pepperminty Wiki can be run in any directory on any server, I can't pre-determine this url - hence the complexity.

Note also that I output type="application/atom+xml" here. This specifies the type of content that can be found at the supplied URL. The idea here is that if you represent the same data in different ways, you can advertise them all in a standard way, with other formats having rel="alternate". Pepperminty Wiki does this - generating the recent changes list in HTML, CSV, and JSON in addition to the new Atom feed I'm blogging about here (the idea is to make the wiki data as accessible and easy-to-parse as possible). Let's advertise those too:

$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");
$xml->writeAttribute("type", "text/html");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=html");
$xml->endElement();

$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");
$xml->writeAttribute("type", "application/json");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=json");
$xml->endElement();

$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");
$xml->writeAttribute("type", "text/csv");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=csv");
$xml->endElement();

Before we can output the articles themselves, there are a few more pieces of metadata left on our laundry list - namely <updated />, <id />, <icon />, <title />, and <subtitle />. There are others in the documentation too, but they aren't essential (as far as I can tell) - and not appropriate in this specific case. Here's what they might look like:

<updated>2019-02-02T21:23:43+00:00</updated>
<id>https://wiki.bobsrockets.com/?action=recent-changes&amp;format=atom</id>
<icon>https://wiki.bobsrockets.com/rocket_logo.png</icon>
<title>Bob's Wiki - Recent Changes</title>
<subtitle>Recent Changes on Bob's Wiki</subtitle>

The <updated /> tag specifies when the feed was last updated. It's unclear as to whether it's the date/time the last change was made to the feed or the date/time the feed was generated, so I've gone with the latter. If this isn't correct, please let me know and I'll change it.

The <id /> element can contain anything, but it must be a globally-unique string that identifies this feed. I've seen other feeds use the canonical url - and I've gone to the trouble of calculating it for the <link rel="self" /> - so it seems a shame to not use it here too.

The remaining elements (<icon />, <title />, and <subtitle />) are pretty self-explanatory - although it's worth mentioning that apparently the icon must be square. Let's whip those up with some more PHP:

$xml->writeElement("updated", date(DateTime::ATOM));
$xml->writeElement("id", full_url());
$xml->writeElement("icon", $settings->favicon);
$xml->writeElement("title", "$settings->sitename - Recent Changes");
$xml->writeElement("subtitle", "Recent Changes on $settings->sitename");

PHP even has a constant (DateTime::ATOM) for generating a date string in the correct format required by the spec :D $settings is an object containing the wiki settings that's a parsed form of peppermint.json, and contains useful things like the wiki's name, icon, etc.

Finally, with all the preamble done, we can turn to the articles themselves. In the case of Pepperminty Wiki, the final result will look something like this:

<entry>
    <title type="text">Edit to Compute Core by Sean</title>
    <id>https://seanssatellites.co.uk/wiki/?page=Compute%20Core</id>
    <updated>2019-01-29T10:21:43+00:00</updated>
    <content type="html">&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  Sean&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Compute Core&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Tue, 29 Jan 2019 10:21:43 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 1.36kb&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +1&lt;/li&gt;
&lt;/ul&gt;</content>
    <link rel="alternate" type="text/html" href="https://seanssatellites.co.uk/wiki/?page=Compute%20Core"/>
    <author>
        <name>Sean</name>
        <uri>https://seanssatellites.co.uk/wiki/?page=Users%2FSean</uri>
    </author>
</entry>

There are a bunch of elements here that deserve attention:

  • <title /> - The title of the article. Easy peasy!
  • <id /> - Just like the id of the feed itself, each article entry needs an id too. Here I've followed the same system I used for the feed, and given the url of the page content.
  • <updated /> - The last time the article was updated. Since this is part of a feed of recent changes, I've got this information readily at hand.
  • <content /> - The content to display. If the content is HTML, it must be escaped and type="html" present to indicate this.
  • <link rel="alternate" /> Same deal as above, but on an article-by-article level. In this case, it should link to the page the article content is from. In this case, I link to the page & revision of the change in question. In other cases, you might link to the blog post in question for example.
  • <author /> - Can contain <name />, <uri />, and <email />, and should indicate the author of the content. In this case, I use the name of the user that made the change, along with a link to their user page.

Here's all that in PHP:

foreach($recent_changes as $recent_change) {
    if(empty($recent_change->type))
        $recent_change->type = "edit";

    $xml->startElement("entry");

    // Change types: revert, edit, deletion, move, upload, comment
    $type = $recent_change->type;
    $url = "$full_url_stem?page=".rawurlencode($recent_change->page);

    $content = ".......";

    $xml->startElement("title");
    $xml->writeAttribute("type", "text");
    $xml->text("$type $recent_change->page by $recent_change->user");
    $xml->endElement();

    $xml->writeElement("id", $url);
    $xml->writeElement("updated", date(DateTime::ATOM, $recent_change->timestamp));

    $xml->startElement("content");
    $xml->writeAttribute("type", "html");
    $xml->text($content);
    $xml->endElement();

    $xml->startElement("link");
    $xml->writeAttribute("rel", "alternate");
    $xml->writeAttribute("type", "text/html");
    $xml->writeAttribute("href", $url);
    $xml->endElement();

    $xml->startElement("author");
    $xml->writeElement("name", $recent_change->user);
    $xml->writeElement("uri", "$full_url_stem?page=".rawurlencode("$settings->user_page_prefix/$recent_change->user"));
    $xml->endElement();

    $xml->endElement();
}

I've omitted the logic that generates the value of the <content /> tag, as it's not really relevant here (you can check it out here if you're curious :D).

This about finishes the XML we need to generate for our feed. To extract the XML from the XMLWriter, we can do this:

$atom_feed = $xml->flush();

Then we can do whatever we want to with the generated XML!
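
In the case of a feed like this one, that most likely means sending it straight to the client with the correct content type - something along these lines (a simplified sketch, not the exact code Pepperminty Wiki uses):

header("content-type: application/atom+xml");
echo($atom_feed);
exit();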

When the latest version of Pepperminty Wiki comes out, you'll be able to see a live demo here! Until then, you'll need to download a copy of the latest master version and experiment with it yourself. I'll also include a complete demo feed below:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <generator uri="https://github.com/sbrl/Pepperminty-Wiki/" version="v0.18-dev">Pepperminty Wiki</generator>
    <link rel="self" type="application/atom+xml" href="http://[::]:35623/?action=recent-changes&amp;format=atom&amp;count=3"/>
    <link rel="alternate" type="text/html" href="http://[::]:35623/?action=recent-changes&amp;format=html"/>
    <link rel="alternate" type="application/json" href="http://[::]:35623/?action=recent-changes&amp;format=json"/>
    <link rel="alternate" type="text/csv" href="http://[::]:35623/?action=recent-changes&amp;format=csv"/>
    <updated>2019-02-03T17:25:10+00:00</updated>
    <id>http://[::]:35623/?action=recent-changes&amp;format=atom&amp;count=3</id>
    <icon>data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAMAAAAoLQ9TAAAB3VBMVEXhERHbKCjeVVXjb2/kR0fhKirdHBziDg6qAADaHh7qLy/pdXXUNzfMAADYPj7ZPDzUNzfbHx/fERHpamrqMTHgExPdHx/bLCzhLS3fVFTjT0/ibm7kRkbiLi7aKirdISHeFBTqNDTpeHjgERHYJCTVODjYQkLaPj6/AADVOTnpbW3cIyPdFRXcJCThMjLiTU3ibW3fVVXaKyvcERH4ODj+8fH/////fHz+Fxf4KSn0UFD/CAj/AAD/Xl7/wMD/EhL//v70xMT/+Pj/iYn/HBz/g4P/IyP/Kyv/7Oz0QUH/9PT/+vr/ior/Dg7/vr7/aGj/QED/bGz/AQH/ERH/Jib/R0f/goL/0dH/qan/YWH/7e3/Cwv4R0f/MTH/enr/vLz/u7v/cHD/oKD/n5//aWn+9/f/k5P/0tL/trb/QUH/cXH/dHT/wsL/DQ3/p6f/DAz/1dX/XV3/kpL/i4v/Vlb/2Nj/9/f/pKT+7Oz/V1f/iIj/jIz/r6//Zmb/lZX/j4//T0//Dw/4MzP/GBj/+fn/o6P/TEz/xMT/b2//Tk7/OTn/HR3/hIT/ODj/Y2P/CQn/ZGT/6Oj0UlL/Gxv//f3/Bwf/YmL/6+v0w8P/Cgr/tbX0QkL+9fX4Pz/qNzd0dFHLAAAAAXRSTlMAQObYZgAAAAFiS0dEAIgFHUgAAAAJcEhZcwAACxMAAAsTAQCanBgAAAAHdElNRQfeCxINNSdmw510AAAA5ElEQVQYGQXBzSuDAQCA8eexKXOwmSZepa1JiPJxsJOrCwcnuchBjg4O/gr7D9zk4uAgJzvuMgcTpYxaUZvSm5mUj7TX7ycAqvoLIJBwStVbP0Hom1Z/ejoxrbaR1Jz6nWinbKWttGRgMSSjanPktRY6mB9WtRNTn7Ilh7LxnNpKq2/x5LnBitfz+hx0qxUaxhZ6vwqq9bx6f2XXvuUl9SVQS38NR7cvln3v15tZ9bQpuWDtZN3Lgh5DWJex3Y+z1KrVhw21+CiM74WZo83DiXq0dVBDYNJkFEU7WrwDAZhRtQrwDzwKQbT6GboLAAAAAElFTkSuQmCC</icon>
    <title>Pepperminty Wiki - Recent Changes</title>
    <subtitle>Recent Changes on Pepperminty Wiki</subtitle>
    <entry>
        <title type="text">Edit to Internal link by admin</title>
        <id>http://[::]:35623/?page=Internal%20link</id>
        <updated>2019-01-29T19:55:08+00:00</updated>
        <content type="html">&lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  admin&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Internal link&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Tue, 29 Jan 2019 19:55:08 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 2.11kb&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +2007&lt;/li&gt;
&lt;/ul&gt;</content>
        <link rel="alternate" type="text/html" href="http://[::]:35623/?page=Internal%20link"/>
        <author>
            <name>admin</name>
            <uri>http://[::]:35623/?page=Users%2FInternal%20link</uri>
        </author>
    </entry>
    <entry>
        <title type="text">Edit to Main Page by admin</title>
        <id>http://[::]:35623/?page=Main%20Page</id>
        <updated>2019-01-05T20:14:07+00:00</updated>
        <content type="html">&lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  admin&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Main Page&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Sat, 05 Jan 2019 20:14:07 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 317b&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +68&lt;/li&gt;
&lt;/ul&gt;</content>
        <link rel="alternate" type="text/html" href="http://[::]:35623/?page=Main%20Page"/>
        <author>
            <name>admin</name>
            <uri>http://[::]:35623/?page=Users%2FMain%20Page</uri>
        </author>
    </entry>
    <entry>
        <title type="text">Edit to Main Page by admin</title>
        <id>http://[::]:35623/?page=Main%20Page</id>
        <updated>2019-01-05T17:53:08+00:00</updated>
        <content type="html">&lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  admin&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Main Page&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Sat, 05 Jan 2019 17:53:08 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 249b&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +31&lt;/li&gt;
&lt;/ul&gt;</content>
        <link rel="alternate" type="text/html" href="http://[::]:35623/?page=Main%20Page"/>
        <author>
            <name>admin</name>
            <uri>http://[::]:35623/?page=Users%2FMain%20Page</uri>
        </author>
    </entry>
</feed>

...this is from my local development instance.

Found this interesting? Confused about something? Want to say hi? Comment below!

Generating word searches for fun and profit

(Above: A Word search generated with the tool below)

A little while ago I was asked about generating a wordsearch in a custom shape. I thought to myself "someone has to have built this before...", and while I was right to an extent, I couldn't find one that let you use any shape you liked.

This, of course, was unacceptable! You've probably guessed it by now, but I went ahead and wrote my own :P

While I wrote it a little while ago, I apparently never got around to posting about it on here.

In short, it works by using an image you drop into the designated area on the page as the shape the word search should take. Each pixel is a single cell of the word search - with the alpha channel representing whether or not a character is allowed to be placed there (transparent means that it can't contain a character, and opaque means that it can).

Creating such an image is simple. Personally, I recommend Piskel or GIMP for this purpose.

Once done, you can start building a wordlist in the wordlist box at the right-hand-side. It should rebuild the word search as soon as you click out of the box. If it doesn't, then you've found a bug! Please report it here.

With the word search generated, you can use the Question Sheet and Answer Sheet links to open printable versions for export.

You can find my word search generator here:

I've generated a word search of the current tags in the tag cloud on this blog too: Question Sheet [50.3KiB], Answer Sheet [285.6KiB]

The most complicated part of this was probably the logistics behind rude word removal. Thankfully, I didn't have to find and maintain a list of rude words myself, as the futility npm package does this for me, but algorithmically guaranteeing that by censoring 1 rude word another is not accidentally created in another direction is a nasty problem.

If you're interested in a more technical breakdown of one (or several!) particular aspects of this - let me know! While writing about all of it would probably make for an awfully long post, a specific aspect or two should be more manageable.

In the future, I'll probably revisit this and add additional features to it, such as the ability to restrict which directions words are placed in, for example. If you've got a suggestion of your own, open an issue (or even better, open a pull request :D)!

Demystifying Inverted Indexes

The test texts below overlaying one another in different colours, with a magnifying glass centred on top. (The magnifying glass in the above banner came from openclipart)

After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own.

First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server.

One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post.

First though, we need some data to index. How about the descriptions of some video games?

Kerbal Space Program

In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science!

 

Cross Code

Meet Lea as she logs into an MMO of the distant future. Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game.

 

Fort Meow

Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.

 

Factory Balls

Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it!

Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this:

Tokens:
factory balls is the brand new bonte game
i already announced yesterday factory balls takes
part in the game design competition over
at jayisgames the goal of the design
competition was to create a ball physics
themed game i hope you enjoy it

Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory and factory to be the same word - otherwise when querying the index we'd have to remember to get the casing correct.

With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart I guess, except we'll be including the offset in the text of every token in the list. We'll also be removing some of the most common words in the list that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above:

Token Frequency Offsets
factory 2 0, 12
balls 2 1, 13
brand 1 4
new 1 5
bonte 1 6
game 3 7, 18, 37
i 2 8, 38
announced 1 10
yesterday 1 11
takes 1 14
design 2 19, 28
competition 2 20, 29
jayisgames 1 23
goal 1 25
create 1 32
ball 1 34
physics 1 35
themed 1 36
hope 1 39
enjoy 1 41
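
If you fancy generating an index like the one above programmatically, a rough sketch in PHP might look something like this (a simplified illustration - Pepperminty Wiki's real implementation handles rather more edge cases):

function tokenize(string $source) : array {
    // Strip punctuation, lowercase, and split on whitespace
    $source = strtolower(preg_replace("/[^a-z0-9\s]/i", " ", $source));
    return preg_split("/\s+/", trim($source));
}

function index(array $tokens, array $stop_words) : array {
    $result = [];
    foreach($tokens as $offset => $token) {
        if(in_array($token, $stop_words)) continue;
        if(!isset($result[$token]))
            $result[$token] = [ "freq" => 0, "offsets" => [] ];
        $result[$token]["freq"]++;
        $result[$token]["offsets"][] = $offset;
    }
    return $result;
}

// e.g. $index = index(tokenize($page_source), $stop_words);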

Very cool. Now we can generate an index for each page's content. The next step is to turn this into an inverted index. Basically, the difference between the normal index and an inverted index is that an entry in an inverted index contains not just the offsets for a single page, but all the pages that contain that token. For example, the Cross-Code example above also contains the token game, so the inverted index entry for game would contain a list of offsets for both the Factory Balls and Cross-Code pages.

Including the names of every page under every different token in the inverted index would both be computationally inefficient and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now:

Id Page Name
1 Kerbal Space Program
2 Cross Code
3 Fort Meow
4 Factory Balls

There - much better. In Pepperminty Wiki, this is handled by the ids class, which has a pair of public methods: getid($pagename) and getpagename($id). If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null should be returned.
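
To illustrate the idea, the behaviour described above might be sketched like so (this is not the actual ids class - just the general shape of it):

class ids {
    private static $index = []; // id => page name

    public static function getid(string $pagename) : int {
        $id = array_search($pagename, self::$index);
        if($id !== false) return $id;
        // Not seen before - transparently assign the next free id
        $id = count(self::$index) + 1;
        self::$index[$id] = $pagename;
        return $id;
    }

    public static function getpagename(int $id) {
        return self::$index[$id] ?? null; // null if the id is unknown
    }
}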

Now that we've got ids for our pages, let's look at generating that inverted index entry for game we talked about above. Here it is:

  • Term: game
Id Frequency Offsets
2 1 31
3 1 5
4 3 5, 12, 23

Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game.

This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand).

Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post.

You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, generating contexts based on the inverted index too?

Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below!

Quest Get: Search large amounts of code!


(Above: A map of the Linux Kernel source code. Source: this post on medium.)

Recently I was working on a little project of mine (nope, not this for once! :P), and I needed a C♯ class I'd written a while ago. Being forgetful as I am, I had no idea which of my projects I'd written it for. And so the quest began to find it! I did in the end, but it left me wondering whether there was a better way to search all my code quickly. This post is the culmination of everything I've discovered so far about the process of searching one's code.

Before I started, I already knew about grep, which is built into almost every Linux system around. It's even available for Windows via the MSYS Tools. Unfortunately though, despite its prevalence, it's not particularly good at searching large numbers of git repositories, as it keeps descending into the .git folder and displaying a whole load of useless results.
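
(GNU grep can be persuaded to skip such directories, but you have to remember to spell out every directory you want ignored each and every time - which gets old fast:)

grep -r --exclude-dir=.git --exclude-dir=node_modules "some search term" ~/projects/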

Something had to change. After asking reddit, I was introduced to OpenGrok. Written in Java, it indexes all of your code, and provides a web interface through which you can search it. Very nice. Unfortunately, I had trouble figuring out the logistics of actually getting it to run - and discovered that it takes multiple hours to set up correctly.

Moving on, I was re-introduced to ack. Written in plain-old Perl, it apparently runs on practically any system that Perl does - though it's not installed by default like grep is. Looking into it, I found it to be much like grep - only smarter. It ignores version control directories (like the .git folder) and common package folders (like node_modules) by default, and even has a system by which results can be filtered by language (with support for hash-bangs too!). The results themselves are coloured by default - making it easy to skim through quickly. Coupled with the flexible configuration file system, ack makes for a wonderfully flexible way to search through large amounts of code quickly.
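
Basic usage looks just like grep - the search term and paths here are just examples, of course:

# Recursively search the current directory, skipping .git, node_modules, etc. automatically
ack "full_url"
# Restrict the results to a single language
ack --php "full_url" ~/projects/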

Though ack looks good, I still didn't have a way to search through all my code that's scattered across multiple devices at once, so I kept looking. The next project I found (through AlternativeTo, actually) was Text Sherlock. It positions itself as an alternative to OpenGrok that's much simpler to configure.

True to its word, I managed to get a test instance set up running from my /tmp directory in 15 minutes - though it did take a while to index the code I had locally. It also took several seconds to consult its index when I entered a query. I suspect I could alleviate both of these issues by installing Xapian (an open-source high-performance search library), which it appears to have support for.

While the interface was cool, it didn't appear to allow me to tell it which directories not to index, so it ended up trawling through all my .git directories - just like grep did. It also doesn't appear to be multi-threaded - so it took much longer to index my code than it really needed to (I've got a solid-state drive and enough RAM for a few GBs of cache, so the indexing operation was CPU-bound, not I/O-bound).

In the end, I've rediscovered the awesome search tool ack, and taken a look at the current state of code search tools today. While I haven't yet found precisely what I'm looking for, I'm further forward than when I started.

Other honourable mentions include GNU Global (which apparently needs several GiBs per ~300MiB of source code for its generated static HTML web interface), insight.io (an IDE-like freemium cloud product that 'understands your code'), CodeQuery (only supports C, C++, Java, Python, Ruby, Javascript, and Go), and ripgrep (rust-based program, similar to ack and grep, feature comparison). The official ack website has a good page that contains more tools that are worth a look, too.

Got a cool way to search through all your code? Did this help you out? Comment below!

Building a line-by-line lexer in C#

So there I was. It was a lazy afternoon before my final exam of the semester, and I was idly looking through some old code. One thing led to another, and I ended up writing a line-based scanning lexer in C# - and I thought I'd share it here, pulling it apart and putting it back together again.

The aim was to build something regular expression based, that would be flexible enough that it could be used in a wide-range of applications - but not too difficult or confusing to pick up, use and add to another project. The final code isn't currently available in a public repository (it's actually for a personal project - maybe I'll get around to refactoring it into a library if there's the demand), but I'll still post the finished code at the end of this post.

To start with, let's define some data classes to hold some input and output information. Hrm. Let's see - we'll need a class to represent a rule that the lexer will utilise, and one to represent the tokens that our lexer will be emitting. Let's deal with the one for the rules first:

public class LexerRule<TokenType>
{
    public readonly TokenType Type;
    public readonly Regex RegEx;
    public bool Enabled { get; set; } = true;

    public LexerRule(TokenType inName, string inRegEx)
    {
        if (!typeof(TokenType).IsEnum)
            throw new ArgumentException($"Error: inName must be an enum - {typeof(TokenType)} passed");

        Type = inName;
        RegEx = new Regex(inRegEx);
    }

    public bool Toggle()
    {
        Enabled = !Enabled;
        return Enabled;
    }
}

Here I define a template (or generic) class that holds a regular expression, and associates it with a value from an enum. There's probably a better / cleaner way to make sure that TokenType is an enum, but for now this should serve its purpose just fine. I also add a simple Enabled boolean property - as we'll be adding support for dynamically enabling and disabling rules later on.
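
(As an aside: if you're on C♯ 7.3 or newer, the compiler can enforce this for you with a generic constraint instead of the runtime check - though I've stuck with the runtime check in the code for this post:)

// C♯ 7.3+ only: constrain TokenType to be an enum at compile time
public class LexerRule<TokenType> where TokenType : struct, Enum
{
    // ...the rest of the class as above, minus the IsEnum check in the constructor
}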

Next up, let's tackle the class for the tokens that we're going to be emitting:

public class LexerToken<TokenType>
{
    public readonly bool IsNullMatch = false;
    public readonly LexerRule<TokenType> Rule = null;
    public readonly Match RegexMatch;

    public TokenType Type {
        get {
            try {
                return Rule.Type;
            }
            catch (NullReferenceException) {
                return default(TokenType);
            }
        }
    }
    private string nullValueData;
    public string Value {
        get {
            return IsNullMatch ? nullValueData : RegexMatch.Value;
        }
    }

    public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
    {
        Rule = inRule;
        RegexMatch = inMatch;
    }
    public LexerToken(string unknownData)
    {
        IsNullMatch = true;
        nullValueData = unknownData;
    }

    public override string ToString()
    {
        return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
    }
}

A little more complex, but still manageable. It, like its LexerRule cousin, is also a template (or generic) class. It holds the type of token it is and the regular expression Match object generated during the scanning process. It also has something strange going on with Value and nullValueData - this is so that we can emit tokens with an 'unknown' type (more on that later) for the text in between that doesn't match any known rule. We'll be covering this later too.

With our data classes in place, it's time to turn our attention to the lexer itself. Let's put together some scaffolding to get an idea as to how it's going to work:

public class Lexer<TokenType>
{
    public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

    public int CurrentLineNumber { get; private set; } = 0;
    public int CurrentLinePos { get; private set; } = 0;
    public int TotalCharsScanned { get; private set; } = 0;

    private StreamReader textStream;

    public Lexer()
    {

    }

    public void AddRule(LexerRule<TokenType> newRule);
    public void AddRules(IEnumerable<LexerRule<TokenType>> newRules);

    public void Initialise(StreamReader reader);

    public IEnumerable<LexerToken<TokenType>> TokenStream();

    public void EnableRule(TokenType type);
    public void DisableRule(TokenType type);
    public void SetRule(TokenType type, bool state);
}

There - that should do the trick! CurrentLineNumber, CurrentLinePos, and TotalCharsScanned are properties to keep track of where we've got to, and textStream is the StreamReader we'll be reading data from. Then, we've got some methods that will add new lexer rules to Rules, enable and disable rules by token type, a method to initialise the lexer with the correct textStream, and finally a generator method that will emit the tokens.

With our skeleton complete, let's fill out a few of those methods:

public void AddRule(LexerRule<TokenType> newRule)
{
    Rules.Add(newRule);
}
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
{
    Rules.AddRange(newRules);
}

public void Initialise(StreamReader reader)
{
    textStream = reader;
}

public void EnableRule(TokenType type)
{
    SetRule(type, true);
}
public void DisableRule(TokenType type)
{
    SetRule(type, false);
}
public void SetRule(TokenType type, bool state)
{
    foreach (LexerRule<TokenType> rule in Rules)
    {
        // We have to do a string comparison here because of the generic type we're using in multiple nested
        // classes
        if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
            rule.Enabled = state;
            return;
        }
    }
}

Very cool. None of this is particularly exciting - apart from SetRule. In SetRule we have to convert the type argument to a string in order to compare it to the rules in the Rules list, as C♯ doesn't seem to understand that the TokenType on the LexerRule class is the same as the TokenType on the Lexer class - even though they have the same name! This did give me an idea for a trio of additional methods to make manipulating rules easier though:

public void EnableRulesByPrefix(string tokenTypePrefix)
{
    SetRulesByPrefix(tokenTypePrefix, true);
}
public void DisableRulesByPrefix(string tokenTypePrefix)
{
    SetRulesByPrefix(tokenTypePrefix, false);
}
public void SetRulesByPrefix(string tokenTypePrefix, bool state)
{
    foreach (LexerRule<TokenType> rule in Rules)
    {
        // We have to do a string comparison here because of the generic type we're using in multiple nested
        // classes
        if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
        {
            rule.Enabled = state;
        }
    }
}

This set of methods lets us enable or disable rules based on what their names start with. For example, if I have the 3 rules CommentStart, CommentEnd, and FunctionStart, then calling EnableRulesByPrefix("Comment") will enable CommentStart and CommentEnd, but not FunctionStart.
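
To give a feel for how all of this is meant to be driven from the outside, here's a quick usage sketch - the token types and regular expressions below are made-up examples (we'll define a proper example enum a little later on):

Lexer<SimpleToken> lexer = new Lexer<SimpleToken>();
lexer.AddRules(new List<LexerRule<SimpleToken>>() {
    new LexerRule<SimpleToken>(SimpleToken.Whitespace, @"\s+"),
    new LexerRule<SimpleToken>(SimpleToken.Comment, @"//.*"),
    new LexerRule<SimpleToken>(SimpleToken.BlockOpen, @"\{"),
    new LexerRule<SimpleToken>(SimpleToken.BlockClose, @"\}")
});

// input.txt is just an example filename
using (StreamReader reader = new StreamReader("input.txt"))
{
    lexer.Initialise(reader);
    foreach (LexerToken<SimpleToken> token in lexer.TokenStream())
        Console.WriteLine(token);
}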

With all the interfacing code out of the way, let's turn to the real meat of the subject: the TokenStream method. This is the method behind the magic. It's a bit complicated, so let's take it step-by-step. Firstly, we need to iterate over the lines in the StreamReader:

string nextLine;
List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
while ((nextLine = textStream.ReadLine()) != null)
{
    CurrentLinePos = 0;

    // .....

    CurrentLineNumber++;
    TotalCharsScanned += CurrentLinePos;
}

Fairly simple, right? I've used this construct a few times in the past. Before you ask, we'll get to matches in just a moment :-) Next, we need another while loop that iterates until we reach the end of the line:

while (CurrentLinePos < nextLine.Length)
{
    matches.Clear();
    foreach (LexerRule<TokenType> rule in Rules) {
        if (!rule.Enabled) continue;

        Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
        if (!nextMatch.Success) continue;

        matches.Add(new LexerToken<TokenType>(rule, nextMatch));
    }

    // .....
}

Also fairly easy to follow. Basically, we clear the matches list, and then attempt to find the next match from the current position on the line that we've reached (CurrentLinePos) for every rule - and we store all the successful matches for further inspection and processing. We also make sure we skip any disabled rules here, too.

If we don't find any matching rules, then that must mean that we can't match the rest of this line to any known token. In this case, we want to emit an unknown token:

if (matches.Count == 0) {
    yield return new LexerToken<TokenType>(nextLine.Substring(CurrentLinePos));
    break;
}

This is what that extra LexerToken constructor is for that we created above. Note that we yield return here, instead of simply returning - this is very similar in construct to the yield statement in Javascript that I blogged about before (and again here), in that they allow you to maintain state inside a method and return multiple values in sequence.

By an 'unknown token', I am referring to the default value of the TokenType enum. Here's an example enum you might use with this lexer class:

public enum SimpleToken {
    Unknown = 0,

    Whitespace,

    Comment,
    BlockOpen,
    BlockClose,
}

Since the value Unknown is explicitly assigned the index 0, we can be absolutely certain that it's the default value of this enum.

With our list of potential matches in hand, we next need to sort it in order to work out which one we should prioritise. After deliberating and experimenting a bit, I came up with this:

matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
    // Sort by the offset of the match within the line
    int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
                nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
    // If they both start at the same position, then go with the longest one
    if (result == 0)
        result = b.RegexMatch.Length - a.RegexMatch.Length;

    return result;
});

Basically, this sorts them so that the matches closest to the current scanning position on the line (CurrentLinePos) are prioritised. If 2 or more matches tie based on this criterion, we prioritise the longest match. This seems to work for now - I can always change it later if it becomes a problem :P

With our matches sorted, we can now pick out the one we're going to emit next. Before we do so though, we need to take care of any characters between our current scanning position and the start of the next token:

LexerToken<TokenType> selectedToken = matches[0];
int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;

if (selectedTokenOffset > 0) {
    // Grab the text we're skipping over *before* advancing the scanning position
    string extraTokenContent = nextLine.Substring(CurrentLinePos, selectedTokenOffset);
    CurrentLinePos += selectedTokenOffset;
    yield return new LexerToken<TokenType>(extraTokenContent);
}

This emits these additional characters as an unknown token as we did before. Finally, we can emit the token and continue onto the next iteration of the loop:

CurrentLinePos += selectedToken.RegexMatch.Length;
yield return selectedToken;

That concludes our TokenStream method - and with it this lexer! Here's the code in full:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace SBRL.Tools
{
    public class LexerRule<TokenType>
    {
        public readonly TokenType Type;
        public readonly Regex RegEx;
        public bool Enabled { get; set; } = true;

        public LexerRule(TokenType inName, string inRegEx)
        {
            if (!typeof(TokenType).IsEnum)
                throw new ArgumentException($"Error: inName must be an enum - {typeof(TokenType)} passed");

            Type = inName;
            RegEx = new Regex(inRegEx);
        }

        public bool Toggle()
        {
            Enabled = !Enabled;
            return Enabled;
        }
    }

    public class LexerToken<TokenType>
    {
        public readonly bool IsNullMatch = false;
        public readonly LexerRule<TokenType> Rule = null;
        public readonly Match RegexMatch;

        public TokenType Type {
            get {
                try {
                    return Rule.Type;
                }
                catch (NullReferenceException) {
                    return default(TokenType);
                }
            }
        }
        private string nullValueData;
        public string Value {
            get {
                return IsNullMatch ? nullValueData : RegexMatch.Value;
            }
        }

        public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
        {
            Rule = inRule;
            RegexMatch = inMatch;
        }
        public LexerToken(string unknownData)
        {
            IsNullMatch = true;
            nullValueData = unknownData;
        }

        public override string ToString()
        {
            return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
        }
    }

    public class Lexer<TokenType>
    {
        public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

        public bool Verbose { get; set; } = false;

        /// <summary>
        /// The number of the line that is currently being scanned.
        /// </summary>
        public int CurrentLineNumber { get; private set; } = 0;
        /// <summary>
        /// The number of characters on the current line that have been scanned.
        /// </summary>
        /// <value>The current line position.</value>
        public int CurrentLinePos { get; private set; } = 0;
        /// <summary>
        /// The total number of characters currently scanned by this lexer instance.
        /// Only updated every newline!
        /// </summary>
        public int TotalCharsScanned { get; private set; } = 0;

        private StreamReader textStream;

        public Lexer()
        {

        }

        public void AddRule(LexerRule<TokenType> newRule)
        {
            Rules.Add(newRule);
        }
        public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
        {
            Rules.AddRange(newRules);
        }

        public void Initialise(StreamReader reader)
        {
            textStream = reader;
        }

        public IEnumerable<LexerToken<TokenType>> TokenStream()
        {
            string nextLine;
            List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
            while ((nextLine = textStream.ReadLine()) != null)
            {
                CurrentLinePos = 0;

                while (CurrentLinePos < nextLine.Length)
                {
                    matches.Clear();
                    foreach (LexerRule<TokenType> rule in Rules) {
                        if (!rule.Enabled) continue;

                        Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
                        if (!nextMatch.Success) continue;

                        matches.Add(new LexerToken<TokenType>(rule, nextMatch));
                    }

                    if (matches.Count == 0) {
                        string unknownTokenContent = nextLine.Substring(CurrentLinePos);
                        if(Verbose) Console.WriteLine("[Unknown Token: No matches found for this line] {0}", unknownTokenContent);
                        yield return new LexerToken<TokenType>(unknownTokenContent);
                        break;
                    }

                    matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
                        // Sort by the offset of each match on the line first
                        int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
                                    nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
                        // If they both start at the same position, then go with the longest one
                        if (result == 0)
                            result = b.RegexMatch.Length - a.RegexMatch.Length;

                        return result;
                    });
                    LexerToken<TokenType> selectedToken = matches[0];
                    int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;

                    if (selectedTokenOffset > 0) {
                        string extraTokenContent = nextLine.Substring(CurrentLinePos, selectedTokenOffset);
                        CurrentLinePos += selectedTokenOffset;
                        if(Verbose) Console.WriteLine("[Unmatched content] '{0}'", extraTokenContent);
                        yield return new LexerToken<TokenType>(extraTokenContent);
                    }

                    CurrentLinePos += selectedToken.RegexMatch.Length;
                    if(Verbose) Console.WriteLine(selectedToken);
                    yield return selectedToken;
                }

                if(Verbose) Console.WriteLine("[Lexer] Next line");
                CurrentLineNumber++;
                TotalCharsScanned += CurrentLinePos;
            }
        }

        public void EnableRule(TokenType type)
        {
            SetRule(type, true);
        }
        public void DisableRule(TokenType type)
        {
            SetRule(type, false);
        }
        public void SetRule(TokenType type, bool state)
        {
            foreach (LexerRule<TokenType> rule in Rules)
            {
                // We have to do a string comparison here because of the generic type we're using in multiple nested
                // classes
                if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
                    rule.Enabled = state;
                    return;
                }
            }
        }

        public void EnableRulesByPrefix(string tokenTypePrefix)
        {
            SetRulesByPrefix(tokenTypePrefix, true);
        }
        public void DisableRulesByPrefix(string tokenTypePrefix)
        {
            SetRulesByPrefix(tokenTypePrefix, false);
        }
        public void SetRulesByPrefix(string tokenTypePrefix, bool state)
        {
            foreach (LexerRule<TokenType> rule in Rules)
            {
                // We have to do a string comparison here because of the generic type we're using in multiple nested
                // classes
                if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
                {
                    rule.Enabled = state;
                }
            }
        }
    }
}

It's a bit much to take in all at once, but hopefully by breaking it down into steps I've made it easier to understand how I built it, and how all the different pieces fit together. The only real difference between the above code and the code I walked through in this post is the Verbose parameter I added for testing purposes, and the associated Console.WriteLine calls. For fun, here's a very basic LOGO (also here) lexer. I've based it on what I remember from using MSWLogo / FMSLogo a long time ago (there seem to be many dialects around these days):

public enum LogoToken
{
    Unknown = 0,

    Whitespace,

    FunctionForwards,
    FunctionBackwards,
    FunctionLeft,
    FunctionRight,
    FunctionPenUp,
    FunctionPenDown,

    Number
}

public class LogoLexer : Lexer<LogoToken>
{
    public LogoLexer()
    {
        AddRules(new List<LexerRule<LogoToken>>() {
            new LexerRule<LogoToken>(LogoToken.Whitespace,    @"\s+"),

            new LexerRule<LogoToken>(LogoToken.FunctionForwards,    @"FD"),
            new LexerRule<LogoToken>(LogoToken.FunctionBackwards,    @"BK"),
            new LexerRule<LogoToken>(LogoToken.FunctionLeft,    @"LT"),
            new LexerRule<LogoToken>(LogoToken.FunctionRight,    @"RT"),

            new LexerRule<LogoToken>(LogoToken.FunctionPenUp,    @"PU"),
            new LexerRule<LogoToken>(LogoToken.FunctionPenDown,    @"PD"),

            new LexerRule<LogoToken>(LogoToken.Number,    @"\d+"),
        });
    }
}
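
Before we get to an example program, here's a quick sketch of how this LogoLexer might actually be driven. The class name and filename below are just placeholders I've picked for illustration - the important bits are the call to Initialise() and then the enumeration of TokenStream():

using System.IO;
using SBRL.Tools; // The Lexer classes above live in this namespace

public class LogoLexerDemo
{
    public static void Main(string[] args)
    {
        // "example.logo" is just a placeholder filename for this sketch
        using (StreamReader reader = new StreamReader("example.logo"))
        {
            LogoLexer lexer = new LogoLexer() { Verbose = true };
            lexer.Initialise(reader);

            // TokenStream() is lazy, so we have to enumerate it to make the lexer do any work.
            // With Verbose switched on, the lexer itself prints each token as it's emitted,
            // along with a "[Lexer] Next line" marker at the end of every line.
            foreach (LexerToken<LogoToken> token in lexer.TokenStream())
            {
                // Hand each token off to a parser (or whatever else) here
            }
        }
    }
}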

Here's an example LOGO program that it lexes:
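
FD 100 RT 90
FD 50 PU RT 180
BK 40 LT 45 FD 250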

...and here's the output from lexing that example program:

[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=100]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=90]
[Lexer] Next line
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=50]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionPenUp, Value=PU]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=180]
[Lexer] Next line
[LexerToken: Type=FunctionBackwards, Value=BK]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=40]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionLeft, Value=LT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=45]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=250]
[Lexer] Next line
[Lexer] Next line

Very cool! This could easily be extended to support more of the LOGO syntax. As an exercise, can you extend it to support the REPEAT statement? At some point in the future, I might go even further and build a bottom-up left-to-right shift-reduce parser, and combine it with this lexer and some BNF to create a parse tree.
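
If you fancy having a go at the REPEAT exercise, here's a rough sketch of one possible starting point. The enum values and class name here are just suggestions of mine (they aren't based on any particular LOGO dialect's grammar) - all this does is teach the lexer to recognise REPEAT and the square brackets that delimit the repeated block; counting iterations and actually executing the block would be the parser's job:

using System.Collections.Generic;
using SBRL.Tools; // where the Lexer classes above live

public enum ExtendedLogoToken
{
    Unknown = 0,

    Whitespace,

    FunctionForwards,
    FunctionBackwards,
    FunctionLeft,
    FunctionRight,
    FunctionPenUp,
    FunctionPenDown,

    // New token types for REPEAT support
    FunctionRepeat,
    BlockOpen,   // [
    BlockClose,  // ]

    Number
}

public class ExtendedLogoLexer : Lexer<ExtendedLogoToken>
{
    public ExtendedLogoLexer()
    {
        AddRules(new List<LexerRule<ExtendedLogoToken>>() {
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.Whitespace,    @"\s+"),

            // The new rules: REPEAT itself, plus the square brackets
            // that delimit the block of instructions to repeat
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.FunctionRepeat,    @"REPEAT"),
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.BlockOpen,    @"\["),
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.BlockClose,    @"\]"),

            // The same rules as the LogoLexer above
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.FunctionForwards,    @"FD"),
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.FunctionBackwards,    @"BK"),
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.FunctionLeft,    @"LT"),
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.FunctionRight,    @"RT"),
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.FunctionPenUp,    @"PU"),
            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.FunctionPenDown,    @"PD"),

            new LexerRule<ExtendedLogoToken>(ExtendedLogoToken.Number,    @"\d+"),
        });
    }
}

Lexing a line like REPEAT 4 [FD 100 RT 90] should then produce FunctionRepeat, Number, BlockOpen, FunctionForwards, Number, FunctionRight, Number, and BlockClose tokens (with Whitespace tokens in between) for a parser to consume.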

Enjoyed this post? Don't quite understand something? Think you could do better? Post a comment below!

GlidingSquirrel is now on NuGet with automatic API documentation!

(This post is a bit late as I'm rather under the weather.)

A while ago, I posted about an HTTP/1.0 server library I'd written in C♯ for a challenge. One thing led to another, and I've now ended up adding support for HTTP/1.1, WebSockets, and more....

While these enhancements have been driven mainly by a certain project of mine that keeps stealing all of my attention, they've been fun to add - and now that I've (finally!) worked out how to package it as a NuGet package, it's available on NuGet for everyone to use, under the Mozilla Public License 2.0!

Caution should be advised though, as I've not yet tested it thoroughly enough to weed out the slew of bugs and vulnerabilities I'm sure are hiding in there somewhere. It's designed mainly to sit behind a reverse-proxy such as Nginx (not that that's any excuse, I know!), and I don't yet have a good set of test cases to check it against (I thought for sure there would be at least one test suite out there for HTTP servers, considering the age of the protocol, but apparently not!).

With that out of the way, specifying dependencies didn't turn out to be that hard, actually. Here's the extra bit I added to the .nuspec file to specify the GlidingSquirrel's dependencies:

<dependencies>
    <dependency id="MimeSharp" version="1.0.0" />
    <dependency id="System.ValueTuple" version="4.4.0" />
</dependencies>

The above goes inside the <metadata> element. If you're curious about what this strange .nuspec file is and its format, I blogged about it a while back when I figured out how to package my TeleConsole Client - in which I explain my findings and how you can do it yourself too!
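
If it helps to visualise it, here's a rough sketch of where that sits in a complete .nuspec file. The id, version, authors, and description values below are placeholders of mine, not the GlidingSquirrel's actual metadata:

<?xml version="1.0" encoding="utf-8"?>
<package>
    <metadata>
        <!-- Placeholder metadata - substitute your own package's details -->
        <id>MyPackage</id>
        <version>0.1.0</version>
        <authors>Your Name Here</authors>
        <description>A short description of the package.</description>

        <!-- The dependencies element from above lives here, inside the metadata element -->
        <dependencies>
            <dependency id="MimeSharp" version="1.0.0" />
            <dependency id="System.ValueTuple" version="4.4.0" />
        </dependencies>
    </metadata>
</package>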

I've still not figured out the strange $version$ and $description$ things that are supposed to reference the appropriate values in the .csproj file (if you know how those work, please leave a comment below, as I'd love to know!) - as it keeps telling me that <description> cannot be empty.....

In addition to packaging it and putting it up on NuGet, I've also taken a look at automatic documentation generation. For a while I've been painstakingly documenting the entire project with intellisense comments especially for this purpose - which I'm glad I started early, as it's been a real pain and taken a rather decent chunk of time to do.

Anyway, with the last of the intellisense comments finished today, I set about generating automatic documentation - which turned out to be a surprisingly difficult thing to achieve.

  • Sandcastle, Microsoft's offering, has been helpfully discontinued.
  • Sandcastle Help File Builder, the community open-source fork of Sandcastle, appears to be Windows-only (I use Linux myself).
  • DocFX, Microsoft's new offering, appears to be more of a static site generator that uses your code and a bunch of other things as input, rather than the simple HTML API documentation generator I was after.
  • While I found Doxygen to produce an acceptable result out-of-the-box, I discovered that configuring it was not a trivial task - the initial configuration file the init action created was over 100 KiB!

I found a few other options too, but they were either Windows-only, or commercial offerings that I can't justify using for an open-source project.

When I was about to give up the search for the day, I stumbled across this page on generating documentation by the mono project, who just happen to be the ones behind mono, the cross-platform .NET runtime that runs on Linux, macOS, and, of course, Windows. They're also the ones who have built mcs, the C♯ compiler that complements the mono runtime.

Apparently, they've also built a documentation generation tool that can export to HTML. While it's certainly nothing fancy, it does look like you've got all the power you need at your fingertips to customise the look and feel by tweaking the XSLT stylesheet it renders with, should you need to.

After a bit of research, I found this pair of commands to build the documentation for me quite nicely:

mdoc update -o docs_xml/ -i GlidingSquirrel.xml GlidingSquirrel.dll
mdoc export-html --out docs/ docs_xml

The first command builds the multi-page XML tree from the XML documentation generated during the build process (make sure the "Generate xml documentation" option is ticked in the project build options!), and the second one transforms that into HTML that you can view in your browser.

With the commands figured out, I set about including it in my build process. Here's what I came up with:

<Target Name="AfterBuild">
    <Message Importance="high" Text="----------[ Building documentation ]----------" />

    <Exec Command="mdoc update -o docs_xml/ -i GlidingSquirrel.xml GlidingSquirrel.dll" WorkingDirectory="$(TargetDir)" IgnoreExitCode="true" />
    <Exec Command="mdoc export-html --out $(SolutionDir)/docs docs_xml" WorkingDirectory="$(TargetDir)" IgnoreExitCode="true" />
</Target>

This outputs the documentation to the docs/ folder in the root of the solution. If you're a bit confused as to how to utilise this in your own project, then perhaps the Untangling MSBuild post I wrote to figure out how MSBuild works can help! You can view the GlidingSquirrel's automatically-generated documentation here. It's nothing fancy (yet), but it certainly does the job. I'll have to look into prettifying it later!
