
WorldEditAdditions: The story of the //convolve command

Sometimes, one finds a fascinating new use for an algorithm in a field completely unrelated to its original application. Such is the case with the algorithm behind the //convolve command in WorldEditAdditions, which I recently blogged about. At the time, I had tried out the //smooth command provided by we_env, which smooths out rough edges in the terrain inside the defined WorldEdit region - but I was less than satisfied with its flexibility, configurability, and performance.

To this end, I set about devising a solution. My breakthrough in solving the problem occurred when I realised that the problem of terrain smoothing is much easier when you consider it as a 2D heightmap, which is itself suspiciously like a greyscale image. Image editors such as GIMP already have support for smoothing with their various blur filters, the most interesting of which for my use case was the gaussian blur filter. By applying a gaussian blur to the terrain heightmap, it would have the effect of smoothing the terrain - and thus //convolve was born.

If you're looking for a detailed explanation on how to use this command, you're probably in the wrong place - this post is more of a developer commentary on how it is implemented. For an explanation on how to use the command, please refer to the chat command reference.

The gaussian blur effect is applied by performing an operation called a convolution over the input 2D array. This operation is not specific to image editors (it can be found in AI too), and it calculates the new value for each pixel by taking a weighted sum of its existing value and those of all of its neighbours. It's perhaps best explained with a diagram:

All the weights in the kernel (sometimes called a filter) are floating-point numbers and should add up to 1. Since we look at each pixel's neighbours, we can't operate on the pixels around the edge of the image. Different implementations have different approaches to solving this problem:

  1. Drop them in the output image (i.e. the output image is slightly smaller than the input) - a CNN in AI does this by default in most frameworks
  2. Ignore the pixel and pass it straight through
  3. Take only a portion of the kernel and remap the weights so that we can use that portion instead
  4. Wrap around to the other side of the image

For my implementation of the //convolve command, the heightmap is essentially infinite in all directions (although I obviously request only a small subset thereof), so I chose option 2 here. I calculate the boundary given the size of the kernel and request an area with an extra border around it, so that I can operate on all the pixels inside the defined region.

Let's look at the maths more specifically here:

Given an input image, for each pixel we multiply the kernel element-wise by that pixel and its neighbours, and then sum all the values in the resulting matrix to get the new pixel value.
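To make the weighted-sum idea concrete, here's a minimal sketch of a 2D convolution over a flat heightmap array. WorldEditAdditions itself is written in Lua; this sketch uses PHP purely for illustration, and the function and parameter names are my own invention rather than anything from the actual codebase.

<?php
// A minimal convolution sketch: for every non-border cell, multiply the
// kernel by the cell and its neighbours, and sum the results.
// Border cells are passed straight through (option 2 from the list above).
function convolve(array $heightmap, int $width, int $height, array $kernel, int $ksize): array {
    $border = intdiv($ksize, 2);
    $output = $heightmap; // Start with a copy, so border cells are unchanged
    for ($z = $border; $z < $height - $border; $z++) {
        for ($x = $border; $x < $width - $border; $x++) {
            $sum = 0;
            for ($kz = 0; $kz < $ksize; $kz++) {
                for ($kx = 0; $kx < $ksize; $kx++) {
                    $sum += $kernel[$kz * $ksize + $kx]
                        * $heightmap[($z + $kz - $border) * $width + ($x + $kx - $border)];
                }
            }
            $output[$z * $width + $x] = $sum;
        }
    }
    return $output;
}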

The specific weightings are of course configurable, and are represented by a rectangular (often square) 2D array. By manipulating these weightings, one can perform many different kinds of operations - blurring, sharpening, and edge detection can all be performed using this method.

Currently, WorldEditAdditions supports 3 different kernels in the //convolve command:

  • Box blur - frequently has artefacts
  • Gaussian Blur (additional parameter: sigma, which controls the 'blurriness') - the default, and also most complicated to implement
  • Pascal - a unique variant I derived from Pascal's Triangle. It works out slightly sharper than the Gaussian Blur, without exhibiting artefacts like box blur does.

After implementing the core convolution engine, writing functions to generate kernels of defined size for the above kernels was a relatively simple matter. If you know of another kernel that you think would be useful to have in WorldEditAdditions, I'm definitely open to implementing it.
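For instance, a Gaussian kernel generator might look something like the following sketch (again in PHP purely for illustration - the real generator is written in Lua, and the exact normalisation details and defaults may differ):

<?php
// Generate a normalised square Gaussian kernel of the given (odd) size.
// Each weight is sampled from the 2D Gaussian function, then the whole
// kernel is scaled so that the weights add up to 1.
function gaussian_kernel(int $size, float $sigma = 2.0): array {
    $kernel = [];
    $centre = intdiv($size, 2);
    $sum = 0;
    for ($z = 0; $z < $size; $z++) {
        for ($x = 0; $x < $size; $x++) {
            $weight = exp(-(($x - $centre) ** 2 + ($z - $centre) ** 2) / (2 * $sigma ** 2));
            $kernel[$z * $size + $x] = $weight;
            $sum += $weight;
        }
    }
    return array_map(function ($weight) use ($sum) {
        return $weight / $sum; // Normalise so everything sums to 1
    }, $kernel);
}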

For the curious, the core convolutional filter can be found in convolve.lua. It takes 4 arguments:

  1. The heightmap - a flat ZERO indexed array of values. Think of a flat array containing image data like the HTML5 Canvas getImageData() function.
  2. The size of the heightmap in the form { x = width, z = height }
  3. The kernel (internally called a matrix), in the same form as the heightmap
  4. The size of the kernel in the same form as the size of the heightmap.

Reflecting on the implementation, I'm pretty happy with it within the restrictions of the environment in which it runs. As you might have suspected from my explanation above, convolution-based operations are highly parallelisable (there's a reason they are used in AI), so there's a ton of scope for using multiple threads, SIMD, and more - but unfortunately the Lua (5.1! It's so frustrating, since 5.4 is out now) environment in Minetest doesn't allow for threading or other such optimisations to be made.

The other thing I might do something about soon is the syntax of the command. Currently it looks like this:

//convolve <kernel> [<width>[,<height>]] [<sigma>]

Given the way I've used this command in the field, I've found myself specifying the kernel width and the sigma value more often than I've found myself changing the kernel. With this observation in hand, it would be nice to figure out a better syntax that reduces the amount of typing for everyday use-cases.

In a future post about WorldEditAdditions, I want to talk about a number of topics, such as the snowballs algorithm behind //erode. I may also post about the new //noise2d command I've been working on, but that might take a while (I'm still working on implementing a decent noise function for it, since at the time of typing the Perlin noise implementation I've ported is less than perfect).

Variable-length fuzzy hashes with Nilsimsa for did you mean correction

Or, why fuzzy hashing isn't helpful for improving a search engine. Welcome to another blog post about one of my special interests: search engines - specifically the implementation thereof :D

I've blogged about search engines before, in which I looked at taking my existing search engine implementation to the next level by switching to a SQLite-based key-value datastore backing and stress-testing it with ~5M words. Still not satisfied, I'm now turning my attention to query correction. Have you ever seen something like this when you make a typo when you do a search?

Surprisingly, this is actually quite challenging to achieve. While it's easy to determine that a word contains a typo, it's much harder to determine what the correct version of that word should be. Consider a wordlist like this:

apple
orange
pear
grape
pineapple

If the user entered something like pinneapple, then it's obvious to us that the correct spelling would be pineapple - but in order to determine this algorithmically, you need an algorithm capable of determining how close 2 different words are to one another.

The most popular algorithm for this is the Levenshtein edit distance. Given 2 words a and b, it calculates the number of single-character edits needed to turn a into b. For example, the edit distance between pinneapple and pineapple is 1.
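As it happens, PHP has a levenshtein() function built right in, so we can check that example directly:

<?php
var_dump(levenshtein("pinneapple", "pineapple")); // int(1) - a single deletion
var_dump(levenshtein("pinneapple", "apple"));     // int(5) - five deletions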

This is useful, but it still doesn't help us very much. With this alone, we'd have to calculate the Levenshtein distance between the typo and every word in the wordlist - which could easily run into millions of words for large wikis, so it's obviously completely impractical.

To this end, we need a better idea. In this post, I'm going to talk about my first attempt at solving this problem. I feel it's important to document failures as well as successes, so this is part 1 of a 2 part series.

The first order of business is to track down a Nilsimsa implementation in PHP - since it doesn't come built-in, and it's pretty complicated to implement. Thankfully, this isn't too hard - I found this one on GitHub.

Nilsimsa is a fuzzy hashing algorithm. This means that if you hash 2 similar words, then you'll get 2 similar hashes:

Word Hash
pinneapple 020c2312000800920004880000200002618200017c1021108200421018000404
pineapple 0204239242000042000428018000213364820000d02421100200400018080200256

If you look closely, you'll notice that the hashes are quite similar. My thinking is that if we vary the hash size, then words that are similar will have identical hashes, allowing the search space to be cut down significantly. The existing Nilsimsa implementation I've found doesn't support that though, so we'll need to alter it.

This didn't turn out to be too much of a problem. By removing some magic numbers and adding a class member variable, it seems to work like a charm:

(Can't view the above? Try this direct link.)

I removed the comparison functions since I'm not using them (yet?), and also added a static convenience method for generating hashes. If I end up using this for large quantities of hashes, I may come back to it and make it resettable, to avoid having to create a new object for every hash.

With this, we can get the variable-length hashes we wanted:

256       0a200240020004a180810950040a00d033828480cd16043246180e54444060a5
128       3ba286c0cf1604b3c6990f54444a60f5
64        02880ed0c40204b1
32        060a04f0
16        06d2
8         06

The number there is the number of bits in the hash, and the hex value is the hash itself. The algorithm defaults to 256-bit hashes. Next, we need to determine which sized hash is best. The easiest way to do this is to take a list of typos, hash the typo and the correction, and count the number of hashes that are identical.

Thankfully, there's a great dataset just for this purpose. Since it's formatted in CSV, we can download it and extract the typos and corrections in 1 go like this:

curl https://raw.githubusercontent.com/src-d/datasets/master/Typos/typos.csv | cut -d',' -f2-3 >typos.csv

There's also a much larger dataset too, but that one is formatted as JSON objects and would require a bunch of processing to get it into a format that would be useful here - and since this is just a relatively quick test to get a feel for how our idea works, I don't think it's too crucial that we use the larger dataset just yet.

With the dataset downloaded, we can run our test. First, we need to read the file in line-by line for every hash length we want to test:

<?php
$handle = fopen("typos.csv", "r");

$sizes = [ 256, 128, 64, 32, 16, 8 ];
foreach($sizes as $size) {
    fseek($handle, 0); // Jump back to the beginning
    fgets($handle); // Skip the first line since it's the header

    while(($line = fgets($handle)) !== false) {
        // Do something with the next line here
    }
}

PHP has an inbuilt function fgets() which gets the next line of input from a file handle, which is convenient. Next, we need to actually do the hashes and compare them:

<?php

// .....

$parts = explode(",", trim($line), 2);
if(strlen($parts[1]) < 3) {
    $skipped++;
    continue;
}
$hash_a = Nilsimsa::hash($parts[0], $size);
$hash_b = Nilsimsa::hash($parts[1], $size);

$count++;
if($hash_a == $hash_b) {
    $count_same++;
    $same[] = $parts;
}
else {
    $not_same[] = $parts;
}
echo("$count_same / $count ($skipped skipped)\r");

// .....

Finally, a bit of extra logic around the edges and we're ready for our test:

<?php
$handle = fopen("typos.csv", "r");
$line_count = lines_count($handle);
echo("$line_count lines total\n");

$sizes = [ 256, 128, 64, 32, 16, 8 ];
foreach($sizes as $size) {
    fseek($handle, 0); fgets($handle); // Skip the first line, since it's the header

    $count = 0; $count_same = 0; $skipped = 0;
    $same = []; $not_same = [];
    while(($line = fgets($handle)) !== false) {
        $parts = explode(",", trim($line), 2);
        if(strlen($parts[1]) < 3) {
            $skipped++;
            continue;
        }
        $hash_a = Nilsimsa::hash($parts[0], $size);
        $hash_b = Nilsimsa::hash($parts[1], $size);

        $count++;
        if($hash_a == $hash_b) {
            $count_same++;
            $same[] = $parts;
        }
        else $not_same[] = $parts;
        echo("$count_same / $count ($skipped skipped)\r");
    }

    file_put_contents("$size-same.csv", implode("\n", array_map(function ($el) {
        return implode(",", $el);
    }, $same)));
    file_put_contents("$size-not-same.csv", implode("\n", array_map(function ($el) {
        return implode(",", $el);
    }, $not_same)));

    echo(str_pad($size, 10)."→ $count_same / $count (".round(($count_same/$count)*100, 2)."%), $skipped skipped\n");
}

I'm writing the pairs that are the same and different to different files here for a visual inspection. I'm also skipping words that are less than 3 characters long, and that lines_count() function there is just a quick helper function for counting the number of lines in a file for the progress indicator (if you write a \r without a \n to the terminal, it'll reset to the beginning of the current line):

<?php
function lines_count($handle) : int {
    fseek($handle, 0);
    $count = 0;
    while(fgets($handle) !== false) $count++;
    return $count;
}

Unfortunately, the results of running the test aren't too promising. Even with the shortest hash the algorithm will generate without getting upset, only ~23% of typos generate the same hash as their correction:

7375 lines total
256       → 7 / 7322 (0.1%), 52 skipped
128       → 9 / 7322 (0.12%), 52 skipped
64        → 13 / 7322 (0.18%), 52 skipped
32        → 64 / 7322 (0.87%), 52 skipped
16        → 347 / 7322 (4.74%), 52 skipped
8         → 1689 / 7322 (23.07%), 52 skipped

Furthermore, digging deeper, with an 8-bit hash you start to get large numbers of different words that all share the same hash, which isn't ideal at all.

A potential solution here would be to use the hamming distance (basically, counting the number of bits that differ between 2 binary strings) to determine which hashes are similar to each other, in the same way that the Levenshtein distance does for words - but that still doesn't help us, as we'd then be left with a problem almost identical to the one we started with.
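For reference, computing the hamming distance between 2 raw hashes is simple enough in PHP. Here's a minimal sketch, assuming both hashes are binary strings of the same length:

<?php
// Count the number of bits that differ between 2 equal-length binary strings.
function hamming_distance(string $a, string $b): int {
    $distance = 0;
    for ($i = 0; $i < strlen($a); $i++) {
        // XOR the byte pair together, then count the bits left set
        $distance += substr_count(decbin(ord($a[$i]) ^ ord($b[$i])), "1");
    }
    return $distance;
}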

In the second part of this mini-series, I'm going to talk about how I ultimately solved this problem. While the algorithm I ultimately used (a BK-Tree, more on them next time) is certainly not the most efficient out there (it's O(log n) if I understand it correctly), it's very simple to implement and is much less complicated than Symspell, which seems to be the most efficient algorithm that exists at the moment.

Additionally, I have been able to optimise said algorithm to return results for a 172K wordlist in ~110ms, which is fine for my purposes.

Found this interesting? Got another algorithm I should check out? Got confused somewhere along the way? Comment below!

Next Gen Search, Part 1: Backend Storage

I've got a bit of a thing about full-text search engines. I've talked about one in particular before for Pepperminty Wiki, and I was thinking about it again the other day - and how I could optimise it further.

If you haven't already, I do recommend reading my previous post on the curious question - as a number of things in this post might not make sense otherwise.

Between the time I wrote that last post and now, I've done quite a bit more digging into the root causes of that ~450ms search time, and I eventually determined that most of it was actually being spent deserialising the inverted index from the JSON file it's stored in back into memory.

This is horribly inefficient and is actually taking up 90% of that query time, so I got to thinking about what I could do about it.

My solution was multi-faceted, as I also (separately) developed a new search-term analysis system (which I abbreviated to STAS, because it sounds cool :D) to add advanced query syntax, such as -dog to exclude pages that contain the word dog and the like - which I also folded in at the same time as fixing the existing bottleneck.

As of the time of typing, the work on STAS is still ongoing. This doesn't mean that I can't blog about the rest of it though! I've recently-ish come into contact with key-value data stores, such as Redis and LevelDB (now RocksDB). They work rather like a C♯ dictionary or a browser's localStorage, in that they store values that are associated with unique keys (Redis is a bit of a special case though, for reasons that I won't get into here).

Such a data store would suit an inverted index surprisingly well. I devised a key system to represent an inverted index:

The devised key system in diagram form, explained below.

  • The first key, |termindex|, is used to store a master list of words that have entries in the page index.
  • The second key, term is simply the word itself (e.g. cat, chicken, rocket, etc.), and stores a list of ids of the pages that contain that word.
  • The third and final key, term|pageid, is a word followed by a separator and finally the id of a page (e.g. bill|1, cat|45, boosters|69, etc.). These keys store the locations at which a particular word appears in a given document in the index. A separate id index is needed to convert between a page id and its associated name - Pepperminty Wiki provides this functionality out-of-the-box.

The more I thought about it, the more I liked it. If I use a key-value data store in this manner, I can store the values as JSON objects - and then I only have to deserialise the parts of the index that I actually use. Furthermore, adding this extra layer of abstraction allows for some clever trickery to speed things up even more.

The problem here is that Pepperminty Wiki is supposed to be portable, so I try not to use any heavy external libraries or depend on odd PHP modules that not everyone will have installed. While a LevelDB extension for PHP does exist, it's not installed by default, and PECL modules are awkward to install.

All isn't lost though, because it turns out that SQLite functions surprisingly well as a key-value data store:

CREATE TABLE store (
    key TEXT UNIQUE NOT NULL,
    value TEXT
);

Yes - it really is that simple! Now all we need is some code to create and interface with such a database. Some simple getters and setters should suffice!

(Can't see the above? Try a direct link.)
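In case the embedded code above isn't visible, here's a minimal sketch of the sort of thing I mean - a tiny PDO-backed key-value class. This is illustrative only (the class and method names here are made up); the real implementation lives in the gist linked above:

<?php
// A minimal SQLite-backed key-value store using PDO.
class KeyValueStore {
    private $db;

    public function __construct(string $filename) {
        $this->db = new PDO("sqlite:$filename");
        $this->db->exec("CREATE TABLE IF NOT EXISTS store (key TEXT UNIQUE NOT NULL, value TEXT);");
    }

    public function get(string $key) {
        $statement = $this->db->prepare("SELECT value FROM store WHERE key = :key;");
        $statement->execute([ "key" => $key ]);
        $value = $statement->fetchColumn();
        return $value === false ? null : json_decode($value); // Decode on every read
    }

    public function set(string $key, $value) : void {
        $statement = $this->db->prepare("INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value);");
        $statement->execute([ "key" => $key, "value" => json_encode($value) ]); // Encode on every write
    }
}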

While this works, I quickly ran into several major problems:

  • Every time I read from the data store I'm decoding JSON, which is expensive
  • Every time I'm saving to the data store, I'm encoding to JSON, which is also expensive
  • I'm reading and writing the same thing multiple times, which is very expensive
  • Writing anything to the data store takes a long time

The (partial) solution here was to refactor it such that the JSON encoding is handled by the storage provider, and to implement a cache.

Such a cache could just be an associative array:

private $cache = [];

Then, to fetch a value, we can do this:

// If it's not in the cache, insert it
if(!isset($this->cache[$key])) {
    $this->cache[$key] = [ "modified" => false, "value" => json_decode($this->query(
        "SELECT value FROM store WHERE key = :key;",
        [ "key" => $key ]
    )->fetchColumn()) ];
}
return $this->cache[$key]["value"];

Notice how each item in the cache is also an associative array. This is so that we can flag items that have been modified in memory, such that when we next sync to disk we can write multiple changes all at once in a single batch. That turns the setter into something like this:

if(!isset($this->cache[$key])) $this->cache[$key] = [];
$this->cache[$key]["value"] = $value;
$this->cache[$key]["modified"] = true;

Very cool! Now all we need is a function to batch-write all the changes to disk. This isn't hard either:

foreach($this->cache as $key => $value_data) {
    // If it wasn't modified, there's no point in saving it, is there?
    if(!$value_data["modified"])
        continue;

    $this->query(
        "INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)",
        [
            "key" => $key,
            "value" => json_encode($value_data["value"])
        ]
    );
}

I'll get onto the cogs and wheels behind that query() function a bit later in this post. It's one of those utility functions that are great to have around that I keep copying from one project to the next.

Much better, but still not great. Why does it still take ages to write all the changes to disk?

Well, it turns out that by default SQLite wraps every INSERT in its own transaction. If we wrap our code in an explicit transaction, we can seriously boost the speed even further:

$this->db->beginTransaction();

// Do batch writing here

$this->db->commit();
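
Putting the transaction and the batch-write loop together, the sync-to-disk routine ends up looking something like this (a sketch based on the snippets above - the method name and exact structure are illustrative, not the actual Pepperminty Wiki code):

/**
 * Writes all modified cache entries to disk in a single transaction.
 */
public function save() : void {
    $this->db->beginTransaction();
    foreach($this->cache as $key => $value_data) {
        if(!$value_data["modified"]) continue; // Unchanged items don't need saving

        $this->query(
            "INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)",
            [ "key" => $key, "value" => json_encode($value_data["value"]) ]
        );
        $this->cache[$key]["modified"] = false; // Mark as synced
    }
    $this->db->commit();
}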

Excellent (boomed the wizard).

But wait, there's more! The PHP PDO database driver supports prepared statements, which is a fancy way of caching SQL statements and substituting values into them. We use these already, but since we only use a very limited number of SQL queries, we can squeak some additional performance out by caching them in their prepared forms (preparing them is relatively computationally expensive, after all).

This is really easy to do. Let's create another associative array to store them in:

private $query_cache = [];

Then, we can alter the query() function to look like this:

/**
 * Makes a query against the database.
 * @param   string  $sql        The (potentially parametised) query to make.
 * @param   array   $variables  Optional. The variables to substitute into the SQL query.
 * @return  \PDOStatement       The result of the query, as a PDOStatement.
 */
private function query(string $sql, array $variables = []) {
    // Add to the query cache if it doesn't exist
    if(!isset($this->query_cache[$sql]))
        $this->query_cache[$sql] = $this->db->prepare($sql);
    $this->query_cache[$sql]->execute($variables);
    return $this->query_cache[$sql]; // fetchColumn(), fetchAll(), etc. are defined on the statement, not the return value of execute()
}

If a prepared statement for the given SQL exists in the query cache already, we re-use that again instead of preparing a brand-new one. This works especially well, since we perform a large number of queries with the same small set of SQL queries. Get operations all use the same SQL query, so why not cache it?

This completes our work on the backend storage for the next-generation search system I've built for Pepperminty Wiki. Not only will it boost the speed of Pepperminty Wiki's search engine, but it's also a cool reusable component, which I can apply to other applications to give them a boost, too!

Next time, we'll take a look at generating the test data required to stress-test the system.

I also want to give an overview of the new search-term analysis system and associated changes to the page ranking system I implemented (and aspire to implement) too, but those may have to wait until another post (teaser: PageRank isn't really what I'm after).

(Can't see the above? Try a direct link.)


Easy AI with Microsoft.Text.Recognizers

I recently discovered that there's an XMPP client library (NuGet) for .NET that I overlooked a few months ago, and so I promptly investigated the building of a bot!

The actual bot itself needs some polishing before I post about it here, but in writing said bot I stumbled across a perfectly brilliant library - released by Microsoft of all companies - that can be used to automatically extract common data-types from a natural-language sentence.

While said library is the underpinnings of the Azure Bot Framework, it's actually free and open-source. To that end, I decided to experiment with it - and ended up writing this blog post.

Data types include (but are not limited to) dates and times (and ranges thereof), numbers, percentages, monetary amounts, email addresses, phone numbers, simple binary choices, and more!

While it also lands you with a terrific number of DLL dependencies in your build output folder, the result is totally worth it! How about pulling a DateTime from this:

in 5 minutes

or this:

the first Monday of January

or even this:

next Monday at half past six

Pretty cool, right? You can even pull multiple things out of the same sentence. For example, from the following:

The host 1.2.3.4 has been down 3 times over the last month - the last of which was from 5pm and lasted 30 minutes

It can extract an IP address (1.2.3.4), a number (3), and a few dates and times (last month, 5pm, 30 minutes).

I've written a test program that shows it in action. Here's a demo of it working:

(Can't see the asciicast above? View it on asciinema.org)

The source code is, of course, available on my personal Git server: Demos/TextRecogniserDemo

If you can't check out the repo, here's the basic gist. First, install the Microsoft.Recognizers.Text package(s) for the types of data that you'd like to recognise. Then, to recognise a date or time, do this:

List<ModelResult> result = DateTimeRecognizer.RecognizeDateTime(nextLine, Culture.English);

The awkward bit is unwinding the ModelResult to get at the actual data. The parsed data is stored in the ModelResult.Resolution property, but that's a SortedDictionary<string, object>. The interesting entry inside it is values - but depending on the data type you're recognising, that can be an array too! The best way I've found to decipher the data types is to print the value of ModelResult.Resolution as a string to the console:

Console.WriteLine(result[0].Resolution.ToString());

The .NET runtime will helpfully convert this into something like this:

System.Collections.Generic.SortedDictionary`2[System.String,System.Object]

Very helpful. Then we can continue to drill down:

Console.WriteLine(result[0].Resolution["values"]);

This produces this:

System.Collections.Generic.List`1[System.Collections.Generic.Dictionary`2[System.String,System.String]]

Quite a mouthful, right? By cross-referencing this against the JSON (thanks, Newtonsoft.JSON!), we can figure out how to drill the rest of the way. I ended up writing myself a pair of little utility methods for dates and times:

public static DateTime RecogniseDateTime(string source, out string rawString)
{
    List<ModelResult> aiResults = DateTimeRecognizer.RecognizeDateTime(source, Culture.English);
    if (aiResults.Count == 0)
        throw new Exception("Error: Couldn't recognise any dates or times in that source string.");

    /* Example contents of the below dictionary:
        [0]: {[timex, 2018-11-11T06:15]}
        [1]: {[type, datetime]}
        [2]: {[value, 2018-11-11 06:15:00]}
    */

    rawString = aiResults[0].Text;
    Dictionary<string, string> aiResult = unwindResult(aiResults[0]);
    string type = aiResult["type"];
    if (!(new string[] { "datetime", "date", "time", "datetimerange", "daterange", "timerange" }).Contains(type))
        throw new Exception($"Error: An invalid type of {type} was encountered ('datetime' expected).");

    string result = Regex.IsMatch(type, @"range$") ? aiResult["start"] : aiResult["value"];
    return DateTime.Parse(result);
}

private static Dictionary<string, string> unwindResult(ModelResult modelResult)
{
    return (modelResult.Resolution["values"] as List<Dictionary<string, string>>)[0];
}

Of course, it depends on your use-case as to precisely how you unwind it, but the above should be a good starting point.

Once I've polished the bot I've written a bit, I might post about it on here.

Found this interesting? Run into an issue? Got a neat use for it? Comment below!

Using libsodium to upgrade the comment key system

I've blogged about the comment key system I utilise on this blog to prevent spam before (see also). Today, I've given it another upgrade to make it harder for spammers to fake a comment key!

In the last post, I transformed the comment key with a number of reversible operations - including a simple XOR password system. This is, of course, very insecure - especially since an attacker knows (or can at least guess) the content of the encrypted key, making it trivial (I suspect) to guess the password used for 'encryption'.

The solution here, obviously, is to utilise a better encryption system. Since it's the 5th November and I'm not particularly keen on my website going poof like the fireworks tonight, let's do something about it! PHP 7.2+ comes with native libsodium support (those still using older versions of PHP can still follow along! Simply install the PECL module). libsodium bills itself as

A modern, portable, easy to use crypto library.

After my experiences investigating it, I'd certainly say that's true (although the official PHP documentation could do with, erm, existing). I used this documentation instead - which is quite ironic, because I have actually base64-encoded the password...

Anyway, after doing some digging I found the quick reference, which explains how you can go about accomplishing common tasks with libsodium. For my comment key system, I want to encrypt my timestamp with a password - so I wanted the sodium_crypto_secretbox() and its associated sodium_crypto_secretbox_open() functions.

This pair of functions, when used together, implement a secure symmetric key encryption system. In other words, they securely encrypt a lump of data with a password. They do, however, have 2 requirements that must be taken care of. Firstly, the password must be of a specific length. This is easy enough to accomplish, as PHP is kind enough to tell us how long it needs to be:

/**
 * Generates a new random comment key system password.
 * @return string   The base64-encoded password.
 */
function key_generate_password() {
    return base64_encode_safe(random_bytes(SODIUM_CRYPTO_SECRETBOX_KEYBYTES));
}

Easy-peasy! base64_encode_safe isn't a built-in function - it's a utility function I wrote that we'll need later. For consistency, I've used it here too. Here it is, along with its counterpart:

/**
 * Encodes the given data with base64, but replaces characters that could 
 * get mangled in transit with safer equivalents.
 * @param   mixed   $data   The data to encode.
 * @return  string  The data, encoded as transmit-safe base64.
 */
function base64_encode_safe($data) {
    return str_replace(["/", "+"], ["-", "_"], base64_encode($data));
}
/**
 * Decodes data encoded with base64_encode_safe().
 * @param   mixed   $base64     The data to decode.
 * @return  string  The data, decoded from transmit-safe base64.
 */
function base64_decode_safe($base64) {
    return base64_decode(str_replace(["-", "_"], ["/", "+"], $base64));
}

With that taken care of, we can look at the other requirement: a nonce. Standing for Number used ONCE, it's a sequence of random bytes that's used by the encryption algorithm. We don't need to keep it a secret, but we do need it in order to decrypt the data again at the other end (in addition to the password) - and we do need to ensure that we generate a new one for every comment key. Thankfully, this is also easy to do, in a similar manner to generating a password:

$nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);

With everything in place, we can look at upgrading the comment key generator itself. It looks much simpler now:

/**
 * Generates a new comment key.
 * Encrypts the current timestamp with sodium_crypto_secretbox() using the
 * given password and a freshly-generated random nonce, which is bundled
 * with the encrypted output.
 * @param  string $pass The password to encrypt with. Should be a base64-encoded string from key_generate_password().
 * @return string       A new comment key stamped at the current time.
 */
function key_generate($pass) {
    $pass = base64_decode_safe($pass);
    $nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);
    $encrypt = sodium_crypto_secretbox(
        strval(time()), // The thing we want to encrypt
        $nonce, // The nonce
        $pass // The password to encrypt with
    );

    return base64_encode_safe($nonce.$encrypt);
}

I bundle the nonce with the encrypted data here to ensure that the system continues to be stateless (i.e. we don't need to store any state information about a user on the server). I also encode the encrypted string with base64, as the encrypted strings contain lots of nasty characters (it's actually a binary byte array I suspect). This produces keys like this:

BOqDvr26XY9s8PhlmIZMIp6xCOZyfsh6J05S4Jp2cY3bL8ccf_oRgRMldrmzKk6RrnA=
Tt8H81tkJEqiJt-RvIstA_vz13LS8vjLgriSAvc1n5iKwHuEKjW93IMITikdOwr5-NY=
5NPvHg-l1GgcQ9ZbThZH7ixfKGqAtSBr5ggOFbN_wFRjo3OeJSWAcFNhQulYEVkzukI=

They are a bit long, but not unmanageable. In theory, I could make it a bit shorter by avoiding converting the integer output from time() to a string, but in my testing it didn't make much difference. I suspect that there's some sort of minimum length to the output string for some (probably good) reason.


php > var_dump(sodium_crypto_secretbox(time(), random_bytes(24), random_bytes(32)));
string(26) "GW$:���ߌ@]�+1b��������d%"
php > var_dump(sodium_crypto_secretbox(strval(time()), random_bytes(24), random_bytes(32)));
string(26) ":_}0H �E°9��Kn�p��ͧ��"

Now that we're encrypting comment keys properly, it isn't much good unless we can decrypt them as well! Let's do that too. The first step is to decode the base64 and re-split the nonce from the encrypted data:

$pass = base64_decode_safe($pass);
$key_enc_raw = base64_decode_safe($key);
$nonce = mb_substr($key_enc_raw, 0, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, "8bit");
$key_enc = mb_substr($key_enc_raw, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, null, "8bit");

This is derived from the example code. With the 2 separated once more, we can decrypt the key:

$key_dec = sodium_crypto_secretbox_open($key_enc, $nonce, $pass);

Apparently, according to the example code on the website I linked to above, if the return value isn't a string then the decryption failed. We should make sure we handle that when returning:

if(!is_string($key_dec))
    return null;
return intval($key_dec);

That completes the decryption code. Here it is in full:

/**
 * Decodes the given key.
 * @param  string $key  The key to decode.
 * @param  string $pass The password to decode it with.
 * @return int|null     The time that the key was generated, or null if the provided key was invalid.
 */
function key_decode($key, $pass) {
    $pass = base64_decode_safe($pass);
    $key_enc_raw = base64_decode_safe($key);
    $nonce = mb_substr($key_enc_raw, 0, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, "8bit");
    $key_enc = mb_substr($key_enc_raw, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, null, "8bit");
    $key_dec = sodium_crypto_secretbox_open($key_enc, $nonce, $pass);
    if(!is_string($key_dec)) return null;
    return intval($key_dec);
}

Explicitly returning null in key_decode() requires a small change to key_verify(), in order to prevent it from thinking that a key is really old if the decryption fails (null is treated as 0 in arithmetic operations, apparently). Let's update key_verify() to handle that:

/**
 * Verifies a key.
 * @param   string  $key        The key to decode and verify.
 * @param   string  $pass       The password to decode the key with.
 * @param   int     $min_age    The minimum required age for the key.
 * @param   int     $max_age    The maximum allowed age for the key.
 * @return  bool                Whether the key is valid or not.
 */
function key_verify($key, $pass, $min_age, $max_age) {
    $key_time = key_decode($key, $pass);
    if($key_time == null) return false;
    $age = time() - $key_time;
    return $age >= $min_age && $age <= $max_age;
}

A simple check is all that's needed. With that, the system is all updated! Time to generate a new password with the provided function and put it to use. You can do that directly from the PHP REPL (php -a):


php > require("lib/comment_system/comment_key.php");
php > var_dump(key_generate_password());
string(44) "S0bJpph5htdaKHZdVfo9FB6O4jOSzr3xZZ94c2Qcn44="

Obviously, this isn't the password I'm using on my blog :P
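Putting it all together, the whole flow - generating a password, stamping a key, and verifying it later - looks something like this (the age bounds here are arbitrary examples):

<?php
$pass = key_generate_password();
$key  = key_generate($pass);

// Only accept keys that are between 10 seconds and 24 hours old
var_dump(key_verify($key, $pass, 10, 24 * 60 * 60)); // bool(false) - too new!
var_dump(key_verify($key, $pass, 0, 24 * 60 * 60));  // bool(true)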

I've got a few more ideas I'd like to play around with to deter spammers even more whilst maintaining the current transparency - so you may see another post soon on this subject :-)

With everything in place, here's the complete script:

(Above: The full comment key system code. Can't see it? Check it out on GitHub Gist here.)

Found this useful? Got an improvement? Confused about something? Comment below!


Generating word searches for fun and profit

(Above: A Word search generated with the tool below)

A little while ago I was asked about generating a wordsearch in a custom shape. I thought to myself "someone has to have built this before...", and while I was right to an extent, I couldn't find one that let you use any shape you liked.

This, of course, was unacceptable! You've probably guessed it by now, but I went ahead and wrote my own :P

While I wrote it a little while ago, I apparently never got around to posting about it on here.

In short, it works by using an image you drop into the designated area on the page as the shape the word search should take. Each pixel is a single cell of the word search - with the alpha channel representing whether or not a character is allowed to be placed there (transparent means that it can't contain a character, and opaque means that it can).

Creating such an image is simple. Personally, I recommend Piskel or GIMP for this purpose.

Once done, you can start building a wordlist in the wordlist box at the right-hand-side. It should rebuild the word search as soon as you click out of the box. If it doesn't, then you've found a bug! Please report it here.

With the word search generated, you can use the Question Sheet and Answer Sheet links to open printable versions for export.

You can find my word search generator here:

I've generated a word search of the current tags in the tag cloud on this blog too: Question Sheet [50.3KiB], Answer Sheet [285.6KiB]

The most complicated part of this was probably the logistics behind rude word removal. Thankfully, I didn't have to find and maintain such a list of words myself, as the futility npm package does this for me - but algorithmically guaranteeing that censoring 1 rude word doesn't accidentally create another in a different direction is a nasty problem.

If you're interested in a more technical breakdown of one (or several!) particular aspects of this - let me know! While writing about all of it would probably make for an awfully long post, a specific aspect or two should be more manageable.

In the future, I'll probably revisit this and add additional features to it, such as the ability to restrict which directions words are placed in, for example. If you've got a suggestion of your own, open an issue (or even better, open a pull request :D)!

Placeholder Image Generator

A 1920x1080 white-text-on-black placeholder image, covered with a bunch of brightly coloured debug crosses and coloured text in different (meaningful?) places.

(Above: An example output with debugging turned on from my placeholder generation service)

For a quite a considerable amount of time now, I've been running my own placeholder image generation service here at starbeamrainbowlabs.com - complete with etag generation and custom colour selection. Although it's somewhat of an Easter Egg, it's not actually that hard to find if you know what you're looking for (hint: I use it on my home page, but you may need to view source to find it).

I decided to post about it now because I've just finished fixing the angle GET parameter - and it was interesting (and hard) enough to warrant a post to remind myself how I did it for future reference. Comment below if you knew about it before this blog post came out!

A randomly coloured placeholder image.

The script itself is split into 3 loose parts:

  • The initial settings / argument parsing
  • The polyfills and utility functions
  • The image generation.

My aim here was to keep the script contained within a single file - so that it fits well in a gist (hence why it currently looks a bit messy!). It's probably best if I show the code in full:

(Having trouble viewing code above? Visit it directly here: placeholder.php)

If you append ?help to the url, it will send back a plain text file detailing the different supported parameters and what they do:


(Can't see the above? View it directly here)

Aside from implementing the random value for fg_colour and bg_colour, the angle has been a huge pain. I use GD - a graphics drawing library that's bundled with practically every PHP installation ever - to draw the placeholder image, and when you ask it to draw some rotated text, it decides that it's going to pivot it around the bottom-left corner instead of the centre.

Naturally, this creates some complications. Though some people on the PHP manual page for said method (imagettftext) have attempted to correct for this (exhibits a and b), neither of their solutions worked for me. I don't know if it's because their code is just really old (13 and 5 years respectively as of the time of typing) or what.

Another randomly coloured placeholder image - this time thinner and with the text on a slight angle, with the text Enough! instead of the dimensions of the generated image.

Anyway, I finally decided recently that enough was enough - and set about fixing it myself. Basing my code on the latter of the 2 pre-existing solutions, I tried fixing it - but ended up deleting most of it and starting again. It did give me some idea as to how to solve the problem though: all I needed to do was find the centre of where the text would be drawn both when it is not rotated and when it is rotated, and correct for the difference (these are represented by the green and blue crosses respectively on the image at the top of this post).

PHP does provide a method for calculating the bounding box of some prospective text that you're thinking of drawing via imagettfbbox. Finding the centre of the box though sounded like a horrible maths-y problem that would take me ages to work through on a whiteboard first, but thankfully (after some searching around) Wikipedia had a really neat solution for finding the central point of any set of points.

It calls it the centroid, and claims that the geometric centre of a set of points is simply the average of all the points involved. It just so happens that the geometric centre is precisely what I'm after:

$$ C = \frac{a + b + c + d + \dots}{N} $$

...Where $C$ is the geometric centre of the shape as an $\binom{x}{y}$ vector, $a$, $b$, $c$, $d$, etc. are the input points as vectors, and $N$ is the number of points in question. This was nice and easy to program:

$bbox = imagettfbbox($size, 0, $fontfile, $text);
$orig_centre = [
    ($bbox[0] + $bbox[2] + $bbox[4] + $bbox[6]) / 4,
    ($bbox[1] + $bbox[3] + $bbox[5] + $bbox[7]) / 4
];
$text_width = $bbox[2] - $bbox[0];
$text_height = $bbox[3] - $bbox[5];

$bbox = imagettfbbox($size, $angle, $fontfile, $text);
$rot_centre = [
    ($bbox[0] + $bbox[2] + $bbox[4] + $bbox[6]) / 4,
    ($bbox[1] + $bbox[3] + $bbox[5] + $bbox[7]) / 4
];

The even and odd indexes of $bbox there refer to the $x$ and $y$ co-ordinates respectively of the 4 corners of the bounding box - imagettfbbox outputs the co-ordinates in 1 single flat array for some reason. I also calculate the original width and height of the text - in order to perform an additional corrective translation later.

With these in hand, I could calculate the actual position I need to draw it (indicated by the yellow cross in the image at the top of this post):

$$ delta = C_{rotated} - C_{original} $$

$$ pivot = \binom{pos_x - \left( delta_x + \frac{text\_width}{2} \right)}{pos_y - delta_y + \frac{text\_height}{2}} $$

In short, I calculate the distance between the 2 centre points calculated above, and then find the pivot point by taking this distance plus half the size of the text from the original central position we wanted to draw the text at. Here's that in PHP:

// Calculate the x and y shifting needed
$dx = $rot_centre[0] - $orig_centre[0];
$dy = $rot_centre[1] - $orig_centre[1];

// New pivot point
$px = $x - ($dx + $text_width/2);
$py = $y - $dy + $text_height/2;
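
With the corrected pivot point in hand, drawing the rotated text is then just a case of handing it to imagettftext(). A quick sketch - $image and $text_colour are assumed to be the GD image resource and an allocated colour respectively, since they aren't shown in the snippets above:

// Draw the text rotated about its centre by using the corrected pivot point
imagettftext($image, $size, $angle, (int)$px, (int)$py, $text_colour, $fontfile, $text);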

I generated a video of it in action (totally not because I thought it would look cool :P) with a combination of curl, and ffmpeg:

(Having issues with the above video? Try downloading it directly, or viewing it on YouTube)

This was really quite easy to generate. In a new folder, I executed the following:

echo {0..360} | xargs -P3 -d' ' -n1 -I{} curl 'https://starbeamrainbowlabs.com/images/placeholder/?width=1920&height=1920&fg_colour=inverse&bg_colour=white&angle={}' -o {}.jpeg
ffmpeg -r 60 -i %d.jpeg -q:v 3 /tmp/text-rotation.webm

The first command downloads all the required frames (3 at a time), and the second stitches them together. The -q:v 3 bit of the ffmpeg command is of note - by default webm videos apparently have a really low quality - this corrects that. Lower is better, apparently - it goes up to about 40 I seem to remember reading somewhere. 3 to 5 is supposed to be the right range to get it to look ok without using too much disk space.

The word Conclusion in blue on a yellow background, at a 20 degree angle - generated with my placeholder image generation service :D

That's about everything I wanted to talk about to remind myself about what I did. Let me know if there's anything you're confused about in the comments below, and I'll explain it in more detail.

(Found this interesting? Comment below!)


Demystifying Inverted Indexes

The test texts below overlaying one another in different colours, with a magnifying glass on centred top. (The magnifying glass in the above banner came from openclipart)

After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own.

First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server.

One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post.

First though, we need some data to index. How about the descriptions of some video games?

Kerbal Space Program

In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science!

 

Cross Code

Meet Lea as she logs into an MMO of the distant future. Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game.

 

Fort Meow

Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.

 

Factory Balls

Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it!

Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this:

Tokens
factory balls is the brand new bonte game
i already announced yesterday factory balls takes
part in the game design competition over
at jayisgames the goal of the design
competition was to create a ball physics
themed game i hope you enjoy it

Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory and factory to be the same word - otherwise when querying the index we'd have to remember to get the casing correct.
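Since Pepperminty Wiki is written in PHP, a minimal sketch of this tokenisation step might look something like this (illustrative only - not the actual implementation):

<?php
// Lowercase the text, strip punctuation, and split on whitespace.
function tokenize(string $text): array {
    $text = strtolower($text);
    // Replace anything that isn't a letter, number, or whitespace with a space
    $text = preg_replace('/[^a-z0-9\s]/', ' ', $text);
    return preg_split('/\s+/', trim($text));
}

var_dump(tokenize("Factory balls is the brand new bonte game!"));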

With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart I guess, except we'll be including the offset in the text of every token in the list. We'll also be removing some of the most common words in the list that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above:

Token Frequency Offsets
factory 2 0, 12
balls 2 1, 13
brand 1 4
new 1 5
bonte 1 6
game 3 7, 18, 37
i 2 8, 38
announced 1 10
yesterday 1 11
takes 1 14
design 2 19, 28
competition 2 20, 29
jayisgames 1 23
goal 1 25
create 1 32
ball 1 34
physics 1 35
themed 1 36
hope 1 39
enjoy 1 41

Very cool. Now we can generate an index for each page's content. The next step is to turn this into an inverted index. Basically, the difference between a normal index and an inverted index is that an entry in an inverted index contains not just the offsets for a single page, but all the pages that contain that token. For example, the Cross Code example above also contains the token game, so the inverted index entry for game would contain a list of offsets for both the Factory Balls and Cross Code pages.

Including the names of every page under every different token in the inverted index would be both inefficient computationally and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now:

Id Page Name
1 Kerbal Space Program
2 Cross Code
3 Fort Meow
4 Factory Balls

There - much better. In Pepperminty Wiki, this is handled by the ids class, which has a pair of public methods: getid($pagename) and getpagename($id). If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null should be returned.
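A minimal sketch of how such an id index might behave is below (the method names match the description above, but this is not the actual Pepperminty Wiki ids class):

<?php
// Maps page names to sequential numeric ids and back again.
// New ids are assigned transparently the first time a page name is seen.
class IdIndex {
    private $ids = [];       // page name → id
    private $pagenames = []; // id → page name

    public function getid(string $pagename) : int {
        if(!isset($this->ids[$pagename])) {
            $id = count($this->pagenames) + 1;
            $this->ids[$pagename] = $id;
            $this->pagenames[$id] = $pagename;
        }
        return $this->ids[$pagename];
    }

    public function getpagename(int $id) : ?string {
        return $this->pagenames[$id] ?? null; // null if the id isn't known
    }
}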

Now that we've got ids for our pages, let's look at generating that inverted index entry for game we talked about above. Here it is:

  • Term: game
Id Frequency Offsets
2 1 31
3 1 5
4 3 5, 12, 23

Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game.

This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand).

Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post.

You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, generating contexts based on the inverted index too?

Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below!

Shift-Reduce Parser Part 2: Building Furniture (1)

Hello and welcome! I got a bit distracted by other things, as you can tell, but I'm back with part 2 of my series on building a shift-reduce parser. If you're not sure what I'm talking about, then I'd advise reading part 1 first and then coming back here. It might be a good idea to re-read it anyway, just to refresh your memory :-)

The same flowchart from last time, but with the parse table section highlighted.

Last time, we created some data classes to store the various rules and tokens that we'll be generating. Today, we're going to build on that and start turning a set of rules into a parse table. Let's introduce the rules we'll be working with:

<start> ::= <expression>

<expression> ::= <expression> PLUS <value>
    | <term>

<term> ::= <term> DIVIDE <value>
    | <value>

<value> ::= <number>
    | BRACKET_OPEN <expression> BRACKET_CLOSE

<number> ::= DIGIT
    | <number> DIGIT

The above represents a very basic calculator-style syntax, which only supports adding and dividing. It's written in Backus-Naur Form, which is basically a standardised way of writing parsing rules.

To build a parse table, we first must understand what such a thing actually is. Let's take a look at an example:

state |            action            | goto
      |  *  |  +  |  0  |  1  |  $   | E | B
------+-----+-----+-----+-----+------+---+---
  0   |     |     | s1  | s2  |      | 3 | 4
  1   | r4  | r4  | r4  | r4  | r4   |   |
  2   | r5  | r5  | r5  | r5  | r5   |   |
  3   | s5  | s6  |     |     | goal |   |
  4   | r3  | r3  | r3  | r3  | r3   |   |
  5   |     |     | s1  | s2  |      |   | 7
  6   |     |     | s1  | s2  |      |   | 8
  7   | r1  | r1  | r1  | r1  | r1   |   |
  8   | r2  | r2  | r2  | r2  | r2   |   |

(Source: Adapted from the LR parser article on Wikipedia.)

While it looks complex at first, let's break it down. There are 3 parts to this table: the state, the action, and the goto. The action and goto columns represent what should happen when a particular symbol is encountered. In this case, the input stream contains both terminal symbols (i.e. DIGIT, DIVIDE, BRACKET_CLOSE, etc. in the case of our BNF above) and non-terminal symbols (i.e. number, term, expression, etc.) - if I understand it correctly - so there are actually 2 parts to the table here to make sure that both are handled correctly.

We'll be connecting this to our lexer, which outputs only terminal symbols, so we should be ok I believe (if you know better, please post a comment below!). The state refers to the state in the table. As I've mentioned before, a given state may contain one or more configurations. It's these configurations that give rise to the actions in the table above, such as s2 (shift and then go to state 2) or r3 (reduce using rule 3).

To use the table, the parser must know what state it's in, and then take a look across the top row for the next symbol it has in the token stream. Once found, it can follow it down to figure out what action it should take, as explained above. If there isn't an action in the box, then there must be an error in the input, as the table doesn't tell us what to do in this situation. To that end, we should try and generate a meaningful error message to help the user to find the mistake in the input (or the developer in the parser!).

We're kind of getting ahead of ourselves here though. We need to build this table first, and to do that, we need to figure out which configurations go in which state. And, going down the rabbit hole, we need to know what a configuration is. Again, it's best if I demonstrate. Consider the following parsing rule from our example BNF at the beginning of this post:

<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE

A single configuration represents a possible state of the parser at a particular instant in time. I could split the above rule up like so:

<value> ::= BRACKET_OPEN * <expression> BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> * BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE *

The asterisk represents where the parser might have gotten up to. Everything to the left is on the stack of the parser, and everything to the right hasn't happened yet.

Since this isn't a top-level rule (in our example that honour goes to the rule for the start non-terminal), the parser will never be in a position where the first item there doesn't exist yet on the stack, so we can ignore the configuration in which the asterisk would be to the left of BRACKET_OPEN.
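In terms of the data classes from the last post, each of those configurations is just a ParseTableConfiguration with a different RhsPosition - something like the following. This is purely an illustration rather than real table-building code, and the type strings are again just guesses:

// Illustration only: the 3 configurations above, expressed as ParseTableConfiguration instances.
ParserRule bracketRule = new ParserRule(null,
	ParserToken.NonTerm("value"),
	ParserToken.Term("BRACKET_OPEN"), ParserToken.NonTerm("expression"), ParserToken.Term("BRACKET_CLOSE")
);

// The asterisk sits just before the token at index RhsPosition
ParseTableConfiguration afterOpen = new ParseTableConfiguration(bracketRule, 1); // BRACKET_OPEN * <expression> BRACKET_CLOSE
ParseTableConfiguration afterExpr = new ParseTableConfiguration(bracketRule, 2); // BRACKET_OPEN <expression> * BRACKET_CLOSE
ParseTableConfiguration complete = new ParseTableConfiguration(bracketRule, 3); // BRACKET_OPEN <expression> BRACKET_CLOSE *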

Confused? Let me try and help here. Let's draw a diagram of how our parser is going to operate:

_(Source: Made by me, but adapted from the LR Parser article on Wikipedia)_

Basically, the parser will take in the input token stream and either shift a new terminal token onto the stack, or reduce one or more existing tokens on the stack into a single non-terminal token, which replaces those existing tokens on the stack. The configurations above represent possible states of the stack (the bit to the left of the asterisk), and possible directions that the parser could take when parsing (the bit to the right of the asterisk).

Finally, when the goal is reached, the output is returned to the caller (which, by the time we're done, should be a parse tree). Said tree can then be optimised and processed for whatever purpose we desire!

With this knowledge, we can deduce that we can build the entire table by recursing over the tree of rules, starting from the start state. By recursing over all the rules for all the non-terminals referenced by the rules we visit, we'll end up visiting every rule that's needed to reach the goal state. We could even generate a warning if we detect that some rules don't connect to this 'tree'. Here's a tree of our example ruleset from the beginning of this post:

A tree diagram of the rules detailed near the beginning of this post.

It's a bit spaghetti-ish, but it should be legible enough :P This gives us an idea as to how we're going to tackle this. Taking into account the data classes we created in the last post, we need to make sure we keep the following in mind:

  1. Since the main ShiftReduceParser class is going to hold the rules, the ParseTable class will need a reference to its parent ShiftReduceParser in order to query the rules.
  2. In light of this, the ShiftReduceParser should be responsible for satisfying any queries the ParseTable has about rules - the ParseTable should not have to go looking through & filtering the rule list held by ShiftReduceParser itself.
  3. ParseTable will need a recursive method that will take a single top-level rule and recurse over it and its child rules (according to the tree I've talked about above) - there's a rough sketch of this just after this list.
  4. This method in ParseTable will need to be extremely careful it doesn't get stuck in a loop. To that end, it'll have to keep track of whether it's already processed a rule or not.
  5. It'll probably also have to keep track of which configurations it has added to the table class structure we defined in the last post to avoid adding rules twice.
  6. Once ParseTable has figured out all the configurations and grouped them all into the right states, it will then have to recurse over the generated table and fill in all the shift / reduce / goal action(s) - not forgetting about the links to the other states they should point to.
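As a rough sketch of points 3 - 5, the recursive discovery of rules might look something like this. The FindRulesFor() method on ShiftReduceParser is a guess at the query method from point 2, not something we've actually written yet:

// Sketch only: recursively visit every rule reachable from a given non-terminal,
// remembering which rules we've already seen so that recursive rules don't loop forever.
private void DiscoverRules(ShiftReduceParser parser, string nonTerminalType, HashSet<ParserRule> visited)
{
	foreach (ParserRule nextRule in parser.FindRulesFor(nonTerminalType))
	{
		if (visited.Contains(nextRule))
			continue; // Already processed this rule - skip it
		visited.Add(nextRule);

		// Recurse into every non-terminal referenced on the right-hand side
		foreach (ParserToken nextToken in nextRule.RightSideSequence)
			if (nextToken.Class == ParserTokenClass.NonTerminal)
				DiscoverRules(parser, nextToken.Type, visited);
	}
}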

It's quite the laundry list! Thankfully, most of this is quite simple if we tackle it one step at a time. The most annoying bit is the grouping of configurations into states. This is done by looking at the token immediately before the asterisk in each configuration - all the configurations with the same token here will get grouped into the same state (while there are more complex algorithms that allow for more complex grammars, we'll stick with this for now as anything else makes my head hurt! Maybe in the future I'll look at figuring out precisely what kind of LR-style parser this is, and upgrading it to be a canonical LR(1) parser - the most advanced type I know of).
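In code, that grouping step might boil down to something as simple as the following, assuming we've already generated a flat list of configurations and that the ParseTable holds a States list (neither of which exists yet - this is just the idea, using LINQ's GroupBy):

// A sketch of the grouping step. Configurations with RhsPosition == 0 have no token
// before the asterisk, so they'd need special-casing - glossed over here.
var grouped = allConfigurations
	.Where(config => config.RhsPosition > 0)
	.GroupBy(config => config.TokenBeforeDot);

foreach (var group in grouped)
{
	ParseTableState nextState = new ParseTableState(table);
	nextState.Configurations.AddRange(group);
	table.States.Add(nextState);
}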

This is quite a lot to take in, so I think I'll leave this post here for you to digest - and we'll get to writing some code in the next one.

Found this useful? Spotted a mistake? Having trouble getting your head around it? Post a comment below!

Shift-reduce Parser Part 1: First Steps

Doing the Languages and Compilers module at University has opened my eyes to a much better and more extensible way of handling complex text in a way that can easily be read by any subsequent code I write. To that end, I've found that at least 3 different projects of mine could benefit from the inclusion of a shift-reduce parser, but I haven't been able to track one down for C♯ yet.

With this in mind, I thought to myself: "Why not build one myself?" Sure, my Lecturer didn't go into too many details as to how they work, but it can't be that difficult, can it? Oh, I had no idea.....

In this mini-series, I'm going to take you through the process of building a shift-reduce parser of your very own. As I write this, I haven't actually finished mine yet - I've just got to the important milestone of building a parse table! Thankfully, that's going to be a few posts away, as there's a fair amount of ground to cover until we get to that point.

Warning: This series is not for the faint of heart! It's rather complicated, and extremely abstract - making it difficult to get your head around. I've had great difficulty getting mine around it - and ended up writing it in multiple stages. If you want to follow along, be prepared for lots of research, theory, and preparation!

Let's start out by taking a look at what a shift-reduce parser does. If you haven't already, I'd recommend reading my previous compilers 101 post, which explains how to write a compiler, and the different stages involved. I'd also recommend checking out my earlier post on building a lexer, as it ties in nicely with the shift-reduce parser that we'll be building.

An overview of how a shift-reduce parser works.

In short, a shift-reduce parser compiles a set of BNF-style rules into a Parse Table, which it then utilises as a sort of state machine when parsing a stream of input tokens. We'll take a look at the table compilation process itself in a future blog post. In this post, let's set up some data structures to help us along when we get to that point. Here's the class structure we'll be going for:

An overview of the class structure we'll be creating in this blog post.

Let's start with a class to represent a single token in a rule:

public enum ParserTokenClass
{
    Terminal,
    NonTerminal
}

public struct ParserToken
{
    public readonly ParserTokenClass Class;
    public readonly string Type;

    public ParserToken(ParserTokenClass inTokenType, string inType)
    {
        Class = inTokenType;
        Type = inType;
    }

    public override bool Equals(object obj)
    {
        // Guard against being handed something that isn't a ParserToken
        if (!(obj is ParserToken)) return false;
        ParserToken otherTokenType = (ParserToken)obj;
        return Class == otherTokenType.Class && Type == otherTokenType.Type;
    }
    public override int GetHashCode()
    {
        return $"{Class}:{Type}".GetHashCode();
    }

    public override string ToString()
    {
        string terminalDisplay = Class == ParserTokenClass.Terminal ? "T" : "NT";
        return $"[ParserToken {terminalDisplay}: {Type}]";
    }

    public static ParserToken NonTerm(string inType)
    {
        return new ParserToken(ParserTokenClass.NonTerminal, inType);
    }
    public static ParserToken Term(string inType)
    {
        return new ParserToken(ParserTokenClass.Terminal, inType);
    }
}

Pretty simple! A token in a rule can either be a terminal (basically a token straight from the lexer), or a non-terminal (a token that the parser reduces a set of other tokens into), and has a type - which we represent as a string. Unfortunately, due to the complex comparisons we'll be doing later, it would be a huge hassle to use an enum with a generic class as I did in the lexer I built and linked to earlier.

Later on (once we've built the parse table), we'll extend this class to support attaching values and other such pieces of information to it, but for now we'll leave that out to aid simplicity.

I also override Equals() and GetHashCode() in order to make comparing tokens easier later on. Overriding ToString() makes the debugging process much easier later, as we'll see in the next post!
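As a quick usage example (the type strings here are arbitrary), creating and comparing tokens looks like this:

// Quick usage example of the ParserToken struct above
ParserToken digit = ParserToken.Term("DIGIT");
ParserToken number = ParserToken.NonTerm("number");

Console.WriteLine(digit);  // [ParserToken T: DIGIT]
Console.WriteLine(number); // [ParserToken NT: number]
Console.WriteLine(digit.Equals(ParserToken.Term("DIGIT"))); // True - same class & type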

With a class to represent a token, we also need one to represent a rule. Let's create one now:

public class ParserRule
{
    /// <summary>
    /// A function to call when a reduce operation utilises this rule.
    /// </summary>
    public Action MatchingAction;
    public ParserToken LeftSide;
    public ParserToken[] RightSideSequence;

    public ParserRule(Action inMatchingAction, ParserToken inLeftSide, params ParserToken[] inRightSideSequence)
    {
        if (inLeftSide.Class != ParserTokenClass.NonTerminal)
            throw new ArgumentException("Error: The left-hand side must be a non-terminal token type.");

        MatchingAction = inMatchingAction;
        LeftSide = inLeftSide;
        RightSideSequence = inRightSideSequence;
    }

    public bool RightSideSequenceMatches(IEnumerable<ParserToken> otherRhs)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRhs)
        {
            // If the other sequence is longer than this rule's right-hand-side, it can't match
            if (i >= RightSideSequence.Length)
                return false;
            if (!nextToken.Equals(RightSideSequence[i]))
                return false;

            i++;
        }
        return true;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        result.Append($"ParserRule: {LeftSide} = ");
        foreach (ParserToken nextToken in RightSideSequence)
            result.Append($" {nextToken}");
        result.Append(";");
        return result.ToString();
    }
}

The above represents a single parser rule, such as <number> ::= <digit> <number>. Here we have the token on the left-hand-side (which we make sure is a non-terminal), and an array of tokens (which can be either terminal or non-terminal) for the right-hand-side. We also have an Action (which is essentially a lambda function) that we'll call when we match against the rule, so that we have a place to hook into when we write code that actually does the tree building (not to be confused with the shift-reduce parser itself).

Here I also add a method that we'll need later, which compares an array of tokens against the current rule, to see if they match - and we override ToString() here again to aid debugging.
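For example, the <number> ::= <digit> <number> rule mentioned above could be constructed like so - the lambda here is just a placeholder for a real reduce action we'll write later:

// Example: constructing the rule <number> ::= <digit> <number> with a placeholder action
ParserRule numberRule = new ParserRule(
	() => Console.WriteLine("Matched <number> ::= <digit> <number>"),
	ParserToken.NonTerm("number"),
	ParserToken.NonTerm("digit"), ParserToken.NonTerm("number")
);

// Prints something like: ParserRule: [ParserToken NT: number] =  [ParserToken NT: digit] [ParserToken NT: number];
Console.WriteLine(numberRule);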

Now that we can represent tokens and rules, we can start thinking about representing configurations and states. Not sure what these are? All will be explained in the next post, don't worry :-) For now, a state can be seen as a row in the parse table, and it contains a number of configurations - which are like routes to other states that the parser decides between, depending on where it's gotten to in the token stream.

public enum ParseTableAction
{
    Shift,
    Reduce,
    Goal,
    Error
}

public class ParseTableConfiguration
{
    public readonly ParserRule Rule;
    public readonly int RhsPosition;

    public ParseTableAction LinkingAction = ParseTableAction.Error;
    public ParseTableState LinkingState = null;

    public ParserToken TokenAfterDot {
        get {
            return Rule.RightSideSequence[RhsPosition];
        }
    }
    public ParserToken TokenBeforeDot {
        get {
            return Rule.RightSideSequence[RhsPosition - 1];
        }
    }

    /// <summary>
    /// Whether this configuration is the last in the sequence of configurations for the specified rule or not.
    /// </summary>
    /// <value><c>true</c> if is last in rule; otherwise, <c>false</c>.</value>
    public bool IsLastInRule {
        get {
            return RhsPosition > Rule.RightSideSequence.Length - 1;
        }
    }

    public ParseTableConfiguration(ParserRule inRule, int inRhsPosition)
    {
        Rule = inRule;
        RhsPosition = inRhsPosition;
    }

    public IEnumerable<ParserToken> GetParsedRhs()
    {
        // Everything to the left of the dot - i.e. the part of the rule that's already on the stack
        return Rule.RightSideSequence.TakeWhile((ParserToken token, int index) => index < RhsPosition);
    }

    public bool MatchesRhsSequence(ParserRule otherRule)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRule.RightSideSequence)
        {
            if (i > RhsPosition)
                break;
            // Don't run off the end of this configuration's own rule
            if (i >= Rule.RightSideSequence.Length)
                return false;
            // Compare against *this* configuration's rule - not the other rule against itself
            if (!nextToken.Equals(Rule.RightSideSequence[i]))
                return false;

            i++;
        }
        return true;
    }

    public override bool Equals(object obj)
    {
        ParseTableConfiguration otherConfig = obj as ParseTableConfiguration;
        if (otherConfig == null) return false;
        return Rule == otherConfig.Rule && RhsPosition == otherConfig.RhsPosition;
    }
    public override int GetHashCode()
    {
        return $"{Rule}:{RhsPosition}".GetHashCode();
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();

        result.Append($"Configuration: {LinkingAction} ");
        if (LinkingState != null)
            result.Append($"to State {LinkingState.Id} ");
        result.Append($"{Rule.LeftSide} = ");

        for (int i = 0; i <= Rule.RightSideSequence.Length; i++)
        {
            if (i == RhsPosition)
                result.Append(" * ");
            if (i == Rule.RightSideSequence.Length)
                continue;
            result.Append($"{Rule.RightSideSequence[i]} ");
        }
        result.Append(";");
        return result.ToString();
    }
}

This class is slightly more complicated. First, we define an enum that holds information about what the parser should do if it chooses this configuration. Then, we declare the configuration class itself. This entails specifying which parse rule we're deriving the configuration from, which tokens in the right-hand-side of the rule should have been parsed already, and which should still be somewhere in the token stream. Again, I'll explain this in more detail in the next post!

Then, we declare a few utility methods and properties to fetch different parts of the configuration's rule, such as the token to the immediate left and right of the right-hand-side position (which is represented as a dot in the book I followed), all the tokens before the dot, and whether a given rule matches this configuration on the basis of everything before the dot.

Finally, I continue with the trend of overriding the equality checking methods and ToString(), as it makes a world of difference in the code coming up in future blog posts!
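To give an idea of how these fit together, here's a tiny illustrative example using a made-up bracketed-expression rule (the type strings and the null action are arbitrary):

// Illustrative only: a configuration with the dot sitting after BRACKET_OPEN
ParserRule valueRule = new ParserRule(null,
	ParserToken.NonTerm("value"),
	ParserToken.Term("BRACKET_OPEN"), ParserToken.NonTerm("expression"), ParserToken.Term("BRACKET_CLOSE")
);
ParseTableConfiguration config = new ParseTableConfiguration(valueRule, 1);

Console.WriteLine(config.TokenBeforeDot); // [ParserToken T: BRACKET_OPEN]
Console.WriteLine(config.TokenAfterDot);  // [ParserToken NT: expression]
Console.WriteLine(config.IsLastInRule);   // False - 2 tokens still to go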

Now that we've got a class for configurations, the last one on our list is one for the states themselves. Let's do that now:

public class ParseTableState
{
    public readonly ParseTable ParentTable;

    public int Id {
        get {
            return ParentTable.FindStateId(this);
        }
    }

    public List<ParseTableConfiguration> Configurations = new List<ParseTableConfiguration>();

    public ParseTableState(ParseTable inParentTable)
    {
        ParentTable = inParentTable;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        foreach(ParseTableConfiguration nextConfiguration in Configurations)
            result.AppendLine($"     - {nextConfiguration}");
        return result.ToString();
    }
}

Much simpler than the configuration class, right? :P As I mentioned earlier, all a state consists of is a list of configurations in that state. In our case, we'll be assigning an id to the states in our parse table, so I include a property here that fetches a state's id from the parent parse table that it's part of, to make the later code as simple as possible.
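The ParseTable class itself is coming in a future post, but as a point of reference, a minimal sketch of it - just enough for the ParentTable.FindStateId() call above to make sense - might look something like this. Don't take this as the final design:

// A minimal sketch of the ParseTable class, assumed here only so that
// ParseTableState.Id has something to call. The real class will do much more.
public class ParseTable
{
	public List<ParseTableState> States = new List<ParseTableState>();

	/// <summary>
	/// Finds the id (i.e. the index) of the given state in this table.
	/// </summary>
	public int FindStateId(ParseTableState targetState)
	{
		return States.IndexOf(targetState);
	}
}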

Still with me? Congratulations! You've got the beginnings of a shift-reduce parser. Next time, we'll expand on some of the theory behind the things I've touched on in this post, and possibly look at building the start of the recursive parse table builder itself.

Found this interesting? Confused about something? Comment below!
