Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blender blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression conference conferences containerisation css dailyprogrammer data analysis debugging defining ai demystification distributed computing dns docker documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions freeside future game github github gist gitlab graphics guide hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs latex learning library linux lora low level lua maintenance manjaro minetest network networking nibriboard node.js open source operating systems optimisation outreach own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference release releases rendering research resource review rust searching secrets security series list server software sorting source code control statistics storage svg systemquery talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 worldeditadditions xmpp xslt

MIDI to Music Box score converter

Keeping with the theme of things I've forgotten to blog about (every time I think I've mentioned everything, I remember something else), in this post I'm going to talk about yet another project of mine that's been happily sitting around (and I've nearly blogged about but then forgot on at least 3 separate occasions): A MIDI file to music box conversion script. A while ago, I bought one of these customisable music boxes:

(Above: My music box (centre), and the associated hole punching device it came with (top left).)

The basic principle is that you bunch holes in a piece of card, and then you feed it into the music box to make it play the notes you've punched into the card. Different variants exist with different numbers of notes - mine has a staggering 30 different nodes it can play!

It has a few problems though:

  1. The note names on the card are wrong
  2. Figuring out which one is which is a problem
  3. Once you've punched a hole, it's permanent
  4. Punching the holes is fiddly

Solving problem #4 might take a bit more work (a future project out-of-scope of this post), but the other 3 are definitely solvable.

MIDI is a protocol (and a file format) for storing notes from a variety of different instruments (this is just the tip of the iceberg, but it's all that's needed for this post), and it just so happens that there's a C♯ library on NuGet called DryWetMIDI that can read MIDI files in and allow one to iterate over the notes inside.

By using this and an homebrew SVG writer class (another blog post topic for another time once I've tidied it up), I implemented a command-line program I've called MusicBoxConverter (the name needs work - comment below if you can think of a better one) that converts MIDI files to an SVG file, which can then be printed (carefully - full instructions on the project's page - see below).

Before I continue, it's perhaps best to give an example of my (command line) program's output:

Example output

(Above: A lovely little tune from the TV classic The Clangers, converted for a 30 note music box by my MusicBoxConverter program. I do not own the song!)

The output from the program can (once printed to the correct size) then be cut out and pinned onto a piece of card for hole punching. I find paper clips work for this, but in future I might build myself an Arduino-powered device to punch holes for me.

The SVG generated has a bunch of useful features that makes reading it easy:

  • Each note is marked by a blue circle with a red plus sign to help with accuracy (the hole puncher that came with mine has lines on it that make a plus shape over the viewing hole)
  • Each hole is numbered to aid with getting it the right way around
  • Big arrows in the background ensure you don't get it backwards
  • The orange bar indicates the beginning
  • The blue circle should always be at the bottom and the and the red circle at the top, so you don't get it upside down
  • The green lines should be lined up with the lines on the card

While it's still a bit tedious to punch the holes, it's much easier be able to adapt a score in Musescore, export to MIDI, push it through MusicBoxConverter than doing lots of trial-and-error punching it directly unaided.

I won't explain how to use the program in this blog post in any detail, since it might change. However, the project's README explains in detail how to install and use it:

https://git.starbeamrainbowlabs.com/sbrl/MusicBoxConverter

Something worth a particular mention here is the printing process for the generated SVGs. It's somewhat complicated, but this is unfortunately necessary in order to avoid rescaling the SVG during the conversion process - otherwise it doesn't come out the right size to be compatible with the music box.

A number of improvements stand out to me that I could make to this project. While it only supports 30 note music boxes at the moment, it can easily be extended to support multiple other different type of music box (I just don't have them to hand).

Implementing a GUI is another possible improvement but would take a lot of work, and might be best served as a different project that uses my MusicBoxConverter CLI under the hood.

Finally, as I mentioned above, in a future project (once I have a 3D printer) I want to investigate building an automated hole punching machine to make punching the holes much easier.

If you make something cool with this program, please comment below! I'd love to hear from you (and I may feature you in the README of the project too).

Sources and further reading

  • Musescore - free and open source music notation program with a MIDI export function
  • MusicBoxConverter (name suggestions wanted in the comments below)

Easy Ansi Escape Code in C♯ and Javascript

I've been really rather ill over the weekend, so I wasn't able to write a post when I wanted to. I'm starting to recover now though, so I thought I'd write a quick post about a class I've written in C♯ (and later ported to Javascript via an ES6 Module) that makes using VT100 Ansi escape codes easy.

Such codes are really useful for changing the text colour & background in a terminal or console, and for moving the cursor around to draw fancy UI elements with just text.

I initially wrote it a few months ago as part of an ACW (Assessed CourseWork), but out of an abundance of caution I've held back on open-sourcing this particular code fragment that drove the code I submitted for that assessment.

Now that the assessment is done & marked though (and I've used this class in a few projects since), I feel comfortable with publishing the code I've written publically. Here it is in C♯:

(Can't see the above? Try this direct link instead)

In short, it's a static class that contains a bunch of Ansi escape codes that you can include in strings that you send out via Console.WriteLine() - no special fiddling required! Though, if you're on Windows, don't forget you need to enable escape sequences first. Here's an example that uses the C♯ class:

// ....
catch(Exception error) {
    Console.WriteLine($"{Ansi.FRed}Error: Oops! Something went wrong.");
    Console.WriteLine($"    {error.Message}{Ansi.Reset}");
    return;
}
// ....

Of course, you can get as elaborate as you like. In addition, if you need to disable all escape sequence output for some reason (e.g. you know you're writing to a log file), simply set Ansi.Enabled to false.

Some time later, I found myself needing this class in Javascript (for a Node.js application I'm pretty sure) quite badly (reason - I might blog about it soon :P). To that end, I wound up porting it from C♯ to Javascript. The process wasn't actually that painful - probably because it's a fairly small and simple class.

With porting complete, I intend to keep both versions in sync with each other as I add more features - but no promises :P

Here's the Javascript version:

(Can't see the above? Try this direct link instead)

The Javascript version isn't actually a static class as like the C♯ version, due to the way ES6 modules work. In the future, I'd like to do some science and properly figure out the ins and outs of ES6 modules and refactor this into the JS equivalent of a globally static class (or a Singleton, though in JS we don't have to worry about thread safety 'cause the only way to communicate between threads (also) in JS is through messaging).

Still, it works well enough for my purposes for now.

Found this interesting? Got an improvement? Just want to say hi? Comment below!

ASP.NET: First Impressions

Admittedly, I haven't really got too far into ASP.NET (core). I've only gone through the first few tutorials or so, and based on what I've found so far, I've decided that it warrants a full first impressions blog post.

ASP.NET is fascinating, because it takes the design goals centred around developer efficiency and combines them with the likes of PHP to provide a framework with which one can write a web-server. Such a combination makes for a promising start - providing developers with everything they need to rapidly create a web-based application that's backed by any one of a number of different types of database.

Coming part-and-parcel with the ASP.NET library comes Entity Framework. It's purpose is to provide an easy mechanism by which developers can both create and query a database. I haven't really explored it much, but it appears to perform this task well.

If I were to criticise it, I'd probably say that the existing tutorials on how to use it are far too Windows and Visual Studio-oriented. Being a Linux user, I found it somewhat of a challenge to wade though the large amount of Visual Studio-specific parts of the tutorial and piece together how it actually works - independently of the automatic code generators built-in to Visual Studio.

This criticism, I've found is a running theme throughout ASP.NET and ASP.NET Core. Even in the official tutorials (which, although they say you can use Visual Studio Code on macOS and Linux, don't actually make any accommodations for users of anything other than Visual Studio), it leans heavily on the inbuilt code and template generators - choosing to instruct you on how to make the absolute minimum amount of changes to the templates provided in order to achieve the goal of the tutorial.

This, unfortunately, leaves the reader wondering precisely how ASP.NET core works under the hood. For example, what does services.AddDefaultIdentity<IdentityUser>().AddEntityFrameworkStores<ApplicationDbContext>(); do? Or what's an IdentityUser, and how do I customise it? Why isn't ASP.NET just a NuGet package I can import? None of these things are explained.

Being the kind of person who works from the ground up, I'm increasingly finding the "all that matters is that it works" approach taken by ASP.NET to, ironically enough, ease the experience for developers new to the library, rather frustrating. For me, it's hard to work with something if I don't understand what it does and how it works - so a tutorial that leans heavily on templates and scaffolding (don't even get me started on that) confusing and unhelpful.

To an extent, I can compare my experience starting out with ASP.NET with my experience starting out with Android development in Java. Both experiences were rather painful, and both experiences were unpleasant because of the large amount of pre-generated template code.

Having said this, in Java's case, there was the additional pain from learning a new language (even if it is similar to C♯), and the irritation in keeping a constant balance between syntax errors from not catching an exception, and being unable to find a bug because it's actually an exception that's been eaten somewhere that I can't see.

Although ASP.NET doesn't have terrible exception handling rules, it does have it's fair share of issues. It's equivalent, I guess, would be the number of undocumented and difficult-to-search bugs and issues one encounters when setting it up for the first time - both on Windows (with Microsoft's own Visual Studio!) and on Linux (though, to be fair, it's only .NET Core that has issues here). The complexity of the system and the lack of decent tutorials and documentation result in a confusing and irritating experience trying to get it to work (especially on Windows).

In conclusion, I'm finding ASP.NET to be a failed attempt at bringing the amazing developer efficiency from .NET to web development, but I suspect that this is largely down to me being inexperienced with it. Hampered by unhelpful tutorials, opaque black-boxed frameworks with blurred lines between library and template (despite the fact it's apparently open-source), and heavy tie-ins with Visual Studio, I think I'll be using other technologies such as Node.js to develop web-based projects in the future.

Troubleshooting my dotnet setup

I've recently been setting up dotnet on my Artix Linux laptop for my course at University. While I'm unsure precisely what dotnet is intended to do (and how it's different to Mono), my current understanding is that it's an implementation of .NET Core intended for developing and running ASP.NET web applications (there might be more on ASP.NET in a later 'first impressions' post soon-ish).

While the distribution is somewhat esoteric (it's based on Arch Linux), I've run into a number of issues with the installation process and getting Monodevelop to detect it - and if what I've read whilst researching said issues, they aren't confined to a single operating system.

Since I haven't been able to find any concrete instructions on how to troubleshoot the installation for the specific issues I've been facing, I thought I'd blog about it to help others out.

Installation on Arch-based distributions is actually pretty easy. I did this:

sudo pacman -S dotnet-sdk

Easy!

Monodevelop + dotnet = headache?

After this, I tried opening Monodevelop - and found an ominous message saying something along the lines of ".NET Core SDK 2.2 is not installed". Strange. If I try dotnet in the terminal, I get something like this:

$ dotnet
Usage: dotnet [options]
Usage: dotnet [path-to-application]

.....

Turns out that it's a known bug. Sadly, there doesn't appear to be much interest in fixing it - and neither does there appear to be much information about how Monodevelop does actually detect a dotnet installation.

Thankfully, I've deciphered the bug report and done all the work for you :P The bug report appears to suggest that Monodevelop expects dotnet to be installed to the directory /usr/share/dotnet. My system didn't install it there, so went looking to find where it did install it to. Doing this:

whereis dotnet

Yielded just /usr/bin/dotnet. My first thought was that this was a symbolic link to the binary in the actual install directory, so I tried this to see:

ls -l /usr/bin/dotnet

Sadly, it was not to be. Instead of a symbolic link, I found instead what appeared to be a binary file itself - which could also be a hard link. Not to be outdone, I tried a more brute-force approach to find it:

sudo find / -mount -type d -name "dotnet"

Success! This gave a list of all directories on my main / root partition that are called dotnet. From there, it was easy to pick out that it actually installed it to /opt/dotnet.

Instead of moving it from the installation directory and potentially breaking my package manager, I instead opted to create a new symbolic link:

sudo ln -s /opt/dotnet /usr/share/dotnet

This fixed the issue, allowing Monodevelop to correctly detect my installation of dotnet.

Templates

Thinking my problems were over, I went to create a new dotnet project following a tutorial. Unfortunately, I ran into a number of awkward and random errors - some of which kept changing from run to run!

I created the project with the dotnet new subcommand like this:

dotnet new --auth individual mvc

Apparently, the template projects generated by the dotnet new subcommand are horribly broken. To this end, I re-created my project through Monodevelop with the provided inbuilt templates. I was met with a considerable amount more success here than I was with dotnet new.

HTTPS errors

The last issue I've run into is a large number of errors relating to the support for HTTPS that's built-in to the dotnet SDK.

Unfortunately, I haven't been able to resolve these. To this end, I disabled HTTPS support. Although this sounds like a bad idea, my reasoning is that in production, I would always have the application server itself run plain-old HTTP - and put it behind a reverse-proxy like Nginx that provides HTTPS, as this separates concerns. It also allows me to have just a single place that implements HTTPS support - and a single place that I have to constantly tweak and update to keep the TLS configuration secure.

To this end, there are 2 things you've got to do to disable HTTPS support. Firstly, in the file Startup.cs, find and comment out the following line:

app.UseHttpsRedirection();

In a production environment, you'll probably have your reverse-proxy configured to do this HTTP to HTTPS redirection anyway - another instance of separating concerns.

The other thing to do is to alter the endpoint and protocol that it listens on. Right click on the project name in the solution pane, click "Options", then "Run -> Configurations -> Default", then the "ASP.NET Core" tab, and remove the s in `https in the "App URL" box like this:

A screenshot of the above option that needs changing.

By the looks of things, you'll have to do this 2nd step on every machine you develop on - unless you also untick the "user-specific" box (careful you don't include any passwords etc. in the environment variables in the opposite tab in that case).

You may wish to consider creating a new configuration that has HTTPS disabled if you want to avoid changing the default configuration.

Found this useful? Got a related issue you've managed to fix? Comment below!

Enabling ANSI Escape Codes on Windows 10

In a piece of assessed coursework (ACW) I've done recently, I built a text-based user interface rendering engine. Great fun, but when I went to run it on Windows - all I got was garbage in the console window! I found this strange, since support was announced a year or 2 back. They've even got an extensive documentation page on the subject!

(Above: ANSI escape sequences rendering on Windows. Hooray! Source: This forum post on the Intel® Developer Zone Forums)

The problem lay in the fact that unlike Linux, you actually need to enable it by calling an unmanaged function in the Windows API and flipping an undocumented bit. Sounds terrible? It is.

Thankfully, due to the .NET runtime be it Microsoft's official implementation or Mono handles references to DLLs, it's fairly easy to write a method that flips the appropriate bit in a portable fashion, which I'd like to document in this blog post for future reference.

Firstly, let's setup a method that only executes on Windows. That's easily achieved by checking Environment.OSVersion:

if(Environment.OSVersion.Platform.ToString().ToLower().Contains("win")) {
    IConsoleConfigurer configurer = new WindowsConsoleConfiguerer();
    configurer.SetupConsole();
}

Here, we inspect the platform we're running on, and if it contains the substring win, then we can assume that we're on Windows.

Then, in order to keep the unmanaged code calls as loosely coupled and as far from the main program as possible, I've put bit-flipping code itself in a separate class and referenced it via an interface. This is probably overkill, but at least this way if I run into any further compilation issues it won't be too difficult to refactor it into a separate class library and load it via reflection.

Said interface needs only a single method:

internal interface IConsoleConfigurer
{
    void SetupConsole();
}

....I've marked this as internal, as it's not (currently) going to be used outside the assembly it's located in. If that changes in the future, I can always mark it public instead.

The implementation of this interface is somewhat more complicated. Here it is:

/// <summary>
/// Configures the console correctly so that it processes ANSI escape sequences.
/// </summary>
internal class WindowsConsoleConfiguerer : IConsoleConfigurer
{
    const int STD_OUTPUT_HANDLE = -11;
    const uint ENABLE_VIRTUAL_TERMINAL_PROCESSING = 4;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr GetStdHandle(int nStdHandle);

    [DllImport("kernel32.dll")]
    static extern bool GetConsoleMode(IntPtr hConsoleHandle, out uint lpMode);

    [DllImport("kernel32.dll")]
    static extern bool SetConsoleMode(IntPtr hConsoleHandle, uint dwMode);

    public void SetupConsole() {
        IntPtr handle = GetStdHandle(STD_OUTPUT_HANDLE);
        uint mode;
        GetConsoleMode(handle, out mode);
        mode |= ENABLE_VIRTUAL_TERMINAL_PROCESSING;
        SetConsoleMode(handle, mode);
    }
}

In short, the DllImport attributes and the extern keywords tell the .NET runtime that the method should be imported directly from a native DLL - not a .NET assembly.

The SetupConsole() method, that's defined by the IConsoleConfigurer interface above, then calls the native methods that we've imported - and because the .NET runtime only imports DLLs when they are first utilised, it compiles and runs just fine on Linux too :D

Found this helpful? Still having issues? Got a better way of doing this? Comment below!

RhinoReminds: An XMPP reminder bot for my convenience

A Black Rhino from WikiMedia Commons. (Above: A Picture of a Black Rhino. Source: WikiMedia Commons)

Many times when I write a program it's to solve a problem. With Pepperminty Wiki, it was that I needed a lightweight wiki engine - and MediaWiki was just too complex. With TeleConsole, it was that I wanted to debug a C♯ Program in an environment that had neither a debugger nor a console (I'm talking about you, Unity 3D).

Today, I'm releasing RhinoReminds, an XMPP bot that reminds you about things. As you might have guessed, this is the end product of a few different posts I've made on here recently:

You can talk to it like so:

Remind me to water the greenhouse tomorrow at 4:03pm
Show all reminders
Delete reminders 2, 3, 4, and 7
Remind me in 1 hour to check the oven

...and it'll respond accordingly. It figures out which action to take based on the first word of the sentence you send it, but after that it uses AI (specifically Microsoft.Recognizers.Text, which I posted about here) to work out what you how you want it to do it.

I'm still working out a few bugs (namely reconnecting automagically after the connection to the server is lost, and ensuring all the messages it sends in reply actually make sense), but it's at the point now where it's stable enough that I can release it to everyone who'd either like to use it for themselves, or is simply curious :-)

If you'd like to run an instance of the bot for yourself, I recommend heading over to my personal git server here:

https://git.starbeamrainbowlabs.com/sbrl/RhinoReminds

The readme file should contain everything you need to know to get started. If not, let me know by contacting me, or commenting here!

Unfortunately, I'm not able to offer a public instance of this bot at the moment, due to concerns about spam. However, patches to improve the bots resistance against spammers (perhaps a cooldown period or something or too many messages are sent? or a limit of 50 active reminders per account?) are welcome.

Found this interesting? Got a cool use for it? Want help setting it up yourself? Comment below!

Write an XMPP bot in half an hour

Recently I've looked at using AI to extract key information from natural language, and creating a system service with systemd. The final piece of the puzzle is to write the bot itself - and that's what I'm posting about today.

Since not only do I use XMPP for instant messaging already but it's an open federated standard, I'll be building my bot on top of it for maximum flexibility.

To talk over XMPP programmatically, we're going to need library. Thankfully, I've located just such a library which appears to work well enough, called S22.XMPP. Especially nice is the comprehensive documentation that makes development go much more smoothly.

With our library in hand, let's begin! Our first order of business is to get some scaffolding in place to parse out the environment variables we'll need to login to an XMPP account.

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

using S22.Xmpp;
using S22.Xmpp.Client;
using S22.Xmpp.Im;

namespace XmppBotDemo
{
    public static class MainClass
    {
        // Needed later
        private static XmppClient client;

        // Settings
        private static Jid ourJid = null;
        private static string password = null;

        public static int Main(string[] args)
        {
            // Read in the environment variables
            ourJid = new Jid(Environment.GetEnvironmentVariable("XMPP_JID"));
            password = Environment.GetEnvironmentVariable("XMPP_PASSWORD");

            // Ensure they are present
            if (ourJid == null || password == null) {
                Console.Error.WriteLine("XMPP Bot Demo");
                Console.Error.WriteLine("=============");
                Console.Error.WriteLine("");
                Console.Error.WriteLine("Usage:");
                Console.Error.WriteLine("    ./XmppBotDemo.exe");
                Console.Error.WriteLine("");
                Console.Error.WriteLine("Environment Variables:");
                Console.Error.WriteLine("    XMPP_JID         Required. Specifies the JID to login with.");
                Console.Error.WriteLine("    XMPP_PASSWORD    Required. Specifies the password to login with.");
                return 1;
            }

            // TODO: Connect here           

            return 0;
        }
    }
}

Excellent! We're reading in & parsing 2 environment variables: XMPP_JID (the username), and XMPP_PASSWORD. It's worth noting that you can call these environment variables anything you like! I chose those names as they describe their contents well. It's also worth mentioning that it's important to use environment variables for secrets passing them as command-line arguments cases them to be much more visible to other uses of the system!

Let's connect to the XMPP server with our newly read-in credentials:

// Create the client instance
client = new XmppClient(ourJid.Domain, ourJid.Node, password);

client.Error += errorHandler;
client.SubscriptionRequest += subscriptionRequestHandler;
client.Message += messageHandler;

client.Connect();

// Wait for a connection
while (!client.Connected)
    Thread.Sleep(100);

Console.WriteLine($"[Main] Connected as {ourJid}.");

// Wait forever.
Thread.Sleep(Timeout.Infinite);

// TODO: Automatically reconnect to the server when we get disconnected.

Cool! Here, we create a new instance of the XMPPClient class, and attach 3 event handlers, which we'll look at later. We then connect to the server, and then wait until it completes - and then write a message to the console. It looks like S22.Xmpp spins up a new thread, so unfortunately we can't catch any errors it throws with a traditional try-catch statement. Instead, we'll have to ensure we're really careful that we catch any exceptions we throw accidentally - otherwise we'll get disconnected!

It does appear that XmppClient catches some errors though, which trigger the Error event - so we should attach an event handler to that.

/// <summary>
/// Handles any errors thrown by the XMPP client engine.
/// </summary>
private static void errorHandler(object sender, ErrorEventArgs eventArgs) {
    Console.Error.WriteLine($"Error: {eventArgs.Reason}");
    Console.Error.WriteLine(eventArgs.Exception);
}

Before a remote contact is able to talk to our bot, they will send us a subscription request - which we'll need to either accept or reject. This is also done via an event handler. It's the SubscriptionRequest one this time:

/// <summary>
/// Handles requests to talk to us.
/// </summary>
/// <remarks>
/// Only allow people to talk to us if they are on the same domain we are.
/// You probably don't want this for production, but for developmental purposes
/// it offers some measure of protection.
/// </remarks>
/// <param name="from">The JID of the remote user who wants to talk to us.</param>
/// <returns>Whether we're going to allow the requester to talk to us or not.</returns>
public static bool subscriptionRequestHandler(Jid from) {
    Console.WriteLine($"[Handler/SubscriptionRequest] {from} is requesting access, I'm saying {(from.Domain == ourJid.Domain?"yes":"no")}");
    return from.Domain == ourJid.Domain;
}

This simply allows anyone on our own domain to talk to us. For development purposes this will offer us some measure of protection, but for production you should probably implement a whitelisting or logging system here.

The other interesting thing we can do here is send a user a chat message to either welcome them to the server, or explain why we rejected their request. To do this, we need to write a pair of utility methods, as sending chat messages with S22.Xmpp is somewhat over-complicated:

#region Message Senders

/// <summary>
/// Sends a chat message to the specified JID.
/// </summary>
/// <param name="to">The JID to send the message to.</param>
/// <param name="message">The messaage to send.</param>
private static void sendChatMessage(Jid to, string message)
{
    //Console.WriteLine($"[Bot/Send/Chat] Sending {message} -> {to}");
    client.SendMessage(
        to, message,
        null, null, MessageType.Chat
    );
}
/// <summary>
/// Sends a chat message in direct reply to a given incoming message.
/// </summary>
/// <param name="originalMessage">Original message.</param>
/// <param name="reply">Reply.</param>
private static void sendChatReply(Message originalMessage, string reply)
{
    //Console.WriteLine($"[Bot/Send/Reply] Sending {reply} -> {originalMessage.From}");
    client.SendMessage(
        originalMessage.From, reply,
        null, originalMessage.Thread, MessageType.Chat
    );
}

#endregion

The difference between these 2 methods is that one sends a reply directly to a message that we've received (like a threaded reply), and the other simply sends a message directly to another contact.

Now that we've got all of our ducks in a row, we can write the bot itself! This is done via the Message event handler. For this demo, we'll write a bot that echo any messages to it in reverse:

/// <summary>
/// Handles incoming messages.
/// </summary>
private static void messageHandler(object sender, MessageEventArgs eventArgs) {
    Console.WriteLine($"[Bot/Handler/Message] {eventArgs.Message.Body.Length} chars from {eventArgs.Jid}");
    char[] messageCharArray = eventArgs.Message.Body.ToCharArray();
    Array.Reverse(messageCharArray);
    sendChatReply(
        eventArgs.Message,
        new string(messageCharArray)
    );
}

Excellent! That's our bot complete. The full program is at the bottom of this post.

Of course, this is a starting point - not an ending point! A number of issues with this demo stand out. There isn't a whitelist, and putting the whole program in a single file doesn't sound like a good idea. The XMPP logic should probably be refactored out into a separate file, in order to keep the input settings parsing separate from the bot itself.

Other issues that probably need addressing include better error handling and more - but fixing them all here would complicate the example rather.

Edit: The code is also available in a git repository if you'd like to clone it down and play around with it :-)

Found this interesting? Got a cool use for it? Still confused? Comment below!

Complete Program

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using S22.Xmpp;
using S22.Xmpp.Client;
using S22.Xmpp.Im;

namespace XmppBotDemo
{
    public static class MainClass
    {
        private static XmppClient client;
        private static Jid ourJid = null;
        private static string password = null;

        public static int Main(string[] args)
        {
            // Read in the environment variables
            ourJid = new Jid(Environment.GetEnvironmentVariable("XMPP_JID"));
            password = Environment.GetEnvironmentVariable("XMPP_PASSWORD");

            // Ensure they are present
            if (ourJid == null || password == null) {
                Console.Error.WriteLine("XMPP Bot Demo");
                Console.Error.WriteLine("=============");
                Console.Error.WriteLine("");
                Console.Error.WriteLine("Usage:");
                Console.Error.WriteLine("    ./XmppBotDemo.exe");
                Console.Error.WriteLine("");
                Console.Error.WriteLine("Environment Variables:");
                Console.Error.WriteLine("    XMPP_JID         Required. Specifies the JID to login with.");
                Console.Error.WriteLine("    XMPP_PASSWORD    Required. Specifies the password to login with.");
                return 1;
            }

            // Create the client instance
            client = new XmppClient(ourJid.Domain, ourJid.Node, password);

            client.Error += errorHandler;
            client.SubscriptionRequest += subscriptionRequestHandler;
            client.Message += messageHandler;

            client.Connect();

            // Wait for a connection
            while (!client.Connected)
                Thread.Sleep(100);

            Console.WriteLine($"[Main] Connected as {ourJid}.");

            // Wait forever.
            Thread.Sleep(Timeout.Infinite);

            // TODO: Automatically reconnect to the server when we get disconnected.

            return 0;
        }

        #region Event Handlers

        /// <summary>
        /// Handles requests to talk to us.
        /// </summary>
        /// <remarks>
        /// Only allow people to talk to us if they are on the same domain we are.
        /// You probably don't want this for production, but for developmental purposes
        /// it offers some measure of protection.
        /// </remarks>
        /// <param name="from">The JID of the remote user who wants to talk to us.</param>
        /// <returns>Whether we're going to allow the requester to talk to us or not.</returns>
        public static bool subscriptionRequestHandler(Jid from) {
            Console.WriteLine($"[Handler/SubscriptionRequest] {from} is requesting access, I'm saying {(from.Domain == ourJid.Domain?"yes":"no")}");
            return from.Domain == ourJid.Domain;
        }

        /// <summary>
        /// Handles incoming messages.
        /// </summary>
        private static void messageHandler(object sender, MessageEventArgs eventArgs) {
            Console.WriteLine($"[Handler/Message] {eventArgs.Message.Body.Length} chars from {eventArgs.Jid}");
            char[] messageCharArray = eventArgs.Message.Body.ToCharArray();
            Array.Reverse(messageCharArray);
            sendChatReply(
                eventArgs.Message,
                new string(messageCharArray)
            );
        }

        /// <summary>
        /// Handles any errors thrown by the XMPP client engine.
        /// </summary>
        private static void errorHandler(object sender, ErrorEventArgs eventArgs) {
            Console.Error.WriteLine($"Error: {eventArgs.Reason}");
            Console.Error.WriteLine(eventArgs.Exception);
        }

        #endregion

        #region Message Senders

        /// <summary>
        /// Sends a chat message to the specified JID.
        /// </summary>
        /// <param name="to">The JID to send the message to.</param>
        /// <param name="message">The messaage to send.</param>
        private static void sendChatMessage(Jid to, string message)
        {
            //Console.WriteLine($"[Rhino/Send/Chat] Sending {message} -> {to}");
            client.SendMessage(
                to, message,
                null, null, MessageType.Chat
            );
        }
        /// <summary>
        /// Sends a chat message in direct reply to a given incoming message.
        /// </summary>
        /// <param name="originalMessage">Original message.</param>
        /// <param name="reply">Reply.</param>
        private static void sendChatReply(Message originalMessage, string reply)
        {
            //Console.WriteLine($"[Rhino/Send/Reply] Sending {reply} -> {originalMessage.From}");
            client.SendMessage(
                originalMessage.From, reply,
                null, originalMessage.Thread, MessageType.Chat
            );
        }

        #endregion
    }
}

Easy AI with Microsoft.Text.Recognizers

I recently discovered that there's an XMPP client library (NuGet) for .NET that I overlooked a few months ago, and so I promptly investigated the building of a bot!

The actual bot itself needs some polishing before I post about it here, but in writing said bot I stumbled across a perfectly brilliant library - released by Microsoft of all companies - that can be used to automatically extract common data-types from a natural-language sentence.

While said library is the underpinnings of the Azure Bot Framework, it's actually free and open-source. To that end, I decided to experiment with it - and ended up writing this blog post.

Data types include (but are not limited to) dates and times (and ranges thereof), numbers, percentages, monetary amounts, email addresses, phone numbers, simple binary choices, and more!

While it also lands you with a terrific number of DLL dependencies in your build output folder, the result is totally worth it! How about pulling a DateTime from this:

in 5 minutes

or this:

the first Monday of January

or even this:

next Monday at half past six

Pretty cool, right? You can even pull multiple things out of the same sentence. For example, from the following:

The host 1.2.3.4 has been down 3 times over the last month - the last of which was from 5pm and lasted 30 minutes

It can extract an IP address (1.2.3.4), a number (3), and a few dates and times (last month, 5pm, 30 minutes).

I've written a test program that shows it in action. Here's a demo of it working:

(Can't see the asciicast above? View it on asciinema.org)

The source code is, of course, available on my personal Git server: Demos/TextRecogniserDemo

If you can't check out the repo, here's the basic gist. First, install the Microsoft.Recognizers.Text package(s) for the types of data that you'd like to recognise. Then, to recognise a date or time, do this:

List<ModelResult> result = DateTimeRecognizer.RecognizeDateTime(nextLine, Culture.English);

The awkward bit is unwinding the ModelResult to get at the actual data. The matched text is stored in the ModelResult.Resolution property, but that's a SortedDictionary<string, object>. The interesting property inside which is value, but depending on the data type you're recognising - that can be an array too! The best way I've found to decipher the data types is to print the value of ModelResult.Resolution as a string to the console:

Console.WriteLine(result[0].Resolution.ToString());

The .NET runtime will helpfully convert this into something like this:

System.Collections.Generic.SortedDictionary`2[System.String,System.Object]

Very helpful. Then we can continue to drill down:

Console.WriteLine(result[0].Resolution["values"]);

This produces this:

System.Collections.Generic.List`1[System.Collections.Generic.Dictionary`2[System.String,System.String]]

Quite a mouthful, right? By cross-referencing this against the JSON (thanks, Newtonsoft.JSON!), we can figure out how to drill the rest of the way. I ended up writing myself a pair of little utility methods for dates and times:

public static DateTime RecogniseDateTime(string source, out string rawString)
{
    List<ModelResult> aiResults = DateTimeRecognizer.RecognizeDateTime(source, Culture.English);
    if (aiResults.Count == 0)
        throw new Exception("Error: Couldn't recognise any dates or times in that source string.");

    /* Example contents of the below dictionary:
        [0]: {[timex, 2018-11-11T06:15]}
        [1]: {[type, datetime]}
        [2]: {[value, 2018-11-11 06:15:00]}
    */

    rawString = aiResults[0].Text;
    Dictionary<string, string> aiResult = unwindResult(aiResults[0]);
    string type = aiResult["type"];
    if (!(new string[] { "datetime", "date", "time", "datetimerange", "daterange", "timerange" }).Contains(type))
        throw new Exception($"Error: An invalid type of {type} was encountered ('datetime' expected).");

    string result = Regex.IsMatch(type, @"range$") ? aiResult["start"] : aiResult["value"];
    return DateTime.Parse(result);
}

private static Dictionary<string, string> unwindResult(ModelResult modelResult)
{
    return (modelResult.Resolution["values"] as List<Dictionary<string, string>>)[0];
}

Of course, it depends on your use-case as to precisely how you unwind it, but the above should be a good starting point.

Once I've polished the bot I've written a bit, I might post about it on here.

Found this interesting? Run into an issue? Got a neat use for it? Comment below!

Markov Chains Part 4: Test Data

With a shiny-new markov chain engine (see parts 1, 2, and 3), I found that I had a distinct lack of test data to put through it. Obviously this was no good at all, so I decided to do something about it.

Initially, I started with a list of HTML colours (direct link; 8.6KiB), but that didn't produce very good output:

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16
errobiartrawbear
frelecteringupsy
lectrictomadolbo
vendellorazanigh
arvanginklectrit
dighoonbottlaven
onadeestersweese
ndiu
llighoolequorain
indeesteadesomiu

I see a few problems here. Firstly, it's treating each word as it's entity, where in fact I'd like it to generate n-grams on a line-by-line basis. Thankfully, this is easy enough with my new --no-split option:

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --no-split --length 16
med carrylight b
jungin pe red dr
ureelloufts blue
uamoky bluellemo
trinaterry aupph
utatellon reep g
radolitter brast
bian reep mardar
ght burnse greep
atimson-phloungu

Hrm, that's still rather unreadable. What if we make the n-grams longer by bumping the order?

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16 --order 4
on fuchsia blue 
rsity of carmili
e blossom per sp
ngel
ulean red au lav
as green yellowe
indigri
ly gray aspe
disco blus
berry pine blach

Better, but it looks like it's starting the generation process from inside the middle of words. We can fix that with my new --start-uppercase option, which ensures that each output always stars with an n-gram that begins with a capital letter. Unfortunately the wordlist is all lowercase:

air force blue
alice blue
alizarin crimson
almond
amaranth
amber
american rose
amethyst
android green
anti-flash white

This is an issue. The other problem is that with an order of 4 the choice-point ratio is dropping quite low - I got a low of just ~0.97 in my testing.

The choice-point ratio is a measure I came up with of the average number of different directions the engine could potential go in at each step of the generation process. I'd like to keep this number consistently higher than 2, at least - to ensure a good variety of output.

Greener Pastures

Rather than try fix that wordlist, let's go in search of something better. It looks like the CrossCode Wiki has a page that lists all the items in the entire game. That should do the trick! The only problem is extracting it though. Let's use a bit of bash! We can use curl to download the HTML of the page, and then xidel to parse out the text from the <a> tags inside tables. Here's what I came up with:

curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a"

This is a great start, but we've got blank lines in there, and the list isn't sorted alphabetically (not required, but makes it look nice :P). Let's fix that:

curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a" | awk "NF > 0" | sort

Very cool. Tacking wc -l on the end of the pipe chain I can we've got ourselves a list of 527(!) items! Here's a selection of input lines:

Rough Branch
Raw Meat
Shady Monocle
Blue Grass
Tengu Mask
Crystal Plate
Humming Razor
Everlasting Amber
Tracker Chip
Lawkeeper's Fist

Let's run it through the engine. After a bit of tweaking, I came up with this:

cat wordlists/Cross-Code-Items.txt | MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --start-uppercase --no-split --length 16 --order 3
Capt Keboossauci
Fajiz Keblathfin
King Steaf Sharp
Stintze Geakralt
Fruisty 'olipe F
Apper's TN
Prow Peptumn's C
Rus Recreetan Co
Veggiel Spiragma
Laver's Bolden M

That's quite interesting! With a choice-point ratio of ~5.6 at an order of 3, we've got a nice variable output. If we increase the order to 4, we get ~1.5 - ~2.3:

Edgy Hoo
Junk Petal Goggl
Red Metal Need C
Samurai Shel
Echor
Krystal Wated Li
Sweet Residu
Raw Stomper Thor
Purple Fruit Dev
Smokawa

It appears to be cutting off at the end though. Not sure what we can do about that (ideas welcome!). This looks interesting, but I'm not done yet. I'd like it to work on word-level too!

Going up a level

After making some pretty extensive changes, I managed to add support for this. Firstly, I needed to add support for word-level n-gram generation. Currently, I've done this with a new GenerationMode enum.

public enum GenerationMode
{
    CharacterLevel,
    WordLevel
}

Under the hood I've just used a few if statements. Fortunately, in the case of the weighted generator, only the bottom method needed adjusting:

/// <summary>
/// Generates a dictionary of weighted n-grams from the specified string.
/// </summary>
/// <param name="str">The string to n-gram-ise.</param>
/// <param name="order">The order of n-grams to generate.</param>
/// <returns>The weighted dictionary of ngrams.</returns>
private static void GenerateWeighted(string str, int order, GenerationMode mode, ref Dictionary<string, int> results)
{
    if (mode == GenerationMode.CharacterLevel) {
        for (int i = 0; i < str.Length - order; i++) {
            string ngram = str.Substring(i, order);
            if (!results.ContainsKey(ngram))
                results[ngram] = 0;
            results[ngram]++;
        }
    }
    else {
        string[] parts = str.Split(" ".ToCharArray());
        for (int i = 0; i < parts.Length - order; i++) {
            string ngram = string.Join(" ", parts.Skip(i).Take(order)).Trim();
            if (ngram.Trim().Length == 0) continue;
            if (!results.ContainsKey(ngram))
                results[ngram] = 0;
            results[ngram]++;
        }
    }
}

Full code available here. After that, the core generation algorithm was next. The biggest change - apart from adding a setting for the GenerationMode enum - was the main while loop. This was a case of updating the condition to count the number of words instead of the number of characters in word mode:

(Mode == GenerationMode.CharacterLevel ? result.Length : result.CountCharInstances(" ".ToCharArray()) + 1) < length

A simple ternary if statement did the trick. I ended up tweaking it a bit to optimise it - the above is the end result (full code available here). Instead of counting the words, I count the number fo spaces instead and add 1. That CountCharInstances() method there is an extension method I wrote to simplify things. Here it is:

public static int CountCharInstances(this string str, char[] targets)
{
    int result = 0;
    for (int i = 0; i < str.Length; i++) {
        for (int t = 0; t < targets.Length; t++)
            if (str[i] == targets[t]) result++;
    }
    return result;
}

Recursive issues

After making these changes, I needed some (more!) test data. Inspiration struck: I could run it recipe names! They've quite often got more than 1 word, but not too many. Searching for such a list proved to be a challenge though. My first thought was BBC Food, but their terms of service disallow scraping :-(

A couple of different websites later, and I found the Recipes Wikia. Thousands of recipes, just ready and waiting! Time to get to work scraping them. My first stop was, naturally, the sitemap (though how I found in the first place I really can't remember :P).

What I was greeted with, however, was a bit of a shock:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p1.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p2.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p3.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p4.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p5.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p6.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p7.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p8.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p9.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_14-p1.xml</loc></sitemap>
<sitemap><loc>https://services.wikia.com/discussions-sitemap/sitemap/3355</loc></sitemap>
</sitemapindex>
<!-- Generation time: 26ms -->
<!-- Generation date: 2018-10-25T10:14:26Z -->

Like who has a sitemap of sitemaps, anyways?! We better do something about this: Time for some more bash! Let's start by pulling out those sitemaps.

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc"

Easy peasy! Next up, we don't want that bottom one - as it appears to have a bunch of discussion pages and other junk in it. Let's strip it out before we even download it!

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0

With a list of sitemaps extract from the sitemap (completely coconuts I tell you) extracted, we need to download them all in turn and extract the page urls therein. This is, unfortunately, where it starts to get nasty. While a simple xargs call downloads them all easily enough (| xargs -n1 -I{} curl "{}" should do the trick), this outputs them all to stdout, and makes it very difficult for us to parse them.

I'd like to avoid shuffling things around on the file system if possible, as this introduces further complexity. We're not out of options yet though, as we can pull a subshell out of our proverbial hat:

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"'

Yay! Now we're getting a list of urls to all the pages on the entire wiki:

http://recipes.wikia.com/wiki/Mexican_Black_Bean_Soup
http://recipes.wikia.com/wiki/Eggplant_and_Roasted_Garlic_Babakanoosh
http://recipes.wikia.com/wiki/Bathingan_bel_Khal_Wel_Thome
http://recipes.wikia.com/wiki/Lebanese_Tabbouleh
http://recipes.wikia.com/wiki/Lebanese_Hummus_Bi-tahini
http://recipes.wikia.com/wiki/Baba_Ghannooj
http://recipes.wikia.com/wiki/Lebanese_Falafel
http://recipes.wikia.com/wiki/Lebanese_Pita_Bread
http://recipes.wikia.com/wiki/Kebab_Koutbane
http://recipes.wikia.com/wiki/Moroccan_Yogurt_Dip

One problem though: We want recipes names, not urls! Let's do something about that. Our next special guest that inhabits our bottomless hat is the illustrious sed. Armed with the mystical power of find-and-replace, we can make short work of these urls:

... | sed -e 's/^.*\///g' -e 's/_/ /g'

The rest of the command is omitted for clarity. Here I've used 2 sed scripts: One to strip everything up to the last forward slash /, and another to replace the underscores _ with spaces. We're almost done, but there are a few annoying hoops left to jump through. Firstly, there are A bunch of unfortunate escape sequences lying around (I actually only discovered this when the engine started spitting out random ones :P). Also, there are far too many page names that contain the word Nutrient, oddly enough.

The latter is easy to deal with. A quick grep sorts it out:

... | grep -iv "Nutrient"

The former is awkward and annoying. As far as I can tell, there's no command I can call that will decode escape sequences. To this end, I wound up embedding some Python:

... | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])"

This makes the whole thing much more intimidating that it would otherwise be. Lastly, I'd really like to sort the list and save it to a file. Compared to the above, this is chicken feed!

| sort >Dishes.txt

And there we have it. Bash is very much like lego bricks when you break it down. The trick is to build it up step-by-step until you've got something that does what you want it to :)

Here's the complete command:

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"' | sed -e 's/^.*\///g' -e 's/_/ /g' | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])" | grep -iv "Nutrient" | sort >Dishes.txt

After all that effort, I think we deserve something for our troubles! With ~42K(!) lines in the resulting file (42,039 to be exact as of the last time I ran the monster above :P), the output (after some tweaking, of course) is pretty sweet:

cat wordlists/Dishes.txt | mono --debug MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --words --start-uppercase --length 8
Lemon Lime Ginger
Seared Tomatoes and Summer Squash
Cabbage and Potato au
Endive stuffed with Lamb and Winter Vegetable
Stuffed Chicken Breasts in Yogurt Turmeric Sauce with
Blossoms on Tomato
Morning Shortcake with Whipped Cream Graham
Orange and Pineapple Honey
Mango Raspberry Bread
Tempura with a Southwestern
Rice Florentine with
Cabbage Slaw with Pecans and Parmesan
Pork Sandwiches with Tangy Sweet Barbecue
Tea with Lemongrass and Chile Rice
Butterscotch Mousse Cake with Fudge
Fish and Shrimp -
Cucumber Salad with Roast Garlic Avocado
Beans in the Slow
Apple-Cherry Vinaigrette Salad
California Avocado Chinese Chicken Noodle Soup with Arugula

...I really need to do something about that cutting off issue. Other than that, I'm pretty happy with the result! The choice-point ratio is really variable, but most of the time it's hovering around ~2.5-7.5, which is great! The output if I lower the order from 3 to 2 isn't too bad either:

Salata me Htapodi kai
Hot 'n' Cheese Sandwich with Green Fish and
Poisson au Feuilles de Milagros
Valentines Day Cookies with Tofu with Walnut Rice
Up Party Perfect Turkey Tetrazzini
Olives and Strawberry Pie with Iceberg Salad with
Mashed Sweet sauce for Your Mood with Dried
Zespri Gold Corn rice tofu, and Corn Roasted
California Avocado and Rice Casserole with Dilled Shrimp
Egyptian Tomato and Red Bell Peppers, Mango Fandango

This gives us a staggering average choice-point ratio of ~125! Success :D

One more level

After this, I wanted to push the limits of the engine, so see what it's capable of. The obvious choice here is Shakespeare's Complete Works (~5.85MiB). Pushing this through the engine required some work, as ~30 seconds is far too slow - namely optimising the pipeline as much as possible.

The Mono Profiler helped a lot here. With it, I discovered that string.StartsWith() is really slow. Like, ridiculously slow (though this is relative, since I'm calling it hundreds of thousand of times), as it's culture-aware. In our case, we can't be bothering with any of that, as it's not relevant anyway. The easiest solution is to write another extension method:

public static bool StartsWithFast(this string str, string target) {
    if (str.Length < target.Length) return false;
    return str.Substring(0, target.Length) == target;
}

string.Substring() is faster, so by utilising this instead of the regular string.StartsWith() yields us a perfectly gigantic boost! Next up, I noticed that I can probably parallelize the Linq query that builds the list of possible n-grams we can choose from next, so that it runs on all the CPU cores:

Parallel.ForEach(ngrams, (KeyValuePair<string, double> ngramData) => {
    if (!ngramData.Key.StartsWithFast(nextStartsWith)) return;
    if (!convNextNgrams.TryAdd(ngramData.Key, ngramData.Value))
        throw new Exception("Error: Failed to add to staging ngram concurrent dictionary");
});

Again, this netted a another huge gain. With this and a few other architectural changes, I was able to chop the time down to a mere ~4 seconds (for a standard 10 words)! In the future, I might experiment with selective unmanaged code via the unsafe keyword to see if I can do any better.

For now, it's fast enough to enjoy some random Shakespeare on-demand:

What should they since that to that tells me Nero
He doth it turn and and too, gentle
Ha! you shall not come hither; since that to provoke
ANTONY. No further, sir; a so, farewell, Sir
Bona, joins with
Go with your fingering,
From fairies and the are like an ass which is
The very life-blood of our blood, no, not with the
Alas, why is here-in which
Transform'd and weak'ned? Hath Bolingbroke

Very interesting. The choice-point ratios sit at ~20 and ~3 for orders 3 and 4 respectively, though I got as high as 188 for an order of 3 during my testing. Definitely plenty of test data here :P

Conclusion

My experiments took me to several other places - which, if I included them all here, would result in an post much much longer than this! I scripted the download of several other wordlists in download.sh (direct link, 4.2KiB), if you're interested, with ready-downloaded copies in the wordlists folder of the repository.

I would advise reading the table in the README that gives credit to where I sourced each list, because of course I didn't put any of them together myself - I just wrote the script :P

Particularly of note is the Starbound list, which contains a list of all the blocks and items from the game Starbound. I might post about that one separately, as it ended up being a most interesting adventure.

In the future, I'd like to look at implementing a linguistic drift algorithm, to try and improve the output of the engine. The guy over at Here Dragons Abound has a great post on the subject, which you should definitely read if you're interested.

Found this interesting? Got an idea for another wordlist I can push though my engine? Confused by something? Comment below!

Sources and Further Reading

Disassembling .NET Assemblies with Mono

As part of the Component-Based Architectures module on my University course, I've been looking at what makes the .NET ecosystem tick, and how .NET assemblies (i.e. .NET .exe / .dll files) are put together. In the process, we looked as disassembling .NET assemblies into the text-form of the Common Intermediate Language (CIL) that they contain. The instructions on how to do this were windows-specific though - so I thought I'd post about the process on Linux and other platforms here.

Our tool of choice will be Mono - but before we get to that we'll need something to disassemble. Here's a good candidate for the role:

using System;

namespace SBRL.Demo.Disassembly {
    static class Program {
        public static void Main(string[] args) {
            int a = int.Parse(Console.ReadLine()), b = 10;
            Console.WriteLine(
                "{0} + {1} = {2}",
                a, b,
                a + b
            );
        }
    }
}

Excellent. Let's compile it:

csc Program.cs

This should create a new Program.exe file in the current directory. Before we get to disassembling it, it's worth mentioning how the compilation and execution process works in .NET. It's best explained with the aid of a diagram:

Left-to-right flowchart: Multiple source languages get compiled into Common Intermediate Language, which is then executed by an execution environment.

As is depicted in the diagram above, source code in multiple languages get compiled (maybe not with the same compiler, of course) into Common Intermediate Language, or CIL. This CIL is then executed in an Execution Environment - which is usually a virtual machine (Nope! not as in Virtual Box and KVM. It's not a separate operating system as such, rather than a layer of abstraction), which may (or may not) decide to compile the CIL down into native code through a process called JIT (Just-In-Time compilation).

It's also worth mentioning here that the CIL code generated by the compiler is in binary form, as this take up less space and is (much) faster for the computer to operate on. After all, CIL is designed to be efficient for a computer to understand - not people!

We can make it more readable by disassembling it into it's textual equivalent. Doing so with Mono is actually quite simple:

monodis Program.exe >Program.il

Here I redirect the output to a file called Program.il for convenience, as my editor has a plugin for syntax-highlighting CIL. For those reading without access to Mono, here's what I got when disassembling the above program:

.assembly extern mscorlib
{
  .ver 4:0:0:0
  .publickeytoken = (B7 7A 5C 56 19 34 E0 89 ) // .z\V.4..
}
.assembly 'Program'
{
  .custom instance void class [mscorlib]System.Runtime.CompilerServices.CompilationRelaxationsAttribute::'.ctor'(int32) =  (01 00 08 00 00 00 00 00 ) // ........

  .custom instance void class [mscorlib]System.Runtime.CompilerServices.RuntimeCompatibilityAttribute::'.ctor'() =  (
        01 00 01 00 54 02 16 57 72 61 70 4E 6F 6E 45 78   // ....T..WrapNonEx
        63 65 70 74 69 6F 6E 54 68 72 6F 77 73 01       ) // ceptionThrows.

  .custom instance void class [mscorlib]System.Diagnostics.DebuggableAttribute::'.ctor'(valuetype [mscorlib]System.Diagnostics.DebuggableAttribute/DebuggingModes) =  (01 00 07 01 00 00 00 00 ) // ........

  .hash algorithm 0x00008004
  .ver  0:0:0:0
}
.module Program.exe // GUID = {D6162DAD-AD98-45B3-814F-C646C6DD7998}

.namespace SBRL.Demo.Disassembly
{
  .class private auto ansi beforefieldinit Program
    extends [mscorlib]System.Object
  {

    // method line 1
    .method public static hidebysig 
           default void Main (string[] args)  cil managed 
    {
        // Method begins at RVA 0x2050
    .entrypoint
    // Code size 47 (0x2f)
    .maxstack 5
    .locals init (
        int32   V_0,
        int32   V_1)
    IL_0000:  nop 
    IL_0001:  call string class [mscorlib]System.Console::ReadLine()
    IL_0006:  call int32 int32::Parse(string)
    IL_000b:  stloc.0 
    IL_000c:  ldc.i4.s 0x0a
    IL_000e:  stloc.1 
    IL_000f:  ldstr "{0} + {1} = {2}"
    IL_0014:  ldloc.0 
    IL_0015:  box [mscorlib]System.Int32
    IL_001a:  ldloc.1 
    IL_001b:  box [mscorlib]System.Int32
    IL_0020:  ldloc.0 
    IL_0021:  ldloc.1 
    IL_0022:  add 
    IL_0023:  box [mscorlib]System.Int32
    IL_0028:  call void class [mscorlib]System.Console::WriteLine(string, object, object, object)
    IL_002d:  nop 
    IL_002e:  ret 
    } // end of method Program::Main

    // method line 2
    .method public hidebysig specialname rtspecialname 
           instance default void '.ctor' ()  cil managed 
    {
        // Method begins at RVA 0x208b
    // Code size 8 (0x8)
    .maxstack 8
    IL_0000:  ldarg.0 
    IL_0001:  call instance void object::'.ctor'()
    IL_0006:  nop 
    IL_0007:  ret 
    } // end of method Program::.ctor

  } // end of class SBRL.Demo.Disassembly.Program
}

Very interesting. There are a few things of note here:

  • The metadata at the top of the CIL tells the execution environment a bunch of useful things about the assembly, such as the version number, the classes contained within (and their signatures), and a bunch of other random attributes.
  • An extra .ctor method has been generator for us automatically. It's the class' constructor, and it automagically calls the base constructor of the object class, since all classes are descended from object.
  • The ints a and b are boxed before being passed to Console.WriteLine. Exactly what this does and why is quite complicated, and best explained by this Stackoverflow answer.
  • We can deduce that CIL is a stack-based language form the add instruction, as it has no arguments.

I'd recommend that you explore this on your own with your own test programs. Try changing things and see what happens!

  • Try making the Program class static
  • Try refactoring the int.Parse(Console.ReadLine()) into it's own method. How is the variable returned?

This isn't all, though. We can also recompile the CIL back into an assembly with the ilasm code:

ilasm Program.il

This makes for some additional fun experiments:

  • See if you can find where b's value is defined, and change it
  • What happens if you alter the Console.WriteLine() format string so that it becomes invalid?
  • Can you get ilasm to reassemble an executable into a .dll library file?

Found this interesting? Discovered something cool? Comment below!

Sources and Further Reading

Art by Mythdael