Stardust | Starbeamrainbowlabs

applause-cli: A Node.js CLI handling library

Continuing in the theme of things I've forgotten to talk about, I'd like to post about another package I've released a little while ago. I've been building a number of command line interfaces for my PhD, so I thought it would be best to use a library for this function.

I found [clap](), but it didn't quite do what I wanted - so I wrote my own inspired by it. Soon enough I needed to use the code in several different projects, so I abstracted the logic for it out and called it applause-cli, which you can now find on npm.

It has no dependencies, and it allows you do define a set of arguments and have it parsed out the values from a given input array of items automatically. Here's an example of how it works:

import Program from 'applause-cli';

let program = new Program("path/to/package.json");
program.argument("food", "Specifies the food to find.", "apple")
    .argument("count", "The number of items to find", 1, "number");

program.parse(process.argv.slice(2)); // Might return { food: "banana", count: 6 }

I even have automated documentation generated with documentation and uploaded to my website via Continuous Integration: https://starbeamrainbowlabs.com/code/applause-cli/. I've worked pretty hard on the documentation for this library actually - it even has integrated examples to show you how to use each function!

The library can also automatically generate help output from the provided information when the --help argument is detected too - though I have yet to improve the output if a subcommand is called (e.g. mycommand dostuff --help) - this is on my todo list :-)

Here's an example of the help text it automatically generates:

If this looks like something you'd be interested in using, I recommend checking out the npm package here: https://www.npmjs.com/package/applause-cli

For the curious, applause-cli is open-source under the MPL-2.0 licence. Find the code here: https://github.com/sbrl/applause-cli.

PhD Aside: Reading a file descriptor line-by-line from multiple Node.js processes

Phew, that's a bit of a mouthful. We're taking a short break from the cluster series of posts (though those will be back next week I hope), because I've just run into a fascinating problem, the solution to which I thought I'd share here - since I didn't find a solution elsewhere on the web.

For my PhD, I've got a big old lump of data, and it all needs preprocessing before I train an AI model (or a variant thereof, since I'm effectively doing video-to-image translation). Unfortunately, one of the preprocessing steps is really slow. And because I'll naturally be training my AI for multiple epochs, the problem is multiplied.....

The solution, of course, is to do all the preprocessing up front such that I can just read the data in and push it directly into a Tensor in the right format. However, doing this on such a large dataset would take forever if I did the items 1 by 1. The thing is that Javascript isn't inherently multithreaded. I like this quote, as it describes the situation rather well:

In Javascript everything runs in parallel... except your code

--Felix Geisendörfer

In other words, when Node.js is reading or writing to and from the network, disk, or other places it can do lots of things at the same time because it does them asynchronously. The Javascript that gets executed though is only done on a single thread though.

This is great for io-bound tasks (such as a web server), as Node.js (a Javascript runtime) can handle many requests at the same time. On a side note, this is also the reason why Nginx is more efficient than Apache (because Nginx is event based too like Javascript, unlike Apache which is thread based).

It's not so great though for CPU bound tasks, such as the one I've got on my hands. All is not lost though, because Node.js has a number of useful functions inbuilt that we can use to tackle the issue.

Firstly, Node.js has a clever forking system. By using child_process.fork(), a single Node.js process can create multiple copies of itself to act as workers:

// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i &lt; os.cpus().length; i++) {
    workers.push(
        child_process.fork("worker.mjs")
    );
}

// worker.js
console.log(`Hello, world from a child process!`);

Very useful! The next much more sticky problem though is how to actually preprocess the data in a performant manner. In my specific case, I'm piping the data in from a shell script that decompresses a number of gzip archives in a specific order (as of the time of typing I have yet to implement this).

Because this is a single pipe we're talking about here, the question now arises of how to allow all the child processes to access the data that's coming in from the standard input of the master process.

I've actually encountered an issue like this one before. I initially tried reading it in on the master process, and then using worker.send(message) to send it to the worker processes for processing. This didn't end up working very well, because the master process became a bottleneck as it couldn't read from the standard input and send stuff to the workers fast enough.

With this in mind, I came up with a new plan. In Node.js, when you're forking to create a worker process, you can supply it with some custom file descriptors upon initialisation. So long as it has at least IPC (inter-process communication) channel for passing messages back and forth with the .send() and .on("message", (message) => ....) method and listeners, it doesn't actually care what you do with the others.

Cue file descriptor cloning:


// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i

I've highlighted the key line here (line 10 for those who can't see it). Here we tell it to clone file descriptors 0, 1, and 2 - which refer to stdin, stdout, and stderr respectively. This allows the worker processes direct access to the master process' stdin, stdout, and stderr.

With this, we can read from the same pipe with as many worker processes as we like - so long as they do so 1 at a time.

With this sorted, it gives rise to the next issue: reading line-by-line. Packages exist on npm (such as nexline, my personal favourite) to read from a stream line-by-line, but they have the unfortunate side-effect of maintaining a read buffer. While this is great for performance, it's not so great in my situation because it ends up scrambling the input! This is because said read buffer would be local to each worker process, so when the next worker along reads, it will skip a random number of bytes and start reading from the next bit along.

This means that I need to implement a custom method that reads a single line from a given file descriptor without maintaining a read buffer. I came up with this:

import fs from 'fs';

//  .....

// Global buffer to avoid unnecessary memory churn
let buffer = Buffer.alloc(4096);
function read_line_unbuffered(fd) {
    let i = 0;
    while(true) {
        let bytes_read = fs.readSync(fd, buffer, i, 1);
        if(bytes_read !== 1 || buffer[i] == 0x0A) {
            if(i == 0 && bytes_read == null) return null;
            return buffer.toString("utf-8", 0, i); // This is not inclusive, so we can abuse it to trim the \n off the end
        }

        i++;
        if(i == buffer.length) {
            let new_buffer = new Buffer(Math.ceil(buffer.length * 1.5));
            buffer.copy(new_buffer);
            buffer = new_buffer;
        }
    }
}

I read from the given file descriptor character by character directly into a buffer. As soon as it detects a new line character (\n, or character code 0x0A), it returns the new line. If we run out of space in the buffer, then we create a new larger one, copy the old buffer's contents into it, and keep going.

I maintain a global buffer here, because this helps to avoid unnecessary memory churn. In my case, the lines I'm reading in a rather long (hence the need to clone the file descriptor in the first place), and if I didn't keep a shared buffer I'd be allocating and deallocating a new pretty large buffer every time.

This also has the nice side-effect that we keep the largest buffer we've had to use so far around for next time, avoiding the need for subsequent copies to larger and larger buffers.

Finally, we can also guarantee that it won't be a problem if we call this multiple times, because as I explained above Javascript is single-threaded, so if we call the function multiple times in quick succession each read will happen 1 after another.

With this chain of Node.js features, we can read a large amount of data from and efficiently process the content of a pipe. The trick from here is to implement a proper messaging and locking system to avoid reading from the stream at the same time, and avoid write to the standard output at the same time.

Taking this further, I ended up with this:

(Licence: Mozilla Public Licence 2.0)

This correctly ensures that only 1 worker process reads from the stream at the same time. It doesn't do anything with the result though except log a message to the console, but when I implement that I'll implement a similar messaging system to ensure that only 1 process writes to the output at once.

On that note, my data is also ordered, so I'll have to implement a complicated cache system // ordering system to ensure that I write them to the standard output in the same order I read them in. When I do implement that, I'll probably blog about that too....

The main problem I still have with this solution is that I'm reading from the input stream. I haven't done any proper testing, but I'm pretty sure that doing so will be really slow. I not sure I can avoid this though and read a few KiBs at a time, because I don't currently know of any way to put the extra characters back into the input stream.

If anyone has a solution to that that increases performance, I'd love to know. Leave a comment below!

The legend of the disappearing data in Node.js

Happy leap day! :D

A green tree frog :D

_(Above: A nice green tree frog - source)_

Recently, I've been doing a bunch of work in Node.js streaming large amounts of data. For the most part the experience has been highly pleasurable, as Node.js makes it so easy! I have encountered a few pain points though, the most significant of which I'd like to talk about here.

In Node.js, streams come in 3 main forms:

Readable Streams
Writable Streams
Transform Streams

In addition, you can either plug streams together with the .pipe() method, write to them directly with the .write() method, or any combination thereof - allowing you to build up a chain of streams that enables data to flow through your program.

The problems start when you try and write large amounts of data to a stream directly:

import fs from 'fs';

import do_work from 'somewhere';
import get_some_stream from 'somewhere_else';

let stream_in = get_some_stream();
let out = fs.createWriteStream("/tmp/test.txt");
for(let i = 0; i < 1000000; i++) {
    out.write(do_work(stream_in, i))
}

(Above: Just an example of writing lots of data to a stream)

When this happens, you start to lose random chunks of data. The reason for this is not obvious, but it is buried in the Node.js docs:

The writable.write() method writes some data to the stream, and calls the supplied callback once the data has been fully handled. If an error occurs, the callback may or may not be called with the error as its first argument. To reliably detect write errors, add a listener for the 'error' event.

This is a huge pain. It means that you have to wrap all write calls like this:

"use strict";

/**
 * Writes data to a stream, automatically waiting for the drain event if asked.
 * @param   {stream.Writable}           stream_out  The writable stream to write to.
 * @param   {string|Buffer|Uint8Array}  data        The data to write.
 * @return  {Promise}   A promise that resolves when writing is complete.
 */
function write_safe(stream_out, data) {
    return new Promise((resolve, reject) => {
        // Handle errors
        let handler_error = (error) => {
            stream_out.off("error", handler_error);
            reject(error);
        };
        stream_out.on("error", handler_error);

        if(stream_out.write(data)) {
            // We're good to go
            stream_out.off("error", handler_error);
            resolve();
        }
        else {
            // We need to wait for the drain event before continuing
            stream_out.once("drain", () => {
                stream_out.off("error", handler_error);
                resolve();
            });
        }
    });
}

export { write_safe };

Such a huge boilerplate for such a simple task! Basically, if the .write() method returns false, you have to wait until the drain event is fired on the writeable stream before continuing to write to the stream. The reason for this I think is that it signals that the write buffer is full, and it needs to be drained before writing can continue.

This is ok, but it would be nice if this was abstracted away behind a single method, such as the wrapper I've shown above. Something like a async stream.Writable.writeAsync() would be great, but it doesn't currently exist.

I think I'm going to open an issue about it - since it seems very doable and just silly that it doesn't exist already.

Summer Project Series List

At this point, it is basically the end of my summer project series - at least for a while while I start my PhD (more on that in a future post). To this end, I'm releasing the series list for it.

Powahroot: Client and Server-side routing in Javascript

If I want to really understand something, I usually end up implementing it myself. This is the case with my latest library - powahroot, but also because I didn't really like the way any of the alternatives functioned because I'm picky.

Originally I wrote it for this project (although it's actually for a little satellite project that isn't open-source unfortunately - maybe at some point in the future!) - but I liked it so much that I decided that I had to turn it into a full library that I could share here.

In short, a routing framework helps you get requests handled in the right places in your application. I've actually blogged about this before, so I'd recommend you go and read that post first before continuing with this one.

For all the similarities between the server side (as mentioned in my earlier post) and the client side, the 2 environments are different enough that they warrant having 2 distinctly separate routers. In powahroot, I provide both a ServerRouter and a ClientRouter.

The ServerRouter is designed to handle Node.js HTTP request and response objects. It provides shortcut methods .get(), .post(), and others to quickly create routes for different request types - and also supports middleware to enable logical separation of authentication, request processing, and response generation.

The ClientRouter, on the other hand, is essentially a stripped-down version of the ServerRouter that's tailored to functioning in a browser environment. It doesn't support middleware (yet?), but it does support the pushstate that's part of the History API.

I've also published it on npm, so you can install it like this:

npm install --save powahroot

Then you can use it like this:

# On the server
import ServerRouter from 'powahroot/Server.mjs';

// ....

const router = new ServerRouter();
router.on_all(async (context, next) => { console.debug(context.url); await next()})
router.get("/files/::filepath", (context, _next) => context.send.plain(200, `You requested ${context.params.filepath}`));
// .....

# On the client
import ClientRouter from 'powahroot/Client.mjs';

// ....

const router = new ClientRouter({
    // Options object. Default settings:
    verbose: false, // Whether to be verbose in console.log() messages
    listen_pushstate: true, // Whether to react to browser pushstate events (excluding those generated by powahroot itself, because that would cause an infinite loop :P)
});

As you can see, powahroot uses ES6 Modules, which makes it easy to split up your code into separate independently-operating sections.

In addition, I've also generated some documentation with the documentation tool on npm. It details the API available to you, and should serve as a good reference when using the library.

You can find that here: https://starbeamrainbowlabs.com/code/powahroot/docs/

It's automatically updated via continuous integration and continuous deployment, which I really do need to get around to blogging about (I've spent a significant amount of time setting up the base system upon which powahroot's CI and CD works. In short I use Laminar CI and a GitHub Webhook, but there's a lot of complicated details).

Found this interesting? Used it in your own project? Got an idea to improve powahroot? Comment below!

Easy Node.JS Dependencies Updates

Once you've had a project around for a while, it's inevitable that dependency updates will become available. Unfortunately, npm (the Node Package Manager), while excellent at everything else, is completely terrible at notifying you about updates.

The solution to this is, of course, to use an external tool. Personally, I use npm-check, which is also installable via npm. It shows you a list of updates to your project's dependencies, like so:

(Can't see the above? View it directly on asciinema.org)

It even supports the packages that you've install globally too, which no other tool appears to do as far as I can tell (although it does appear to miss some packages, such as npm and itself). To install it, simply do this:

sudo npm install --global npm-check

Once done, you can then use it like this:

# List updates, but don't install them
npm-check
# Interactively choose the updates to install
npm-check -u
# Interactively check the globally-installed packages
sudo npm-check -gu

The tool also checks to see which of the dependencies are actually used, and prompts you to check the dependencies it think you're not using (it doesn't always get it right, so check carefully yourself before removing!). There's an argument to disable this behaviour:

npm-check --skip-unused

Speaking of npm package dependencies, the other big issue is security vulnerabilities. GitHub have recently started giving maintainers of projects notifications about security vulnerabilities in their dependencies, which I think is a brilliant idea.

Actually fixing said vulnerabilities is a whole other issue though. If you don't want to update all the dependencies of a project to the latest version (perhaps you're just doing a one-off contribution to a project or aren't very familiar with the codebase yet), there's another tool - this time built-in to npm - to help out - the npm audit subcommand.

# Check for reported security issues in your dependencies
npm audit
# Attempt to fix said issues by updating packages *just* enough
npm audit fix

(Can't see the above? View it directly on asciinema.org)

This helps out a ton with contributing to other projects. Issues arise when the vulnerabilities are not in packages you directly depend on, but instead in packages that you indirectly depend on via dependencies of the packages you've installed.

Thankfully the vulnerabilities in the above can all be traced back to development dependencies (and aren't essential for Peppermint Wiki itself), but it's rather troubling that I can't install updated packages because the packages I depend on haven't updated their dependencies.

I guess I'll be sending some pull requests to update some dependencies! To help me in this, the output of npm audit even displays the dependency graph of why a package is installed. If this isn't enough though, there's always the npm-why which, given a package name, will figure out why it's installed.

Found this interesting? Got a better solution? Comment below!

Bridging the gap between XMPP and shell scripts

In a previous post, I set up a semi-automated backup system for my Raspberry Pi using duplicity, sendxmpp, and an external drive. It's been working fabulously for a while now, but unfortunately the other week sendxmpp suddenly stopped working with no obvious explanation. Given the long list of arguments I had to pass it:

sendxmpp --file "${xmpp_config_file}" --resource "${xmpp_resource}" --tls --chatroom "${xmpp_target_chatroom}" ...........

....and the fact that I've had to tweak said arguments on a number of occasions, I thought it was time to switch it out for something better suited to the task at hand.

Unfortunately, finding such a tool proved to be a challenge. I even asked on Reddit - but nobody had anything that fit the bill (xmpp-bridge wouldn't compile correctly - and didn't support multi-user chatrooms anyway, and xmpppy was broken too).

If you're unsure as to what XMPP is, I'd recommend checkout out either this or this tutorial. They both give a great introduction to what it is, what it does, and how it works - and the rest of this post will make much more sense if you read that first :-)

To this end, I finally gave in and wrote my own tool, which I've called xmppbridge. It's a global Node.JS script that uses the simple-xmpp to forward the standard input to a given JID over XMPP - which can optionally be a group chat.

In this post, I'm going to look at how I put it together, some of the issues I ran into along the way, and how I solved them. If you're interested in how to install and use it, then the package page on npm will tell you everything you need to know:

xmppbridge on npm

Architectural Overview

The script consists of 3 files:

index.sh - Calls the main script with ES6 modules enabled
index.mjs - Parses the command-line arguments and environment variables out, and provides a nice CLI
XmppBridge.mjs - The bit that actually captures input from stdin and sends it via XMPP

Let's look at each of these in turn - starting with the command-line interface.

CLI Parsing

The CLI itself is relatively simple - and follows a paradigm I've used extensively in C♯ (although somewhat modified of course to get it to work in Node.JS, and without fancy ANSI colouring etc.).

#!/usr/bin/env node
"use strict";

import XmppBridge from './XmppBridge.mjs';

const settings = {
    jid: process.env.XMPP_JID,
    destination_jid: null,
    is_destination_groupchat: false,
    password: process.env.XMPP_PASSWORD
};

let extras = [];
// The first arg is the script name itself
for(let i = 1; i < process.argv.length; i++) {
    if(!process.argv[i].startsWith("-")) {
        extras.push(process.argv[i]);
        continue;
    }

    switch(process.argv[i]) {
        case "-h":
        case "--help":
            // ........
            break;

        // ........

        default:
            console.error(`Error: Unknown argument '${process.argv[i]}'.`);
            process.exit(2);
            break;
    }
}

We start with a shebang, telling Linux-based systems to execute the script with Node.JS. Following that, we import the XmppBridge class that's located in XmppBrdige.mjs (we'll come back to this later). Then, we define an object to hold our settings - and pull in the environment variables along with defining some defaults for other parameters.

With that setup, we can then parse the command-line arguments themselves - using the exact same paradigm I've used time and time again in C♯.

Once the command-line arguments are parsed, we validate the final settings to ensure that the user hasn't left any required parameters undefined:

for(let environment_varable of ["XMPP_JID", "XMPP_PASSWORD"]) {
    if(typeof process.env[environment_varable] == "undefined") {
        console.error(`Error: The environment variable ${environment_varable} wasn't found.`);
        process.exit(1);
    }
}

if(typeof settings.destination_jid != "string") {
    console.error("Error: No destination jid specified.");
    process.exit(5);
}

That's basically all that index.mjs does. All that's really left is passing the parameters to an instance of XmppBridge:

const bridge = new XmppBridge(
    settings.destination_jid,
    settings.is_destination_groupchat
);
bridge.start(settings.jid, settings.password);

Shebang Trouble

Because I've used ES6 modules here, currently Node must be informed of this via the --experimental-modules CLI argument like this:

node --experimental-modules ./index.mjs

If we're going to make this a global command-line tool via the bin directive in package.json, then we're going to have to ensure that this flag gets passed to Node and not our program. While we could alter the shebang, that comes with the awkward problem that not all systems (in fact relatively few) support using both env and passing arguments. For example, this:

#!/usr/bin/env node --experimental-modules

Wouldn't work, because env doesn't recognise that --experimental-modules is actually a command-line argument and not part of the binary name that it should search for. I did see some Linux systems support env -S to enable this functionality, but it's hardly portable and doesn't even appear to work all the time anyway - so we'll have to look for another solution.

Another way we could do it is by dropping the env entirely. We could do this:

#!/usr/local/bin/node --experimental-modules

...which would work fine on my system, but probably not on anyone else's if they haven't installed Node to the same place. Sadly, we'll have to throw this option out the window too. We've still got some tricks up our sleeve though - namely writing a bash wrapper script that will call node telling it to execute index.mjs with the correct arguments. After a little bit of fiddling, I came up with this:

#!/usr/bin/env bash
install_dir="$(dirname "$(readlink -f $0)")";
exec node --experimental-modules "${install_dir}/index.mjs" $@

2 things are at play here. Firstly, we have to deduce where the currently executing script actually lies - as npm uses a symbolic link to allow a global command-line tool to be 'found'. Said symbolic link gets put in /usr/local/bin/ (which is, by default, in the user's PATH), and links to where the script is actually installed to.

To figure out the directory that we've been installed to is (and hence the location of index.mjs), we need to dereference the symbolic link and strip the index.sh filename away. This can be done with a combination of readlink -f (dereferences the symbolic link), dirname (get the parent directory of a given file path), and $0 (holds the path to the currently executing script in most circumstances) - which, in the case of the above, gets put into the install_dir variable.

The other issue is passing all the existing command-line arguments to index.mjs unchanged. We do this with a combination of $@ (which refers to all the arguments passed to this script except the script name itself) and exec (which replaces the currently executing process with a new one - in this case it replaces the bash shell with node).

This approach let's us customise the CLI arguments, while still providing global access to our script. Here's an extract from xmppbridge's package.json showing how I specify that I want index.sh to be a global script:

{
    .....

    "bin": {
        "xmppbridge": "./index.sh"
    },

    .....
}

Bridging the Gap

Now that we've got Node calling our script correctly and the arguments parsed out, we can actually bridge the gap. This is as simple as some glue code between simple-xmpp and readline. simple-xmpp is an npm package that makes programmatic XMPP interaction fairly trivial (though I did have to look at examples in the GitHub repository to figure out how to send a message to a multi-user chatroom).

readline is a Node built-in that allows us to read the standard input line-by-line. It does other things too (and is great for interactive scripts amongst other things), but that's a tale for another time.

The first task is to create a new class for this to live in:

"use strict";

import readline from 'readline';

import xmpp from 'simple-xmpp';

class XmppBridge {

    /**
     * Creates a new XmppBridge instance.
     * @param   {string}    in_login_jid        The JID to login with.
     * @param   {string}    in_destination_jid  The JID to send stdin to.
     * @param   {Boolean}   in_is_groupchat     Whether the destination JID is a group chat or not.
     */
    constructor(in_destination_jid, in_is_groupchat) {
        // ....
    }
}

export default XmppBridge;

Very cool! That was easy. Next, we need to store those arguments and connect to the XMPP server in the constructor:

this.destination_jid = in_destination_jid;
this.is_destination_groupchat = in_is_groupchat;

this.client = xmpp;
this.client.on("online", this.on_connect.bind(this));
this.client.on("error", this.on_error.bind(this));
this.client.on("chat", ((_from, _message) => {
    // noop
}).bind(this));

I ended up having to define a chat event handler - even though it's pointless, as I ran into a nasty crash if I didn't do so (I suspect that this use-case wasn't considered by the original package developer).

The next area of interest is that online event handler. Note that I've bound the method to the current this context - this is important, as it would be able to access the class instance's properties otherwise. Let's take a look at the code for that handler:

console.log(`[XmppBridge] Connected as ${data.jid}.`);
if(this.is_destination_groupchat) {
    this.client.join(`${this.destination_jid}/bot_${data.jid.user}`);
}
this.stdin = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
    terminal: false
});
this.stdin.on("line", this.on_line_handler.bind(this));
this.stdin.on("close", this.on_stdin_close_handler.bind(this));

This is the point at which we open the standard input and start listening for things to send. We don't do it earlier, as we don't want to end up in a situation where we try sending something before we're connected!

If we're supposed to be sending to a multi-user chatroom, this is also the point at which it joins said room. This is required as you can't send a message to a room that you haven't joined.

The resource (the bit after the forward slash /), for a group chat, specifies the nickname that you want to give to yourself when joining. Here, I automatically set this to the user part of the JID that we used to login prefixed with bot_.

The connection itself is established in the start method:

start(jid, password) {
    this.client.connect({
        jid,
        password
    });
}

And every time we receive a line of input, we execute the send() method:

on_line_handler(line_text) {
    this.send(line_text);
}

I used a full method here, as initially I had some issues and wanted to debug which methods were being called. That send method looks like this:

send(message) {
    this.client.send(
        this.destination_jid,
        message,
        this.is_destination_groupchat
    );
}

The last event handler worth mentioning is the close event handler on the readline interface:

on_stdin_close_handler() {
    this.client.disconnect();
}

This just disconnects from the XMXPP server so that Node can exit cleanly.

That basically completes the script. In total, the entire XmppBridge.mjs class file is 72 lines. Not bad going!

You can install this tool for yourself with sudo npm install -g xmppbridge. I've documented how it use it in the README, so I'd recommend heading over there if you're interested in trying it out.

Found this interesting? Got a cool use for XMPP? Comment below!

Sources and Further Reading

Converting my timetable to ical with Node.JS and Nightmare

(Source: Taken by me!)

My University timetable is a nightmare. I either have to use a terrible custom app for my phone, or an awkwardly-built website that feels like it's at least 10 years old!

Thankfully, it's not all doom and gloom. For a number of years now, I've been maintaining a Node.JS-based converter script that automatically pulls said timetable down from the JSON backend of the app - thanks to a friend who reverse-engineered said app. It then exports it as a .ical file that I can upload to my server & subscribe to in my pre-existing calendar.

Unfortunately, said backend changed quite dramatically recently, and broke my script. With the only alternative being the annoying timetable website that really don't like being scraped.

Where there's a will, there's a way though. Not to be deterred, I gave it a nightmare of my own: a scraper written with Nightmare.JS - a Node.JS library that acts, essentially, as a scriptable web-browser!

While the library has some kinks (especially with .wait("selector")), it worked well enough for me to implement a scraper that pulled down my timetable in HTML form, which I then proceeded to parse with cheerio.

The code is open-source (find it here!) - and as of this week I've updated it to work with the new update to the timetabling system this semester. A further update will be needed in early December time, which I'll also be pushing to the repository.

The README of the repository should contain adequate instructions for getting it running yourself, but if not, please open an issue!

Note that I am not responsible for anything that happens as a result of using this script! I would strongly recommend setting up the secure storage of your password if you intend to automate it. I've just written this to solve a problem in order to ensure that I can actually get to my lectures on time - and not an hour late or on the wrong week because I've misread the timetable (again)!

In the future, I'd like to experiment with other scriptable web-browser frameworks to compare them with my experiences with NightmareJS.

Found this interesting? Found a better way to do this? Comment below!

Routers: Essential, everywhere, and yet exasperatingly elusive

Now that I've finished my University work for the semester (though do have a few loose ends left to tie up), I've got some time on my hands to do a bunch of experimenting that I haven't had the time for earlier in the year.

In this case, it's been tracking down an HTTP router that I used a few years ago. I've experimented with a few now (find-my-way, micro-http-router, and rill) - but all of them few something wrong with them, or feel too opinionated for my taste.

I'm getting slightly ahead of myself though. What's this router you speak of, and why is it so important? Well, it call comes down to application design. When using PHP, you can, to some extent, split your application up by having multiple files (though I would recommend filtering everything through a master index.php). In Node.JS, which I've been playing around with again recently, that's not really possible.

A comparison of the way PHP and Node.JS applications are structured. See the explanation below.

Unlike PHP, which gets requests handed to it from a web server like Nginx via CGI (Common Gateway Interface), Node.JS is the server. You can set up your very own HTTP server listening on port 9898 like this:

import http from 'http';
const http_server = http.createServer((request, response) => {
    response.writeHead(200, {
        "x-custom-header": "yay"
    });
    response.end("Hello, world!");
}).listen(9898, () => console.log("Listening on pot 9898"));

This poses a problem. How do we know what the client requested? Well, there's the request object for that - and I'm sure you can guess what the response object is for - but the other question that remains is how to we figure out which bit of code to call to send the client the correct response?

That's where a request router comes in handy. They come in all shapes and sizes - ranging from a bare-bones router to a full-scale framework - but their basic function is the same: to route a client's request to the right place. For example, a router might work a little bit like this:

import http from 'http';
import Router from 'my-awesome-router-library';

// ....

const router = new Router();

router.get("/login", route_login);
router.put("/inbox/:username", route_user_inbox_put);

const http_server = http.createServer(router.handler()).listen(20202);

Pretty simple, right? This way, every route can lead to a different function, and each of those functions can be in a separate file! Very cool. It all makes for a nice and neat way to structure one's application, preventing any issues relating to any one file getting too big - whilst simultaneously keeping everything orderly and in its own place.

Except when you're picky like me and you can't find a router you like, of course. I've got some pretty specific requirements. For one, I want something flexible and unopinionated enough that I can do my own thing without it getting in the way. For another, I'd like first-class support for middleware.

What's middleware you ask? Well, I've only just discovered it recently, but I can already tell that's its a very powerful method of structuring more complex applications - and devastatingly dangerous if used incorrectly (the spaghetti is real).

Basically, the endpoint of a route might parse some data that a client has sent it, and maybe authenticate the request against a backend. Perhaps a specific environment needs to be set up in order for a request to be fulfilled.

While we could do these things in the end route, it would clutter up the code in the end route, and we'd likely have more boilerplate, parsing, and environment setup code than we have actual application logic! The solution here is middleware. Think of it as an onion, with the final route application logic in the middle, and the parsing, logging, and error handling code as the layers on the outside.

A diagram visualising an application with 3 layers of middleware: an error handler, a logger, and a data parser - with the application logic in the middle. Arrows show that a request makes its way through these layers of middleware both on the way in, and the way out.

In order to reach the application logic at the centre, an incoming request must first make its way through all the layers of middleware that are in the way. Similarly, it must also churn through the layers of middleware in order to get out again. We could represent this in code like so:

// Middleware that runs for every request
router.use(middleware_error_handler);
router.use(middleware_request_logger);

// Decode all post data with middleware
// This won't run for GET / HEAD / PUT / etc. requests - only POST requests
router.post(middleware_decode_post_data);

// For GET requestsin under `/inbox`, run some middleware
router.get("/inbox", middleware_setup_user_area);

// Endpoint routes
// These function just like middleware too (i.e. we could
// pass the request through to another layer if we wanted
// to), but that don't lead anywhere else, so it's probably
// better if we keep them separate
router.get("/inbox/:username", route_user_inbox);
router.any("/honeypot", route_spambot_trap);

router.get("/login", route_display_login_page);
router.post("/login", route_do_login);

Quite a neat way of looking at it, right? Lets take a look at some example middleware for our fictional router:

async function middleware_catch_errors(context, next) {
    try {
        await next();
    } catch(error) {
        console.error(error.stack);
        context.response.writeHead(503, {
            "content-type": "text/plain"
        });
        // todo make this fancier
        context.response.end("Ouch! The server encountered an error and couldn't respond to your request. Please contact bob at [email protected]!");
    }
}

See that next() call there? That function call there causes the application to enter the next layer of middleware. We can have as many of these layers as we like - but don't go crazy! It'll cause you problems later whilst debugging.....

What I've shown here is actually very similar to the rill framework - it just has a bunch of extras tagged on that I don't like - along with some annoying limitations when it comes to defining routes.

To that end, I think I'll end up writing my own router, since none of the ones I've found will do the job just right. It kinda fits with the spirit of the project that this is for, too - teaching myself new things that I didn't know before.

If you're curious as to how a Node.JS application is going to fit in with a custom HTTP + WebSockets server written in C♯, then the answer is a user management panel. I'm not totally sure where this is going myself - I'll see where I end up! After all, with my track record, you're bound to find another post or three showing up on here again some time soon.

Until then, Goodnight!

Found this useful? Still got questions? Comment below!

Distributing work with Node.js

A graph of the data I generated by writing the scripts I talk about in this post. (Above: A pair of graphs generated with gnuplot from the data I crunched with the scripts I talk about in this blog post. Anti-aliased version - easier to pick details out [928.1 KiB])

I really like Node.js. For those not in the know, it's basically Javascript for servers - and it's brilliant at networking. Like really really good. Like C♯-beating good. Anyway, last week I had a 2-layer neural network that I wanted to simulate all the different combinations from 1-64 nodes in both layers for, as I wanted to generate a 3-dimensional surface graph of the error.

Since my neural network (which is also written in Node.js :P) has a command-line interface, I wrote a simple shell script to drive it in parallel, and set it going on a Raspberry Pi I have acting as a file server (it doesn't do much else most of the time). After doing some calculations, I determined that it would finish at 6:40am Thursday..... next week!

Of course, taking so long is no good at all if you need it done Thursday this week - so I set about writing a script that would parallelise it over the network. In the end I didn't actually include the data generated in my report for which I had the Thursday deadline, but it was a cool challenge nonetheless!

Server

To start with, I created a server script that would allocate work items, called nodecount-surface-server.js. The first job was to set things up and create a quick settings object and a work item generator:

#!/usr/bin/env node
// ^----| Shebang to make executing it on Linux easier

const http = require("http"); // We'll need this later

const settings = {
    port: 32000,
    min: 1,
    max: 64,
};
settings.start = [settings.min, settings.min];

function* work_items() {
    for(let a = settings.start[0]; a < settings.max; a++) {
        for(let b = settings.start[1]; b < settings.max; b++) {
            yield [a, b];
        }
    }
}

That function* is a generator. C♯ has them too - and they let a function return more than one item in an orderly fashion. In my case, it returns arrays of numbers which I use as the topology for my neural networks:

[1, 1]
[1, 2]
[1, 3]
[1, 4]
....

Next, I wrote the server itself. Since it was just a temporary script that was running on my local network, I didn't implement too many security measures - please bear this in mind if using or adapting it yourself!


function calculate_progress(work_item) {
    let i = (work_item[0]-1)*64 + (work_item[1]-1), max = settings.max * settings.max;
    return `${i} / ${max} ${(i/max*100).toFixed(2)}%`;
}

var work_generator = work_items();

const server = http.createServer((request, response) => {
    switch(request.method) {
        case "GET":
            let next = work_generator.next();
            let next_item = next.value;
            if(next.done)
                break;
            response.write(next_item.join("\t"));
            console.error(`[allocation] [${calculate_progress(next_item)}] ${next_item}`);
            break;
        case "POST":
            var body = "";
            request.on("data", (data) => body += data);
            request.on("end", () => {
                console.log(body);
                console.error(`[complete] ${body}`);
            })
            break;
    }
    response.end();
});
server.on("clientError", (error, socket) => {
    socket.end("HTTP/1.1 400 Bad Request");
});
server.listen(settings.port, () => { console.error(`Listening on ${settings.port}`); });

Basically, the server accepts 2 types of requests:

GET requests, which ask for work
POST requests, which respond with the results of a work item

In my case, I send out work items like this:

11  24

...and will be receiving work results like this:

11  24  0.2497276811644629

This means that I don't even need to keep track of which work item I'm receiving a result for! If I did though, I'd probably having some kind of ID-based system with a list of allocated work items which I could refer back to - and periodically iterate over to identify any items that got lost somewhere so I can add them to a reallocation queue.

With that, the server was complete. It outputs the completed work item results to the standard output, and progress information to the standard error. This allows me to invoke it like this:

node ./nodecount-surface-server.js >results.tsv

Worker

Very cool. A server isn't much good without an army of workers ready and waiting to tear through the work items it's serving at breakneck speed though - and that's where the worker comes in. I started writing it in much the same way I did the server:

#!/usr/bin/env node
// ^----| Another shebang, just like the server

const http = require("http"); // We'll need this to talk to the server later
const child_process = require("child_process"); // This is used to spawn the neural network subprocess

const settings = {
    server: { host: "172.16.230.58", port: 32000 },
    worker_command: "./network.js --epochs 1000 --learning-rate 0.2 --topology {topology} <datasets/acw-2-set-10.txt 2>/dev/null"
};

That worker_command there in the settings object is the command I used to execute the neural network, with a placeholder {topology} which we find-and-replace just before execution. Due to obvious reasons (no plagiarism thanks!) I can't release that script itself, but it's not necessary to understand how the distributed work item systme I've written works. It could just as well be any other command you like!

Next up is the work item executor itself. Since it obviously takes time to execute a work item (why else would I go to such lengths to process as many of them at once as possible :P), I take a callback as the 2nd argument (it's just like a delegate or Action in C♯):


function execute_item(data, callback) {
    let command = settings.worker_command.replace("{topology}", data.join(","));
    console.log(`[execute] ${command}`);
    let network_process = child_process.exec(command, (error, stdout, stderr) =>  {
        console.log(`[done] ${stdout.trim()}`);
        let result = stdout.trim().split(/\t|,/g);
        let payload = `${result[0]}\t${result[1]}\t${result[5]}`;

        let request = http.request({
            hostname: settings.server.host,
            port: settings.server.port,
            path: "/",
            method: "POST",
            headers: {
                "content-length": payload.length
            }
        }, (response) => {
            console.log(`[submitted] ${payload}`);
            callback();
        });
        request.write(payload);
        request.end();
    });
}

In the above I substitute in the work item array as a comma-separated list, execute the command as a subprocess, report the result back to the server, and then call the callback. To report the result back I use the http module built-in to Node.JS, but if I were tidy this up I would probably use an npm package like got instead, as it simplifies the code a lot and provides more features / better error handling / etc.

A work item executor is no good without any work to do, so that's what I tackled next. I wrote another function that fetches work items from the server and executes them - wrapping the whole thing in a Promise to make looping it easier later:


function do_work() {
    return new Promise(function(resolve, reject) {
        let request = http.request({
            hostname: settings.server.host,
            port: settings.server.port,
            path: "/",
            method: "GET"
        }, (response) => {
            var body = "";
            response.on("data", (chunk) => body += chunk);
            response.on("end", () => {
                if(body.trim().length == 0) {
                    console.error(`No work item received. We're done!`);
                    process.exit();
                }
                let work_item = body.split(/\s+/).map((item) => parseInt(item.trim()));
                console.log(`[work item] ${work_item}`);
                execute_item(work_item, resolve);
            });
        });
        request.end();
    });
}

Awesome! It's really coming together. Doing just one work item isn't good enough though, so I took it to the next level:

function* do_lots_of_work() {
    while(true) {
        yield do_work();
    }
}

// From https://starbeamrainbowlabs.com/blog/article.php?article=posts/087-Advanced-Generators.html
function run_generator(g) {
    var it = g(), ret;

    (function iterate() {
        ret = it.next();
        ret.value.then(iterate);
    })();
}

run_generator(do_lots_of_work);

Much better. That completed the worker script - so all that remained was to set it going on as many machines as I could get my hands on, sit back, and watch it go :D

I did have some trouble with crashes at the end because there was no work left for them to do, but it didn't take (much) fiddling to figure out where the problem(s) lay.

Each instance of the worker script can max out a single core of a machine, so multiple instances of the worker script are needed per machine in order to fully utilise a single machine's resources. If I ever need to do this again, I'll probably make use of the built-in cluster module to simplify it such that I only need to start a single instance of the worker script per machine instance of 1 for each core.

Come to think of it, it would have looked really cool if I'd done it at University and employed a whole row of machines in a deserted lab doing the crunching - especially since it was for my report....

Liked this post? Got an improvement? Comment below!

Stardust Blog

Tag Cloud