
Inter-process communication between Javascript and Python

Often, different programming languages are good at different things. To this end, it is sometimes desirable to write different parts of a program in different languages. In no situation is this more apparent than with an application implementing some kind of (ethical, I hope) AI-based feature.

I'm currently kinda-sorta-maybe thinking of implementing a lightweight web interface backed by an AI model (more details if the idea comes to fruition), and while I like writing web servers in Javascript (it really shines with asynchronous input/output), AI models generally don't like being run in Javascript very much - as I have mentioned before, Tensorflow.js has a number of bugs that mean it isn't practically useful for doing anything serious with AI.

Naturally, the solution then is to run the AI stuff in Python (yeah, Python sucks - believe me I know) since it has the libraries for it, and get Javascript/Node.js to talk to the Python subprocess via inter-process communication, or IPC.

While Node.js has a fanceh message-passing system it calls IPC, this doesn't really work when communicating with processes that don't also run Javascript/Node.js. To this end, the solution is to use the standard input (stdin) and standard output (stdout) of the child process to communicate:

A colourful diagram of the IPC setup implemented in this post. Node.js, Python, and Terminal are 3 different coloured boxes. Python talks to Node.js via stdin and stdout as input and output respectively. Python's stderr interacts direct with the terminal, as does Node.js' stdin, stdout, stderr.

(Above: A diagram of how the IPC setup we're going for works.)

This of course turned out to be more nuanced and complicated than I expected, so I thought I'd document it here - especially since the Internet was very unhelpful on the matter.

Let's start by writing the parent Node.js script. First, we need to spawn that Python subprocess, so let's do that:

import { spawn } from 'child_process';
const python = spawn("path/to/child.py", {
    stdio: [ "pipe", "pipe", "inherit" ]
});

...where we set stdin and stdout to pipe mode - which lets us interact with the streams - and the standard error (stderr) to inherit mode, which allows it to share the parent process' stderr. That way errors in the child process propagate upwards and end up in the same log file that the parent process sends its output to.

If you need to send the Python subprocess some data to start with, you have to wait until it has initialised before sending it anything:

python.on(`spawn`, () => {
    console.log(`[node:data:out] Sending initial data packet`);
    python.stdin.write(`start\n`);
});

...an easier alternative than message passing for small amounts of data would be to set an environment variable when you call child_process.spawn - i.e. env: { key: "value" } in the options object above.

Next, we need to read the response from the Python script:

import nexline from 'nexline'; // Put this import at the top of the file

const reader = nexline({
    input: python.stdout,
})

for await(const line of reader) {
    console.log(`[node:data:in] ${line}`)
}

The simplest way to do this would be to listen for the data event on python.stdout, but this does not guarantee that each chunk that arrives is actually a line of data, since data between processes is not line-buffered like it is when displaying content in the terminal.
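For illustration, here's roughly what that naive approach looks like - a single data event may deliver half a line, or several lines at once:

// Naive approach - chunks are NOT guaranteed to line up with lines of output
python.stdout.on("data", (chunk) => {
    // chunk is a Buffer that may contain a partial line, or several lines at once
    console.log(`[node:data:in:raw] ${chunk.toString("utf-8")}`);
});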

To fix this, I suggest using one of my favourite npm packages: nexline. Believe it or not, handling this issue efficiently with minimal buffering is a lot more difficult than it sounds, so it's just easier to pull in a package to do it for you.

With a nice little for await..of loop, we can efficiently read the responses from the Python child process.

If you were doing this for real, I would suggest wrapping this in an EventEmitter (Node.js) / EventTarget (WHATWG browser spec, also available in Node.js).
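As a rough sketch of what I mean (untested, and the class and event names here are just placeholders):

import { EventEmitter } from 'events';
import nexline from 'nexline';

class PythonChild extends EventEmitter {
    constructor(python) {
        super();
        this.python = python;
        this.start_reading(); // Fire-and-forget: emits a "line" event per line received
    }

    async start_reading() {
        const reader = nexline({ input: this.python.stdout });
        for await(const line of reader)
            this.emit("line", line);
    }

    send(line) {
        this.python.stdin.write(`${line}\n`);
    }
}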

Python child process

That's basically it for the parent process, but what does the Python script look like? It's really quite easy actually:

import sys

sys.stderr.write(f"[python] hai\n")
sys.stderr.flush()

count = 0
for line in sys.stdin:
    sys.stdout.write(f"boop" + str(count) + "\n")
    sys.stdout.flush()
    count += 1

Easy! We can simply iterate sys.stdin to read from the parent Node.js process.

We can write to sys.stdout to send data back to the parent process, but it's important to call sys.stdout.flush()! Node.js doesn't have an equivalent 'cause it's smart, but in Python it may not actually send the response until who-knows-when (if at all) unless you call .flush() to force it to. Think of it as batching graphics draw calls to increase efficiency, but in this case it doesn't work in our favour.

Conclusion

This is just a quick little tutorial on how to implement Javascript/Node.js <--> Python IPC. We deal in plain-text messages here, but I would recommend using JSON - JSON.stringify() / JSON.parse() (Javascript) | json.dumps() / json.loads() (Python) - to serialise / deserialise messages to ensure robustness. JSON by default contains no newline characters and escapes any present into \n, so it should be safe in this instance.
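As a quick sketch of what I mean, sending and receiving JSON messages from the Node.js side might look something like this (the message structure is just an example):

// Sending: one JSON object per line
python.stdin.write(JSON.stringify({ action: "predict", text: "hello world" }) + "\n");

// Receiving: parse each line read via nexline back into an object
for await(const line of reader) {
    const message = JSON.parse(line);
    console.log(`[node:data:in] ${message.action}`);
}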

See also JSON Lines, a related specification.

Until next time!

Code

index.mjs:

#!/usr/bin/env node
"use strict";

import { spawn } from 'child_process';
import nexline from 'nexline';

///
// Spawn subprocess
///
const python = spawn("/tmp/x/child.py", {
    env: {  // Erases the parent process' environment variables
        "TEST": "value"
    },
    stdio: [ "pipe", "pipe", "inherit" ]
});

python.on(`spawn`, () => {
    console.log(`[node:data:out] start`);
    python.stdin.write(`start\n`);
});

///
// Send stuff on loop - example
///
let count = 0;
setInterval(() => {
    python.stdin.write(`interval ${count}\n`);
    console.log(`[node:data:out] interval ${count}`);
    count++;
}, 1000);


///
// Read responses
///
const reader = nexline({
    input: python.stdout,
})

for await(const line of reader) {
    console.log(`[node:data:in] ${line}`)
}

child.py:

#!/usr/bin/env python3
import sys

sys.stderr.write(f"[python] hai\n")
sys.stderr.flush()

count = 0
for line in sys.stdin:
    # sys.stderr.write(f"[python:data:in] {line}\n")
    # sys.stderr.flush()

    sys.stdout.write(f"boop" + str(count) + "\n")
    sys.stdout.flush()
    count += 1

Building the science festival demo: How to monkeypatch an npm package

A pink background dotted with bananas, with the patch-package logo front and centre, and the npm logo small in the top-left. Small brown package boxes are present in the bottom 2 corners.

In a previous post, I talked about the nuts and bolts of the demo on a technical level, and how it's all put together. I alluded to the fact that I had to monkeypatch Babylon.js to disable the gamepad support because it was horribly broken, and I wanted to dedicate an entire post to the subject.

Partly because it's a clever hack I used, and partly because if I ever need to do something similar again I want a dedicated tutorial-style post on how I did it, so I can repeat the process.

Monkeypatching an npm package after installation in a reliable way is an inherently fragile task: it is not something you want to do if you can avoid it. In some cases though, it's unavoidable:

  1. If you're short on time, and need something to work
  2. If you are going to submit a pull request to fix something now, but need an interim workaround until your pull request is accepted upstream
  3. If upstream doesn't want to fix the problem, and you're forced to either maintain a patch or fork upstream into a new project, which is a lot more work.

We'll assume that one of these 3 cases is true.

In the game Factorio, there's a saying 'there's a mod for that' that is often repeated in response to questions in discourse about the game. The same is true of Javascript: If you need to do a non-trivial thing, there's usually an npm package that does it that you can lean on instead of reinventing the wheel.

In this case, that package is called patch-package. patch-package is a lovely little tool that enables you to do 2 related things:

a) Generate patch files simply by editing a given npm package in-situ
b) Automatically and transparently apply generated patch files on npm install, requiring no additional setup steps should you clone your project down from its repository and run npm install.

Assuming you have a working setup with the target npm package you want to patch already installed, first install patch-package:

npm install --save patch-package

Note: We don't --save-dev here, because patch-package needs to run any time the target package is installed... not just in your development environment - unless the target package to patch is also a development dependency.

Next, delve into node_modules/ and directly edit the files associated with the target package you want to edit.

Sometimes, projects will ship multiple npm packages, with one containing the pre-minified build distribution, and the other distributing the raw source - e.g. if you have your own build system like esbuild and want to tree-shake it.

This is certainly the case for Babylon.js, so I had to switch from the main babylonjs package to @babylonjs/core, which contains the source. Unfortunately the official documentation for Babylon.js is rather inconsistent, which can lead to confusion when using the latter, but once I figured out how the imports worked it all came out in the wash.
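For reference, the difference looks roughly like this - with the source package you import just the bits you need, which a bundler like esbuild can then tree-shake (exact import paths may vary between versions):

// Before: the pre-bundled package
// import * as BABYLON from "babylonjs";
// const engine = new BABYLON.Engine(canvas, true);

// After: the source package
import { Engine, Scene } from "@babylonjs/core";

const canvas = document.querySelector("canvas");
const engine = new Engine(canvas, true);
const scene = new Scene(engine);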

Once done, generate the patch file for the target package like so:

npx patch-package your-package-name-here

This should create a patch file in the directory patches/ alongside your package.json file.

The final step is to enable automatic and transparent application of the new patch file on package installation. To do this, open up your package.json for editing, and add the following to the scripts object:

"scripts": {
    "postinstall": "patch-package"
}

...so a complete example might look a bit like this:

{
    "name": "research-smflooding-vis",
    "version": "1.0.0",
    "description": "Visualisations of the main smflooding research for outreach purposes",
    "main": "src/index.mjs",

    // ....

    "scripts": {
        "postinstall": "patch-package",
        "test": "echo \"No tests have been written yet.\"",
        "build": "node src/esbuild.mjs",
        "watch": "ESBUILD_WATCH=yes node src/esbuild.mjs"
    },

    // ......

    "dependencies": {
        // .....
    }
}

That's really all you need to do!

After you've applied the patch like this, don't forget to commit your changes to your git/mercurial/whatever repository.

I would also advise being a bit careful installing updates to any packages you've patched in future, in case of changes - though of course installing dependency updates is vitally important to keep your code up to date and secure.

As a rule of thumb, I recommend actively working to minimise the number of patches you apply to packages, and only use this method as a last resort.

That's all for this post. In future posts, I want to look more at the AI theory behind the demo, its implications, and what it could mean for research in the field in the future (is there even a kind of paper one writes about things one learns from outreach activities that accidentally have a bearing on my actual research? and would it even be worth writing something formal? a question for my supervisor and commenters on that blog post when it comes out I think).

See you in the next post!

(Background to post banner: Unsplash)

Tensorflow / Tensorflow.js in Review

For my PhD, I've been using both Tensorflow.js (Tensorflow for Javascript) and more recently Tensorflow for Python (including the bundled Keras) extensively for implementing multiple different models. Given the experiences I've had so far, I thought it was high time I put my thoughts to paper so to speak and write a blog post reviewing the 2 frameworks.

Tensorflow logo

Tensorflow for Python

Let's start with Tensorflow for Python. I haven't been using it as long as Tensorflow.js, but as far as I can tell they've done a great job of ensuring it comes with batteries included. It has layers that come in an enormous number of different flavours for doing everything you can possibly imagine - including building Transformers (though I ended up implementing the time signal encoding in my own custom layer).

Building custom layers is not particularly difficult either - though you do have to hunt around a bit for the correct documentation, and I haven't yet worked out all the bugs with loading model checkpoints that use custom layers back in again.

Handling data as a generic "tensor" that contains an n-dimensional slab of data is - once you get used to it - a great way of working. It's not something I would recommend to the beginner however - rather I would recommend checking out Brain.js. It's easier to set up, and also more transparent / easier to understand what's going on.

Data preprocessing however is where things start to get complicated. Despite a good set of API reference docs to refer to, it's not clear how one is supposed to implement a performant data preprocessing pipeline. There are multiple methods for doing this (tf.data.Dataset, tf.keras.utils.Sequence, and others), and I have as yet been unable to find a definitive guide on the subject.

Other small inconsistencies are also present, such as the Keras website and the Tensorflow API docs both documenting the Keras API, which in and of itself appears to be an abstraction of the Tensorflow API.... it gets confusing. Some love for the docs more generally is also needed, as I found some of the wording in places ambiguous - so I ended up guessing and having to work it out by experimentation.

By far the biggest issue I encountered though (aside from the data preprocessing pipeline, which is really confusing and frustrating) is that a highly specific version of CUDA is required for each version of Tensorflow. Thankfully, there's a table of CUDA / CuDNN versions to help you out, but it's still pretty annoying that you have to have a specific version. Blender manages to be CUDA enabled while supporting enough different versions of CUDA that I haven't had an issue on stock Ubuntu with the proprietary Nvidia drivers and the system CUDA version, so why can't Tensorflow do it too?

Tensorflow.js

This brings me on to Tensorflow.js, the Javascript bindings for libtensorflow (the underlying C++ library). This also has the specific version of CUDA issue, but in addition the version requirement documented in the README is often wrong, leaving you to make random wild guesses as to which version is required!

Despite this flaw, Tensorflow.js fixes a number of inconsistencies in Tensorflow for Python - I suspect that it was written after Tensorflow for Python was first implemented. The developers have effectively learnt valuable lessons from the Python version of Tensorflow, which has resulted in a coherent and cohesive API that makes much more sense than the API in Python. A great example of this is tf.data.Dataset: the data preprocessing pipeline in Tensorflow.js is well designed and easy to use. The Python version could learn a lot from this.
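To illustrate what I mean, a typical pipeline looks something like this - a simplified sketch (inside an async context), not something from my actual models:

import * as tf from '@tensorflow/tfjs-node';

// Wrap an array (or a generator) in a Dataset, then chain transformations lazily
const dataset = tf.data.array([1, 2, 3, 4, 5, 6, 7, 8])
    .map(x => ({ xs: tf.tensor1d([x]), ys: tf.tensor1d([x * 2]) }))
    .shuffle(8)
    .batch(4);

// Datasets plug straight into model.fitDataset() for training
await dataset.forEachAsync(batch => console.log(batch));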

While Tensorflow.js doesn't have quite the same set of features (a number of prebuilt layers that exist in Python don't yet exist in Tensorflow.js), it still provides a reasonable set of features that satisfy most use-cases. I have noticed a few annoying inconsistencies in the loss functions and how they behave, though - for example I implemented an autoencoder in Tensorflow.js, but it only returned black and white pixels - whereas Tensorflow for Python returned greyscale as intended.

Aside from improving CUDA support and adding more prebuilt layers, the other thing that Tensorflow.js struggles with is documentation. It has far fewer guides than Tensorflow for Python, and it also has an ambiguity problem in the API docs. This could be mostly resolved by being slightly more generous with explanations as to what things are and do. Adding some examples to the docs in question would also help - as would fixing the bug where it does not highlight the item you're currently viewing in the navigation pane (DevDocs support would be awesome too).

Conclusion

Tensorflow for Python and Tensorflow.js are feature-filled and performant frameworks for machine learning and processing large datasets with GPU acceleration. Many tutorials are provided to help newcomers to the frameworks, but once you've followed a tutorial or 2 you're very much left on your own. A number of caveats and difficulties such as CUDA versioning and confusing APIs / docs make mastering the frameworks difficult.

applause-cli: A Node.js CLI handling library

Continuing in the theme of things I've forgotten to talk about, I'd like to post about another package I released a little while ago. I've been building a number of command line interfaces for my PhD, so I thought it would be best to use a library for handling them.

I found clap, but it didn't quite do what I wanted - so I wrote my own inspired by it. Soon enough I needed to use the code in several different projects, so I abstracted the logic for it out and called it applause-cli, which you can now find on npm.

It has no dependencies, and it allows you to define a set of arguments and have it parse out the values from a given input array of items automatically. Here's an example of how it works:

import Program from 'applause-cli';

let program = new Program("path/to/package.json");
program.argument("food", "Specifies the food to find.", "apple")
    .argument("count", "The number of items to find", 1, "number");

program.parse(process.argv.slice(2)); // Might return { food: "banana", count: 6 }

I even have automated documentation generated with the documentation npm package and uploaded to my website via Continuous Integration: https://starbeamrainbowlabs.com/code/applause-cli/. I've worked pretty hard on the documentation for this library actually - it even has integrated examples to show you how to use each function!

The library can also automatically generate help output from the provided information when the --help argument is detected too - though I have yet to improve the output if a subcommand is called (e.g. mycommand dostuff --help) - this is on my todo list :-)

Here's an example of the help text it automatically generates:

If this looks like something you'd be interested in using, I recommend checking out the npm package here: https://www.npmjs.com/package/applause-cli

For the curious, applause-cli is open-source under the MPL-2.0 licence. Find the code here: https://github.com/sbrl/applause-cli.

Rendering Time plan / Gantt charts: hourgraph

I have a number of tools and other programs I've implemented but forgotten to blog about here - hourgraph is one such tool that I stumbled across again today. Originally I implemented it for my PhD panel 1 topic project analysis report, as I realised that not only had I manually created a number of these charts already, I was going to have to create a bunch more in the future. As I usually do with most of the things I write, I open-sourced it in the hope that someone else will find it useful.

I've published it on NPM, so you can install it like this:

npm install --global hourgraph

You'll need Node.js installed, and Linux users will need to prefix the above with sudo.

The program takes in a TOML definition file. Here's an example:

width = 1500
height = 480
title = "Apples"

[[task]]
name = "Pick apples"
start = 0
duration = 3

[[task]]
name = "Make apple juice"
start = 2
duration = 2

[[task]]
name = "Enjoy!"
start = 4
duration = 4
colour = "hsl(46, 90%, 60%)"
ghost_colour = "hsla(46, 90%, 60%, 0.1)"

The full set of options are available in the default config file, which is loaded in to fill in any gaps of things you haven't specified in your custom file.

Comprehensive usage instructions are found in the README, but you can render a new time plan chart thingy like this:

hourgraph --input path/to/input.toml --output path/to/output.svg

The above renders to this:

Hourgraph output

Personally, I find it's much easier to create charts like this by defining them in a simple text file that is then rendered into the actual thing. That way, I don't have to fiddle with the layout myself - it all comes out in the wash automatically.

For those interested in the code, it can be found here: https://github.com/sbrl/hourgraph

Skyliner: Automated text document outlining

When editing large documents, it's often helpful to have a hierarchical "navigation view" of sorts. For a text document like the Markdown that I'm typing now for this blog post, it would consist of a list of headings in the document. For a Javascript or C♯ file, it might consist of classes, functions, and methods.

Either way, it's a helpful thing to have - but for some ridiculous reason as far as I know a generic text document outlining tool doesn't exist.

While GitHub's Atom (my code editor of choice) has an outline view via atom-ide that works great, it only supports a limited number of languages (you have to have a plugin installed for every language), and sometimes it hangs and takes like 30 seconds plus to generate the outline.

To this end, I intend to remedy the situation with a new library I'm writing called Skyliner.

It's based on finite-state automata and regular expressions (also, Wikipedia and regexper), and it streams the input and generates an outline line-by-line. It provides both a command-line interface and a Javascript API (though the command-line interface currently consumes all input before generating any output, but the library is capable of streaming objects as it consumes the input). As of the time of typing, it supports the following languages:

  • clike (e.g. c, c++, header files)
  • csharp
  • go
  • ini
  • javascript
  • json
  • lua
  • markdown
  • php
  • rust
  • sh (including bash)
  • toml
  • xml

While using regular expressions means that the output won't be perfect (especially in the case of XML/HTML), it does mean that adding support for a new language is as simple as defining a new finite-state automaton in a Javascript object. Adding support for Lua was a 10-15 minute job - including automated tests! Support for more languages is definitely on the way.
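I'll document the real definition format properly later, but conceptually a definition is something along these lines - a set of states, each with regular expressions that either emit an outline item or switch state (purely illustrative; not Skyliner's actual format):

// Purely illustrative sketch - not Skyliner's actual definition format
export default {
    initial: "default",
    states: {
        default: [
            // Emit an outline item when a heading is found
            { regex: /^(#+)\s+(.+)$/, emit: match => ({
                depth: match[1].length,
                name: match[2]
            }) },
            // Switch state to skip over fenced code blocks
            { regex: /^```/, next: "code_block" }
        ],
        code_block: [
            { regex: /^```/, next: "default" }
        ]
    }
};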

The combination of line-by-line parsing, regular expressions, and finite-state automata also means that it's much faster than Atom IDE, because it doesn't have to parse the entire document into an abstract syntax tree before generating the outline.

My goal here is to have good support for as many languages as possible, rather than amazing support for just a small handful of languages. This isn't to say that I won't fix issues with existing languages as they come up, but the focus is really on ensuring that whenever I open a text document in Atom, I can get at least a rough outline of it.

Once I've added support for a bunch of languages, written some documentation, and published it on npm, I'm going to try my hand at implementing my first plugin for GitHub's Atom. Apparently plugins for Atom aren't too difficult to port to Visual Studio Code, so maybe I'll take a look at doing that too (but I don't use Visual Studio Code much, so it's not a priority).

The code is open-source already (link below) and completely usable, but since it's an ongoing process you can expect another post on here in the future about my progress.

Skyliner on GitHub: https://github.com/sbrl/skyliner

If you'd be interested in some more detail about how it works, I'll be writing some documentation soon, which will appear in the above Git repository.

PhD Aside: Reading a file descriptor line-by-line from multiple Node.js processes

Phew, that's a bit of a mouthful. We're taking a short break from the cluster series of posts (though those will be back next week I hope), because I've just run into a fascinating problem, the solution to which I thought I'd share here - since I didn't find a solution elsewhere on the web.

For my PhD, I've got a big old lump of data, and it all needs preprocessing before I train an AI model (or a variant thereof, since I'm effectively doing video-to-image translation). Unfortunately, one of the preprocessing steps is really slow. And because I'll naturally be training my AI for multiple epochs, the problem is multiplied.....

The solution, of course, is to do all the preprocessing up front such that I can just read the data in and push it directly into a Tensor in the right format. However, doing this on such a large dataset would take forever if I did the items 1 by 1. The thing is that Javascript isn't inherently multithreaded. I like this quote, as it describes the situation rather well:

In Javascript everything runs in parallel... except your code

--Felix Geisendörfer

In other words, when Node.js is reading or writing to and from the network, disk, or other places, it can do lots of things at the same time because it does them asynchronously. The Javascript that gets executed, though, only runs on a single thread.

This is great for io-bound tasks (such as a web server), as Node.js (a Javascript runtime) can handle many requests at the same time. On a side note, this is also the reason why Nginx is more efficient than Apache (because Nginx is event based too like Javascript, unlike Apache which is thread based).

It's not so great though for CPU bound tasks, such as the one I've got on my hands. All is not lost though, because Node.js has a number of useful functions inbuilt that we can use to tackle the issue.

Firstly, Node.js has a clever forking system. By using child_process.fork(), a single Node.js process can create multiple copies of itself to act as workers:

// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i < os.cpus().length; i++) {
    workers.push(
        child_process.fork("worker.mjs")
    );
}
// worker.mjs
console.log(`Hello, world from a child process!`);

Very useful! The next much more sticky problem though is how to actually preprocess the data in a performant manner. In my specific case, I'm piping the data in from a shell script that decompresses a number of gzip archives in a specific order (as of the time of typing I have yet to implement this).

Because this is a single pipe we're talking about here, the question now arises of how to allow all the child processes to access the data that's coming in from the standard input of the master process.

I've actually encountered an issue like this one before. I initially tried reading it in on the master process, and then using worker.send(message) to send it to the worker processes for processing. This didn't end up working very well, because the master process became a bottleneck as it couldn't read from the standard input and send stuff to the workers fast enough.

With this in mind, I came up with a new plan. In Node.js, when you're forking to create a worker process, you can supply it with some custom file descriptors upon initialisation. So long as it has at least an IPC (inter-process communication) channel for passing messages back and forth with the .send() method and .on("message", (message) => ....) listeners, it doesn't actually care what you do with the others.

Cue file descriptor cloning:


// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i < os.cpus().length; i++) {
    workers.push(
        child_process.fork("worker.mjs", [], {
            stdio: [ 0, 1, 2, "ipc" ] // Clone stdin (0), stdout (1) & stderr (2), plus an IPC channel
        })
    );
}

I've highlighted the key line here (line 10 for those who can't see it). Here we tell it to clone file descriptors 0, 1, and 2 - which refer to stdin, stdout, and stderr respectively. This allows the worker processes direct access to the master process' stdin, stdout, and stderr.

With this, we can read from the same pipe with as many worker processes as we like - so long as they do so 1 at a time.

With this sorted, it gives rise to the next issue: reading line-by-line. Packages exist on npm (such as nexline, my personal favourite) to read from a stream line-by-line, but they have the unfortunate side-effect of maintaining a read buffer. While this is great for performance, it's not so great in my situation because it ends up scrambling the input! This is because said read buffer would be local to each worker process, so when the next worker along reads, it will skip a random number of bytes and start reading from the next bit along.

This means that I need to implement a custom method that reads a single line from a given file descriptor without maintaining a read buffer. I came up with this:

import fs from 'fs';

//  .....

// Global buffer to avoid unnecessary memory churn
let buffer = Buffer.alloc(4096);
function read_line_unbuffered(fd) {
    let i = 0;
    while(true) {
        let bytes_read = fs.readSync(fd, buffer, i, 1);
        if(bytes_read !== 1 || buffer[i] == 0x0A) {
            if(i == 0 && bytes_read == null) return null;
            return buffer.toString("utf-8", 0, i); // This is not inclusive, so we can abuse it to trim the \n off the end
        }

        i++;
        if(i == buffer.length) {
            let new_buffer = Buffer.alloc(Math.ceil(buffer.length * 1.5));
            buffer.copy(new_buffer);
            buffer = new_buffer;
        }
    }
}

I read from the given file descriptor character by character directly into a buffer. As soon as it detects a new line character (\n, or character code 0x0A), it returns the new line. If we run out of space in the buffer, then we create a new larger one, copy the old buffer's contents into it, and keep going.

I maintain a global buffer here, because this helps to avoid unnecessary memory churn. In my case, the lines I'm reading in are rather long (hence the need to clone the file descriptor in the first place), and if I didn't keep a shared buffer I'd be allocating and deallocating a new pretty large buffer every time.

This also has the nice side-effect that we keep the largest buffer we've had to use so far around for next time, avoiding the need for subsequent copies to larger and larger buffers.

Finally, we can also guarantee that it won't be a problem if we call this multiple times, because as I explained above Javascript is single-threaded, so if we call the function multiple times in quick succession each read will happen 1 after another.

With this chain of Node.js features, we can read a large amount of data from a pipe and efficiently process its contents. The trick from here is to implement a proper messaging and locking system to avoid reading from the stream at the same time, and to avoid writing to the standard output at the same time.
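The basic idea is a token-passing scheme over the IPC channel: the master grants a "read token" to one worker at a time, and a worker only reads while it holds the token. A simplified sketch of the master side (the message names here are just placeholders):

// Simplified sketch - master side of the token-passing scheme
let queue = workers.slice(); // Workers waiting for their turn to read

function grant_next() {
    const worker = queue.shift();
    if(typeof worker === "undefined") return;
    worker.send({ event: "token_read" }); // This worker may now read 1 line
}

for(const worker of workers) {
    worker.on("message", (message) => {
        if(message.event !== "token_read_done") return;
        queue.push(worker); // Put it back in the queue...
        grant_next();       // ...and pass the token to the next worker
    });
}

grant_next();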

Taking this further, I ended up with this:

(Licence: Mozilla Public Licence 2.0)

This correctly ensures that only 1 worker process reads from the stream at the same time. It doesn't do anything with the result though except log a message to the console, but when I implement that I'll implement a similar messaging system to ensure that only 1 process writes to the output at once.

On that note, my data is also ordered, so I'll have to implement a complicated cache / ordering system to ensure that I write them to the standard output in the same order I read them in. When I do implement that, I'll probably blog about that too....

The main problem I still have with this solution is that I'm reading from the input stream 1 character at a time. I haven't done any proper testing, but I'm pretty sure that doing so will be really slow. I'm not sure I can avoid this and read a few KiBs at a time though, because I don't currently know of any way to put the extra characters back into the input stream.

If anyone has a solution to that that increases performance, I'd love to know. Leave a comment below!

The legend of the disappearing data in Node.js

Happy leap day! :D

A green tree frog :D

_(Above: A nice green tree frog - source)_

Recently, I've been doing a bunch of work in Node.js streaming large amounts of data. For the most part the experience has been highly pleasurable, as Node.js makes it so easy! I have encountered a few pain points though, the most significant of which I'd like to talk about here.

In Node.js, streams come in 3 main forms:

  • Readable Streams
  • Writable Streams
  • Transform Streams

In addition, you can either plug streams together with the .pipe() method, write to them directly with the .write() method, or any combination thereof - allowing you to build up a chain of streams that enables data to flow through your program.
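Piping on its own is pleasantly simple - for example, compressing a file looks like this, and .pipe() handles backpressure for you:

import fs from 'fs';
import zlib from 'zlib';

// Readable → Transform → Writable
fs.createReadStream("/tmp/test.txt")
    .pipe(zlib.createGzip())
    .pipe(fs.createWriteStream("/tmp/test.txt.gz"));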

The problems start when you try and write large amounts of data to a stream directly:

import fs from 'fs';

import do_work from 'somewhere';
import get_some_stream from 'somewhere_else';

let stream_in = get_some_stream();
let out = fs.createWriteStream("/tmp/test.txt");
for(let i = 0; i < 1000000; i++) {
    out.write(do_work(stream_in, i))
}

(Above: Just an example of writing lots of data to a stream)

When this happens, you start to lose random chunks of data. The reason for this is not obvious, but it is buried in the Node.js docs:

The writable.write() method writes some data to the stream, and calls the supplied callback once the data has been fully handled. If an error occurs, the callback may or may not be called with the error as its first argument. To reliably detect write errors, add a listener for the 'error' event.

This is a huge pain. It means that you have to wrap all write calls like this:

"use strict";

/**
 * Writes data to a stream, automatically waiting for the drain event if asked.
 * @param   {stream.Writable}           stream_out  The writable stream to write to.
 * @param   {string|Buffer|Uint8Array}  data        The data to write.
 * @return  {Promise}   A promise that resolves when writing is complete.
 */
function write_safe(stream_out, data) {
    return new Promise((resolve, reject) => {
        // Handle errors
        let handler_error = (error) => {
            stream_out.off("error", handler_error);
            reject(error);
        };
        stream_out.on("error", handler_error);

        if(stream_out.write(data)) {
            // We're good to go
            stream_out.off("error", handler_error);
            resolve();
        }
        else {
            // We need to wait for the drain event before continuing
            stream_out.once("drain", () => {
                stream_out.off("error", handler_error);
                resolve();
            });
        }
    });
}

export { write_safe };

Such a huge boilerplate for such a simple task! Basically, if the .write() method returns false, you have to wait until the drain event is fired on the writeable stream before continuing to write to the stream. The reason for this I think is that it signals that the write buffer is full, and it needs to be drained before writing can continue.
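With that wrapper in place, the example from earlier becomes this (inside an async function, or with top-level await in an ES module):

// The earlier loop, rewritten to respect backpressure via write_safe()
for(let i = 0; i < 1000000; i++) {
    await write_safe(out, do_work(stream_in, i));
}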

This is ok, but it would be nice if this was abstracted away behind a single method, such as the wrapper I've shown above. Something like an async stream.Writable.writeAsync() would be great, but it doesn't currently exist.

I think I'm going to open an issue about it - since it seems very doable and just silly that it doesn't exist already.

Summer Project Series List

At this point, it is basically the end of my summer project series - at least for a while while I start my PhD (more on that in a future post). To this end, I'm releasing the series list for it.

Next Gen Search, Part 2: Pushing the limits

In the last part, we looked at how I built a new backend for storing inverted indexes for Pepperminty Wiki, which allows for partial index deserialisation and other nice features that boost performance considerably.

Since the last post, I've completed work on the new search system - though there are a few bits around the edges that I still want to touch up and do some more work on.

In this post though, I want to talk about how I generated test data to give my full-text search engine something to chew on. I've done this before for my markov chain program I wrote a while back, and that was so much fun I did it again for my search engine here.

After scratching my head for a bit to think of a data source I could use, I came up with the perfect plan. Ages ago I downloaded a Wikipedia dump - just the content pages in Wikitext markup. Why not use that?

As it turns out, it was a rather good idea. Some processing of said dump was required to transform it into a format that Pepperminty Wiki can understand, though. Pepperminty Wiki stores pages on disk as flat text files in markdown, and indexes them in pageindex.json. If pageindex.json doesn't exist, then Pepperminty Wiki rebuilds it automagically by looking for content pages on disk.

This makes it easy to import batches of new pages into Pepperminty Wiki, so all we need to do is extract the wiki text, convert it to markdown, and import! This ended up requiring a number of separate steps though, so let's take them 1 at a time.

First, we need a Wikipedia database dump in the XML format. These are available from dumps.wikimedia.org. There are many different ones available, but I suggest grabbing one that has a filename similar to enwiki-20180201-pages-articles.xml - i.e. just content pages - no revision history, user pages, or additional extras. I think the most recent one as of the time of posting is downloadable here - though I'd warn you that it's 15.3GiB in size! You can see a list of available dump dates for the English Wikipedia here.

Now that we've got our dump, let's extract the pages from it. This is nice and easy to do with wikiextractor on GitHub:

nice -n20 wikiextractor enwiki-20180201-pages-articles.xml --no_templates --html --keep_tables --lists --links --sections --json --output wikipages --compress --bytes 25M >progress.log 2>&1 

This will parse the dump and output a number of compressed files to the wikipages directory. These will have 1 JSON object per line, each containing information about a single page on Wikipedia - with page content pre-converted to HTML for us. The next step is to extract the page content and save it to a file with the correct name. This ended up being somewhat complicated, so I wrote a quick Node.js script to do the job:

#!/usr/bin/env node

const readline = require("readline");
const fs = require("fs");


if(!fs.existsSync("pages"))
        fs.mkdirSync("pages", { mode: 0o755 });

// From https://stackoverflow.com/a/44195856/1460422
function html_unentities(encodedString) {
        var translate_re = /&(nbsp|amp|quot|lt|gt);/g;
        var translate = {
                "nbsp":" ",
                "amp" : "&",
                "quot": "\"",
                "lt"  : "<",
                "gt"  : ">"
        };
        return encodedString.replace(translate_re, function(match, entity) {
                return translate[entity];
        }).replace(/&#(\d+);/gi, function(match, numStr) {
                var num = parseInt(numStr, 10);
                return String.fromCharCode(num);
        });
}

const interface = readline.createInterface({
        input: process.stdin,
        //output: process.stdout
});

interface.on("line", (text) => {
        const obj = JSON.parse(text);

        fs.writeFileSync(`pages/${obj.title.replace(/\//g, "-")}.html`, html_unentities(obj.text));
        console.log(`${obj.id}\t${obj.title}`);
});

This basically takes the stream of JSON objects on the standard input, parses them, and saves the relevant content to disk. We can invoke it like so:

bzcat path/to/*.bz2 | ./parse.js

Don't forget to chmod +x parse.js if you get an error here. The other important thing about the above script is that we have to unescape the HTML entities (e.g. &gt;), because otherwise we'll have issues later with HTML conversion and page names will look odd. This is done by the html_unentities() function in the above script.

This should result in a directory containing a large number of files - 1 file per content page. This is much better, but we're still not quite there yet. Wikipedia uses wiki markup (which we converted to HTML with wikiextractor) and Pepperminty Wiki uses Markdown - the 2 of which are, despite all their similarities, inherently incompatible. Thankfully, pandoc is capable of converting from HTML to markdown.

Pandoc is great at this kind of thing - it uses an intermediate representation and allows you to convert almost any type of textual document format to any other format. Markdown to PDF, EPUB to plain text, ..... and HTML to markdown (just to name a few). It actually looks like it shares a number of features with traditional compilers like GCC.

Anyway, let's use it to convert our folder full of wikitext files to a folder full of markdown:

mkdir -p pages_md;
find pages/ -type f -name "*.html" -print0 | nice -n20 xargs -P4 -0 -n1 -I{} sh -c 'filename="{}"; title="${filename##*/}"; title="${title%.*}"; pandoc --from "html"  --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";';

_(See this on explainshell.com - doesn't include the nice -n20 due to a bug on their end)_

This looks complicated, but it really isn't. Let's break it down a bit:

find pages/ -type f -name "*.html" -print0

This finds all the HTML files that we want to convert to Markdown, and delimits the output with a NUL byte - i.e. 0x0. This makes the next step easier:

... | nice -n20 xargs -P4 -0 -n1 -I{} sh -c '....'

This pipes the list of filenames into xargs, which will execute up to 4 commands at a time. xargs normally executes a command for each line of input it receives - in our case though, we're delimiting items with the NUL byte 0x0 instead. We also explicitly specify that we want 1 command per item of input, as xargs otherwise tries to optimise and do command file1 file2 file3 instead.

The sh -c bit is starting a subshell, in which we execute a small wrapper script that then calls pandoc. This is of course inefficient, but I couldn't find any way around spawning a subshell in this instance.

filename="{}";
title="${filename##*/}";
title="${title%.*}";
pandoc --from "html" --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md";
echo "${title}";

I've broken the sh -c subshell script down into multiple lines for readability. Simply put, it extracts the page title from the filename, converts the HTML to Markdown, and saves it to a new file in a different directory with the .md extension replacing the original .html extension.

When you put all these components together, you get a script that converts a folder full of HTML files to Markdown. Just like with the markov chains extraction I mentioned at the beginning of this post, Bash and shell scripting really is all about lego bricks. This is due in part to the Unix philosophy:

Make each program do one thing well.

There is more to it, but this is the most important point to remember. Many of the core utilities you'll find on the terminal follow this way of thinking.

There's 1 last thing we need to take care of before we have them in the right format though - we need to convert the [display text](page name) markdown-format links back into the Wikipedia [[internal link]] format that Pepperminty Wiki also uses.

Thankfully, another command-line tool I know of called repren is well-suited to this:

repren --from '\[([^\]]+)\]\(([^):]+)\)' --to '[[\1]]' pages_md/*.md

It took some fiddling, but I got all the escaping figured out and the above converts back into the [[internal link]] format well enough.

Now that we've got our folder full of markdown files, we need to extract a random portion of them to act as a test for Pepperminty Wiki - as the whole lot might be a bit much for it to handle (though if Pepperminty Wiki was capable of handling it all eventually that'd be awesome :D). Let's try 500 pages to start:

find path/to/wikipages/ -type f -name "*.md" -print0 | shuf --zero-terminated | head -n500 --zero-terminated | xargs -0 -n1 -I{} cp "{}" .

(See this on explainshell.com)

This is another lego-brick style command. Let's break it down too:

find path/to/wikipages/ -type f -name "*.md" -print0

This lists all the .md files in a directory, delimiting them with a NUL character, as before. It's better to do this than use ls, as find is explicitly designed to be machine-readable.

.... | shuf --zero-terminated

The shuf command randomly shuffles the input lines. In this case, we're telling it that the input is delimited by the NUL byte.

.... | head -n500 --zero-terminated

Similar deal here. head takes the top N lines of input, and discards the rest.

.... | xargs -0 -n1 -I{} cp "{}" .

Finally, xargs calls cp to copy the selected files to the current directory - which is, in this case, the root directory of my test Pepperminty Wiki instance.

Since I'm curious, let's now find out roughly how many words we're dealing with here:

cat data_test/*.md | wc --words
1593190

1.5 million words! That's a lot. I wonder how quickly we can search that?

A screenshot of the Pepperminty Wiki search results on the test wiki for the word food, showing the new dark theme coming soon!

24.8ms? Awesome! That's so much better than before. If you're wondering about the new coat of paint in the screenshot - Pepperminty Wiki is getting a dark theme, thanks to prefers-color-scheme :D

I wonder what happens if we push it to 2K pages?

Another screenshot, the same as before

This time we get ~120ms for 5.9M total words - wow! I wasn't expecting it to perform so well. At this scale, rebuilding the entire index is particularly costly - so if I was to push it even further, it would make sense to implement an incremental approach that spreads the work over multiple requests, assuming I can't squeeze any more performance out of the system as-is.

The last thing I want to do here is make a rough estimate of the time complexity of the search system as-is, given the data we have so far. This isn't particularly difficult to do.

Given the results above, we can calculate that at 1.5M total words, an increase of ~60K total words results in an increase of 1ms of execution time. At 5.9M words, it's only ~49K words / ms of execution time - a drop of ~11K words / ms of execution time.
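Spelling that out: $1.5\text{M} / 24.8\text{ms} \approx 60\text{K}$ words / ms and $5.9\text{M} / 120\text{ms} \approx 49\text{K}$ words / ms, so the throughput drops by $\approx 11\text{K}$ words / ms across the extra $5.9 - 1.5 = 4.4\text{M}$ words.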

From this, we can speculate that for every million words total added to a wiki, we can expect a ~2.5K words / ms of execution time drop - not bad! We'd need more data points to make any reasonable guess as to the Big-O complexity function that it conforms to. My guess would be something like $O(xN^2)$, where x is a constant between ~0.2 and 2.

Maybe at some point I'll go to the trouble of running enough tests to calculate it, but with all the variables that affect the execution time (number of pages, distribution of words across pages, etc.), I'm not in any hurry to calculate it. If you'd like to do so, go ahead and comment below!

Next time, I'll unveil the inner working of the STAS: my new search-term analysis system.

Found this interesting? Got your own story about some cool code you've written to tell? Comment below!
