PhD Aside: Reading a file descriptor line-by-line from multiple Node.js processes
Phew, that's a bit of a mouthful. We're taking a short break from the cluster series of posts (though those should be back next week, I hope), because I've just run into a fascinating problem, the solution to which I thought I'd share here - since I didn't find one elsewhere on the web.
For my PhD, I've got a big old lump of data, and it all needs preprocessing before I train an AI model (or a variant thereof, since I'm effectively doing video-to-image translation). Unfortunately, one of the preprocessing steps is really slow. And because I'll naturally be training my AI for multiple epochs, the problem is multiplied.....
The solution, of course, is to do all the preprocessing up front, such that I can just read the data in and push it directly into a Tensor in the right format. However, doing this on such a large dataset would take forever if I processed the items 1 by 1. The catch is that Javascript isn't inherently multithreaded. I like this quote, as it describes the situation rather well:
In Javascript everything runs in parallel... except your code
In other words, when Node.js is reading or writing to and from the network, disk, or other places, it can do lots of things at the same time because it does them asynchronously. The Javascript that actually gets executed, though, runs on a single thread.
This is great for io-bound tasks (such as a web server), as Node.js (a Javascript runtime) can handle many requests at the same time. On a side note, this is also why Nginx generally handles large numbers of concurrent connections more efficiently than Apache (Nginx is event-based like Javascript, whereas Apache is thread-based).
It's not so great though for CPU bound tasks, such as the one I've got on my hands. All is not lost though, because Node.js has a number of useful functions inbuilt that we can use to tackle the issue.
Firstly, Node.js has a clever forking system. By using child_process.fork(), a single Node.js process can create multiple copies of itself to act as workers:
// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i < os.cpus().length; i++) {
	workers.push(
		child_process.fork("worker.mjs")
	);
}
// worker.mjs
console.log(`Hello, world from a child process!`);
Very useful! The next, much stickier problem is how to actually preprocess the data in a performant manner. In my specific case, I'm piping the data in from a shell script that decompresses a number of gzip archives in a specific order (as of the time of typing, I have yet to implement this).
Because this is a single pipe we're talking about here, the question now arises of how to allow all the child processes to access the data that's coming in from the standard input of the master process.
I've actually encountered an issue like this one before. I initially tried reading the data in on the master process and then using worker.send(message) to send it to the worker processes for processing. This didn't end up working very well, because the master process became a bottleneck: it couldn't read from the standard input and ship data out to the workers fast enough.
With this in mind, I came up with a new plan. In Node.js, when you fork to create a worker process, you can supply it with custom file descriptors upon initialisation. So long as it has at least an IPC (inter-process communication) channel for passing messages back and forth via the .send() method and .on("message", (message) => ....) listener, it doesn't actually care what you do with the others.
Cue file descriptor cloning:
// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i < os.cpus().length; i++) {
	workers.push(
		child_process.fork("worker.mjs", [], {
			stdio: [ 0, 1, 2, "ipc" ]
		})
	);
}
The key line here is line 10: the stdio option. Here we tell it to clone file descriptors 0, 1, and 2 - which refer to stdin, stdout, and stderr respectively. This allows the worker processes direct access to the master process' stdin, stdout, and stderr.
With this, we can read from the same pipe with as many worker processes as we like - so long as they do so 1 at a time.
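To illustrate what that means in practice: inside a worker, file descriptor 0 now refers to the very same pipe as the master's stdin, so the worker can read from it directly with the synchronous fs API. A quick sketch (the 64-byte read here is arbitrary):

// worker.mjs
import fs from 'fs';

let buffer = Buffer.alloc(64);
// fd 0 here is the master process' stdin, shared via the stdio option above
let bytes_read = fs.readSync(0, buffer, 0, buffer.length, null);
console.log(`Read ${bytes_read} bytes: ${buffer.toString("utf-8", 0, bytes_read)}`);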
With this sorted, the next issue arises: reading line-by-line. Packages exist on npm (such as nexline, my personal favourite) to read from a stream line-by-line, but they have the unfortunate side-effect of maintaining a read buffer. While this is great for performance, it's not so great in my situation because it ends up scrambling the input! This is because said read buffer is local to each worker process, so each worker pulls more bytes out of the shared pipe than it actually processes - meaning the next worker to read starts at an effectively random offset, usually partway through a line.
This means that I need to implement a custom method that reads a single line from a given file descriptor without maintaining a read buffer. I came up with this:
import fs from 'fs';

// .....

// Global buffer to avoid unnecessary memory churn
let buffer = Buffer.alloc(4096);

function read_line_unbuffered(fd) {
	let i = 0;
	while(true) {
		// Read exactly 1 byte from the stream's current position
		let bytes_read = fs.readSync(fd, buffer, i, 1, null);
		// Stop at the end of the stream or when we hit a newline
		if(bytes_read !== 1 || buffer[i] == 0x0A) {
			if(i == 0 && bytes_read !== 1) return null; // End of stream and nothing read
			return buffer.toString("utf-8", 0, i); // The end index isn't inclusive, so we can abuse it to trim the \n off the end
		}
		i++;
		// Out of space? Grow the buffer by 1.5x and keep going
		if(i == buffer.length) {
			let new_buffer = Buffer.alloc(Math.ceil(buffer.length * 1.5));
			buffer.copy(new_buffer);
			buffer = new_buffer;
		}
	}
}
It reads from the given file descriptor character by character, directly into a buffer. As soon as it detects a newline character (\n, or character code 0x0A), it returns everything read so far as a string. If we run out of space in the buffer, then we create a new, larger one, copy the old buffer's contents into it, and keep going.
I maintain a global buffer here, because this helps to avoid unnecessary memory churn. In my case, the lines I'm reading in are rather long (hence the need to clone the file descriptor in the first place), and if I didn't keep a shared buffer I'd be allocating and deallocating a new, fairly large buffer on every call.
This also has the nice side-effect that we keep the largest buffer we've had to use so far around for next time, avoiding the need for subsequent copies to larger and larger buffers.
Finally, we can also guarantee that calling this function multiple times within a single process won't be a problem, because as I explained above Javascript is single-threaded, so if we call the function multiple times in quick succession each read will happen one after another.
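In a worker, usage then looks something like this (process_line here is a stand-in for whatever preprocessing actually needs doing):

// Keep pulling lines from the shared stdin pipe (fd 0) until the stream ends
let line;
while((line = read_line_unbuffered(0)) !== null) {
	process_line(line); // Placeholder for the actual preprocessing step
}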
With this chain of Node.js features, we can read a large amount of data from a pipe and process its contents efficiently. The trick from here is to implement a proper messaging and locking system, so that no 2 workers read from the stream at the same time - and likewise so that no 2 workers write to the standard output at the same time.
Taking this further, I ended up with this:
(Licence: Mozilla Public Licence 2.0)
This correctly ensures that only 1 worker process reads from the stream at a time. It doesn't do anything with the result yet except log a message to the console, but when I implement that part I'll add a similar messaging system to ensure that only 1 process writes to the output at once.
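The core of it is a read token passed back and forth over the IPC channel: a worker asks the master for permission to read, the master grants it to 1 worker at a time, and the worker hands the token back once it has read its line. A condensed sketch of that protocol (not the full listing above - the message names here are just illustrative):

// In the master process: serialise access to the shared stdin pipe
let queue = [], locked = false;
function grant_next() {
	if(locked || queue.length === 0) return;
	locked = true;
	queue.shift().send({ type: "read-granted" });
}
for(let worker of workers) {
	worker.on("message", (message) => {
		if(message.type === "read-request") { queue.push(worker); grant_next(); }
		if(message.type === "read-done") { locked = false; grant_next(); }
	});
}

// In the worker process: ask before reading, release as soon as the read is done
process.on("message", (message) => {
	if(message.type !== "read-granted") return;
	let line = read_line_unbuffered(0);
	process.send({ type: "read-done" });
	if(line !== null) {
		console.log(`Got a line ${line.length} characters long`);
		process.send({ type: "read-request" }); // Ask for the next one
	}
});
process.send({ type: "read-request" });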
On that note, my data is also ordered, so I'll have to implement a cache / ordering system to ensure that I write the results to the standard output in the same order I read them in. When I do implement that, I'll probably blog about it too....
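For what it's worth, the usual shape of such a system is a reorder buffer: tag each line with its index as it's read, cache results that come back early, and only write to the output once the next expected index is available. A rough sketch of the idea (none of this is implemented yet - the names are placeholders):

// In the master process: hold out-of-order results until they can be written
let pending = new Map(), next_to_write = 0;
function submit_result(index, result) {
	pending.set(index, result);
	// Flush every consecutive result we now have, in order
	while(pending.has(next_to_write)) {
		process.stdout.write(pending.get(next_to_write));
		pending.delete(next_to_write);
		next_to_write++;
	}
}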
The main problem I still have with this solution is that I'm reading from the input stream 1 byte at a time. I haven't done any proper testing, but I'm pretty sure that doing so will be really slow. I'm not sure I can avoid this and read a few KiB at a time though, because I don't currently know of any way to put the extra characters back into the input stream.
If anyone has a solution to that that increases performance, I'd love to know. Leave a comment below!