Own your code, Part 2: The curious case of the unreliable webhook
In the last post, I talked about how to set up your own Git server with Gitea. In this one, I'm going to take a bit of a different tack - and talk about one of the really annoying problems I ran into when setting up my continuous integration server, Laminar CI.
Since I wanted to run the continuous integration server on a different machine to the Gitea server itself, I needed a way for the Gitea server to talk to the CI server. The natural choice here is, of course, a Webhook-based system.
After installing and configuring Webhook on the CI server, I set to work writing a webhook receiver shell script (more on this in a future post!). Unfortunately, it turned out that Gitea didn't like sending to my CI server very much: whether a delivery succeeded or not was seemingly random. If I hit the "Test Delivery" button enough times, it would eventually go through. My first thought was to bring up the Gitea server logs to see if they would give any additional information. They claimed that there was an i/o timeout communicating with the CI server:
Delivery: Post https://ci.bobsrockets.com/hooks/laminar-config-check: read tcp 5.196.73.75:54504->x.y.z.w:443: i/o timeout
Interesting, but not particularly helpful. If that's the case, then I should be able to get the same error with curl on the Gitea server, right?
curl https://ci.bobsrockets.com/hooks/testhook
.....wrong. It worked flawlessly. Every time.
Not to be beaten by such an annoying issue, I moved on to my next suspicion. Since my CI server is unfortunately behind NAT, I checked the NAT rules on the router in front of it to ensure that it was being exposed correctly.
Unfortunately, I couldn't find anything wrong here either! By this point, it was starting to get really rather odd. As a sanity check, I decided to look at the logs on the CI server itself, since I'm running Webhook behind Nginx (as a reverse proxy):
5.196.73.75 - - [04/Dec/2018:20:48:05 +0000] "POST /hooks/laminar-config-check HTTP/1.1" 408 0 "-" "GiteaServer"
Now that's weird. Nginx has recorded an HTTP 408 error. Looking it up reveals that it's a Request Timeout error, which has the following definition:
The server did not receive a complete request message within the time that it was prepared to wait.
Wait, what? Sounds to me like there's an argument going on between the two servers here - in which each server is claiming that the other didn't send a complete request or response.
At this point, I blamed this on a faulty HTTP implementation in Gitea, and opened an issue.
As a workaround, I ended up configuring Laminar to use a Unix socket on disk (as opposed to an abstract socket), forwarding it over SSH, and using a git hook to interact with it instead (more on how I managed this in a future post. There's a ton of shell scripting that I need to talk about first).
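For the curious, the socket forwarding part of that workaround looks something like this. It's a minimal sketch: the paths, user, and job name are hypothetical, it assumes LAMINAR_BIND_RPC in /etc/laminar.conf is set to a unix:/path-style socket, and it assumes laminarc accepts the same address syntax via the LAMINAR_HOST environment variable:
# On the Gitea server: forward a local socket to Laminar's RPC socket
# on the CI box. StreamLocalBindUnlink=yes tells ssh to replace any
# stale local socket file left over from a previous run.
ssh -N -o StreamLocalBindUnlink=yes \
    -L /tmp/laminar.sock:/var/run/laminar/laminar.sock \
    ci@ci.bobsrockets.com &

# Then a git post-receive hook can queue a job through the forwarded socket.
LAMINAR_HOST=unix:/tmp/laminar.sock laminarc queue some-job-name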
This isn't the end of this tale though! A month or two after I opened the issue, I wound up in a situation whereby I wanted to connect a GitHub repository to my CI server. Since I don't have shell access on github.com, I had to use a webhook.
When I did though, I got a nasty shock: The webhook deliveries exhibited the exact same random failures as I saw with the Gitea webhook. If I'd verified the Webhook server and cleared Gitea's HTTP implementation's name, then what else could be causing the problem?
At this point, I can only begin to speculate as to what the issue is. Personally, I suspect a bug in the port-forwarding logic of my router: perhaps it drops the first packet from a new IP address while it sets up a NAT session to forward packets to the CI server - so subsequent requests go through fine, so long as they're sent from the same IP and within the NAT session timeout. If you've got a better idea, please comment below!
Of course, I really wanted to get the GitHub repository connected to my CI server, and if the only way I could do this was with a webhook, it was time for some request-wrangling.
My solution: a PHP proxy script running on the same server as the Gitea server (since it has a PHP-enabled web server set up already). If said script eats the request and emits a 202 Accepted immediately, then it can continue trying to get a hold of the webhook on the CI server 'till the cows come home - and GitHub will never know! Genius.
PHP-FPM (the FastCGI process manager; great alongside Nginx) makes this possible with the fastcgi_finish_request() function, which both flushes the buffer and ends the request to the client, but doesn't kill the PHP script - allowing further processing to take place without the client having to wait.
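Here's a minimal sketch of the pattern in isolation - the 5-second sleep just stands in for slow work, and nothing here is specific to my setup:
<?php
// Send a complete response to the client first...
http_response_code(202);
header("content-type: text/plain");
echo("Accepted - working on it!");
// ...then detach from the client. This requires PHP-FPM; the function
// doesn't exist under other SAPIs such as mod_php or the CLI server.
fastcgi_finish_request();
// The client already has its response by this point, so slow work here
// doesn't delay them - but it does keep this FPM worker busy.
sleep(5);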
Extreme caution must be taken with this approach however, as it can easily lead to a situation where all the PHP-FPM processes in the pool (there are only ever pm.max_children of them) are busy waiting on replies from the CI server, leaving no room for other requests to be fulfilled - and a big messy pile-up forming in the queue behind them.
Warnings aside, here's what I came up with:
<?php
$settings = [
    "target_url" => "https://ci.bobsrockets.com/hooks/laminar-git-repo",
    "response_message" => "Processing laminar job proxy request.",
    "retries" => 3,
    "attempt_timeout" => 2 // in seconds, for a single attempt
];

// Rebuild the client's headers for forwarding, swapping the host header
// out for that of the CI server so that its vhost matches.
$headers = "host: ci.bobsrockets.com\r\n";
foreach(getallheaders() as $key => $value) {
    if(strtolower($key) == "host") continue;
    $headers .= "$key: $value\r\n";
}
$headers .= "\r\n";

// Capture the entire request body for forwarding later.
$request_content = file_get_contents("php://input");

// --------------------------------------------

// Send a complete 202 response and detach from the client - GitHub hears
// back immediately, while we carry on with the retry loop below.
http_response_code(202);
header("content-type: text/plain");
header("content-length: " . strlen($settings["response_message"]));
echo($settings["response_message"]);
fastcgi_finish_request();

// --------------------------------------------

function log_message($msg) {
    file_put_contents("ci-requests.log", $msg . "\n", FILE_APPEND);
}

for($i = 0; $i < $settings["retries"]; $i++) {
    $start_time = microtime(true);

    $context = stream_context_create([
        "http" => [
            "header" => $headers,
            "method" => "POST",
            "content" => $request_content,
            "timeout" => $settings["attempt_timeout"]
        ]
    ]);
    $result = file_get_contents($settings["target_url"], false, $context);

    if($result !== false) {
        log_message("[" . date("r") . "] Queued laminar job in " . (microtime(true) - $start_time)*1000 . "ms");
        break;
    }
    log_message("[" . date("r") . "] Failed to queue laminar job after " . (microtime(true) - $start_time)*1000 . "ms.");
}
I've named it autowrangler.php. A few things of note here:
- php://input is a special virtual file that's mapped internally by PHP to the client's request. By eating it with file_get_contents(), we can get the entire request body that the client has sent to us, so that we can forward it on to the CI server.
- getallheaders() lets us get a hold of all the headers sent to us by the client for later forwarding.
- I use log_message() to keep a log of the successes and failures in a log file. So far I've got a ~32% failure rate, but never more than 1 failure in a row - giving some credit to my earlier theory I talked about above.
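Giving it a poke is as simple as POSTing something at it with curl - the URL here is hypothetical, and is simply wherever you've dropped the script on your PHP-enabled web server:
curl -X POST --header "Content-Type: application/json" \
    --data '{ "test": true }' \
    https://git.bobsrockets.com/autowrangler.php

# ...then watch ci-requests.log (written to the script's working
# directory) to see whether the forwarding attempts succeeded
tail -f ci-requests.log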
This ends the tale of the recalcitrant and unreliable webhook. Hopefully you've found this an interesting read. In future posts, I want to look at how I configured Webhook, the inner workings of the git hook I mentioned above, and the collection of shell scripts I've cooked up that make my CI server tick in a way that makes it easy to add new projects quickly.
Found this interesting? Run into this issue yourself? Found a better workaround? Comment below!