Stardust | Starbeamrainbowlabs

Rendering Time plan / Gantt charts: hourgraph

I have a number of tools and other programs I've implemented, but forgotten to blog about here - hourgraph is one such tool I stumbled across today again. Originally I implemented it for my PhD panel 1 topic project analysis report, as I realised that not only have I manually created a number of these, but I'm going to have to create a bunch more in the future, but I open-sourced it as I usually do with most of the things I write in the hopes that someone else will find it useful.

I've published it on NPM, so you can install it like this:

npm install --global hourgraph

You'll need Node.js installed, and Linux users will need to prefix the above with sudo.

The program takes in a TOML definition file. Here's an example:

width = 1500
height = 480
title = "Apples"

[[task]]
name = "Pick apples"
start = 0
duration = 3

[[task]]
name = "Make apple juice"
start = 2
duration = 2

[[task]]
name = "Enjoy!"
start = 4
duration = 4
colour = "hsl(46, 90%, 60%)"
ghost_colour = "hsla(46, 90%, 60%, 0.1)"

The full set of options are available in the default config file, which is loaded in to fill in any gaps of things you haven't specified in your custom file.

Comprehensive usage instructions are found in the README, but you can render a new time plan chart thingy like this:

hourgraph --input path/to/input.toml --output path/to/output.toml

The above renders to this:

Hourgraph output

Personally, I find it's much easier to create charts like this by defining them in a simple text file that is then rendered into the actual thing. That way, I don't have to fiddle with the layout myself - it all comes out in the wash automatically.

For those interested in the code, it can be found here: https://github.com/sbrl/hourgraph

EmbedBox: Lightweight syntax-highlighted embeds

I was planning posting about something else yesterday, but I wanted to show some GitLab code in a syntax-highlighted embed. When I wasn't able to figure out how to do that, I ended up writing EmbedBox.

The whole thing is best explained with an example. Have an embed:

(Can't see the above? Check out the original file here)

Pretty cool, right? The above is the default settings file for EmbedBox. Given any URL (e.g. https://raw.githubusercontent.com/sbrl/EmbedBox/master/src/settings.default.toml), it will generate a syntax-highlighted embed for it.

It does so using highlight.php to do the syntax-highlighting server-side, Stash PHP for the cache, and without any Javascript in the embed itself.

It comes with a web interface that generates the embed code given the input URL and a few other settings and shows a preview of what it'll look like.

EmbedBox is open-source too (under the Mozilla Public Licence 2.0), so you're welcome to setup your own instance!

To do so, check out the code here: https://github.com/sbrl/EmbedBox/

The installation instructions should be pretty straightforward in theory, but if you get stuck please open an issue.

Now that I've implemented EmbedBox, you can expect to see it appear in future blog posts. I'm planning to write about my organise-photos script in the near future, so expect a blog post about it soon.

Found this interesting? Got a suggestion? Want to say hi? Comment below!

Starting my PhD on the mapping of flooding

Specifically, using new technologies such as AI and the Internet of Things to map and predict where it's going to flood in real-time.

This year, I'm starting a 3 year funded PhD on dynamic flood risk mapping, as part of a cluster of water-related PhDs that are all being funded at the same time. I've got some initial ideas as to the direction I'm going to take it too, which I'd like to talk a little bit about in this post.

I've got no shortage of leads as to potential data sources I can explore to achieve this. Just some of the avenues I'm exploring include:

Monitoring street drains
Fitting local council vehicles with sensors
Analysing geotagged tweets with natural language processing techniques
Predicting rainfall with aggregate mobile phone signal strength information
Vehicle windscreen wipers
Setting up static sensors? Not sure on this one.

Also, I've talked to a few people and thought of some pre-existing data sources that might come in useful:

Elevation maps
Vegetation maps
- Normalised difference vegetation index - Map of vegetation density that's already available
- Normalised difference water index - detects water on the surface of the earth - including that contained within leaves
River levels

Finally, I've been thinking about what I'm going to be doing with all this data I'd potentially be collecting. First and foremost, I'm going to experiment with InfluxDB as a storage mechanism. Apparently it's supposed to be able to handle high read and write loads, so it sounds suitable a first glance.

Next, I'm probably going to wind up training an AI - possibly incrementally - to predict flooding. Unlike my summer project, I'm probably going to be using a more cutting-edge and exotic AI architecture.

I suspect I might end up with a multi-level system too - whereby I pre-analyse the incoming data and process it into a format that the AI will take. For example, if I end up using geotagged social media posts, those will very likely filter through an AI of some description that does the natural language processing first - the output of which will be (part of) the input (or training output?) for the next AI in the chain.

I've given some thought to training data too. My current thinking is that while some data sources might be used as inputs to a network of interconnected AIs, others are more likely to take on a corrective role - i.e. improving the accuracy of the AI and correcting the model to fit a situation as it unfolds.

All this will not only require huge amounts of data, but will also require a sophisticated system that's capable of training an AI on past datasets as if they were happening in real-time. I suppose in a way the training process is a bit like chronological history lessons at speed - catching the AI up to the present day so that it can predict flood risk in real-time.

Before all this though, my first and foremost task will be to analyse what people have done already to map and predict flood risk. Only by understanding the current state of things will I be able to improve on what's already out there.

Found this interesting? Got any tips? Comment below!

Summer Project Part 6: A matching bookend

Since the last post, I've completed the project - at least in it's initial form. The IoT device I've been building is finished - along with the backend that receives the data and trains an AI on it. I've learnt a great deal whilst working on it. Unfortunately, I've been very short on time, so I haven't been able to blog about it as frequently as I'd have liked to.

Before we continue, it's perhaps best to show a diagram here of how the completed system works:

A colour-coded diagram. See the full explanation below. (Above: A diagram showing the typical workflow when using the completed system. Taken from my dissertation.).

Operation is divided up into 3 sections:

Data collection: Using the Internet of Things (IoT) device to collect data.
Processing the collected data: The data file on the microSD card of the device is folded into the main database (more on this later), and AIs are trained
Viewing the final map in your browser.

Because the application is somewhat distributed in nature, it's not exactly obvious how the data flows across the application. Let's use another diagram to show how this works:

Another brightly-coloured diagram - this time showing the data flow throughout the application. Explanation below. (Above: A diagram showing the data flow throughout the application. Taken from my dissertation.)

This diagram looks complicated, but it really isn't as complex as it looks.

First, the IoT device takes a reading. This includes the GPS location (latitude and longitude) and a random id (unsigned 32-bit integer)
This reading is both stored on a local SD card, and transmitted via LoRa
The copy transmitted via LoRa travels through The Things Network and is picked up by the TTN listener, which stores the reading in an SQLite database
Once data collection has been completed, the data on the microSD card is folded into the SQLite database generated by the TTN listener by the data processor. This ensures we have data points for places that the signal isn't, as well as where it is.
1 AI is then trained per gateway. These AIs are trains on the signal strength data for their gateway, but also the data for where the signal is not.
The trained AIs are serialised to disk with an index file, where the web interface picks them up.

Training the AIs was a bit of a pain. Getting the inputs and outputs right along with the AIs topology to a point where the output is meaningful was quite challenging - especially considering the limited amount of data I managed to collect. I'd like to tweak and improve on it further, if I have time.

I ended up getting quite a bit of noise in the final dataset that I trained the AIs with - simply because of the way I designed the IoT device - and because of the time constraints, I didn't have the time to go back and fix it. The device currently logs the data point to disk and transmits it via LoRa afterwards. When the battery was low, I noticed that it had a tendency to reset when transmitting via LoRa, leading to a reset-loop and lots of bogus readings on the microSD card. And because I don't store the timestamp, I can't even tell them apart!

I also had a bit of a problem with configuration files - specifically that they kept multiplying. At the end of the project, I now have no fewer than 3-4 different configuration files - all in different formats! I'd love to consolidate them into 1 single configuration file, to make configuration much easier for the end user.

1 for the pin definitions and features to enable / disable in C++ #define statements for the IoT device
Another 1 in C++ for the private TTN LMiC settings
1 in TOML for the server
1 in Javascript for the web interface

Ideally, I want everything to be defined in the TOML configuration file, since I developed a system by which I have a default configuration file, and a custom one in which the user can override any of the default settings.

With all this said, the output isn't bad:

A screenshot of the web interface. See below.

(Above: A screenshot of the web interface, with the background map removed for privacy)

The AIs do appear to have learnt a round circle of coverage around the gateways, which is somewhat frustrating - since the aim was to take obstructions into account too. I suspect that I need more data - and to figure out a way of training the AI of places that other gateways pick up a signal, but the current gateway does not (currently the gateway is only trained on positive readings and negative ones where no gateway picks up the signal).

If anyone is interested in the code behind the project, I have now open-sourced it, and it is available here:

LoRaWAN Signal Mapping

The repository includes source code, user and hardware manuals, and a circuit diagram.

I may blog about this in the future, but it's likely to be a while. I'm starting a PhD this academic year, which is sure to be fun! Expect some posts about that in the near future.

Found this interesting? Comment below!

Next Gen Search, Part 2: Pushing the limits

In the last part, we looked at how I built a new backend for storing inverted indexes for Pepperminty Wiki, which allows for partial index deserialisation and other nice features that boost performance considerably.

Since the last post, I've completed work on the new search system - though there are a few bits around the edges that I still want to touch up and do some more work on.

In this post though, I want to talk about how I generated test data to give my full-text search engine something to chew on. I've done this before for my markov chain program I wrote a while back, and that was so much fun I did it again for my search engine here.

After scratching my head for a bit to think of a data source I could use, I came up with the perfect plan. Ages ago I downloaded a Wikipedia dump - just the content pages in Wikitext markup. Why not use that?

As it turns out, it was a rather good idea. Some processing of said dump was required though to transform it into a format that Pepperminty Wiki can understand, though. Pepperminty Wiki stores pages on disk as flat text files in markdown, and indexes them in pageindex.json. If pageindex.json doesn't exist, then Pepperminty Wiki rebuilds it automagically by looking for content pages on disk.

This makes it easy to import batches of new pages into Pepperminty Wiki, so all we need to do is extract the wiki text, convert to markdown, and import! This ended up requiring a number of different separate steps though, so let's take it 1 at a time

First, we need a Wikipedia database dump in the XML format. These are available from dumps.wikimedia.org. There are many different ones available, but I suggest grabbing one that has a filename similar to enwiki-20180201-pages-articles.xml - i.e. just content pages - no revision history, user pages, or additional extras. I think the most recent one as of the time of posting is downloadable here - though I'd warn you that it's 15.3GiB in size! You can see a list of available dump dates for the English Wikipedia here.

Now that we've got our dump, let's extract the pages from it. This is nice and easy to do with wikiextractor on GitHub:

nice -n20 wikiextractor enwiki-20180201-pages-articles.xml --no_templates --html --keep_tables --lists --links --sections --json --output wikipages --compress --bytes 25M >progress.log 2>&1

This will parse the dump and output a number of compressed files to the wikipages directory. These will have 1 JSON object per line, each containing information about a single page on Wikipedia - with page content pre-converted to HTML for us. The next step is to extract the page content and save it to a file with the correct name. This ended up being somewhat complicated, so I wrote a quick Node.js script to do the job:

#!/usr/bin/env node

const readline = require("readline");
const fs = require("fs");


if(!fs.existsSync("pages"))
        fs.mkdirSync("pages", { mode: 0o755 });

// From https://stackoverflow.com/a/44195856/1460422
function html_unentities(encodedString) {
        var translate_re = /&(nbsp|amp|quot|lt|gt);/g;
        var translate = {
                "nbsp":" ",
                "amp" : "&",
                "quot": "\"",
                "lt"  : "<",
                "gt"  : ">"
        };
        return encodedString.replace(translate_re, function(match, entity) {
                return translate[entity];
        }).replace(/&#(\d+);/gi, function(match, numStr) {
                var num = parseInt(numStr, 10);
                return String.fromCharCode(num);
        });
}

const interface = readline.createInterface({
        input: process.stdin,
        //output: process.stdout
});

interface.on("line", (text) => {
        const obj = JSON.parse(text);

        fs.writeFileSync(`pages/${obj.title.replace(/\//, "-")}.html`, html_unentities(obj.text));
        console.log(`${obj.id}\t${obj.title}`);
});

This basically takes the stream of JSON object on the standard input, parses them, and saves the relevant content to disk. We can invoke it like so:

bzcat path/to/*.bz2 | ./parse.js

Don't forget to chmod +x parse.js if you get an error here. The other important thing about the above script ist hat we have to unescape the HTML entities (e.g. >), because otherwise we'll have issues later with HTML conversation and page names will look odd. This is done by the html_unentities() function in the above script.

This should result in a directory containing a large number of files - 1 file per content page. This is much better, but we're still not quite there yet. Wikipedia uses wiki markup (which we converted to HTML with wikiextractor) and Pepperminty Wiki uses Markdown - the 2 of which are, despite all their similarities, inherently incompatible. Thankfully, pandoc is capable of converting from HTML to markdown.

Pandoc is great at this kind of thing - it uses an intermediate representation and allows you to convert almost any type of textual document format to any other format. Markdown to PDF, EPUB to plain text, ..... and HTML to markdown (just to name a few). It actually looks like it shares a number of features with traditional compilers like GCC.

Anyway, let's use it to convert our folder full of wikitext files to a folder full of markdown:

mkdir -p pages_md;
find pages/ -type f -name "*.html" -print0 | nice -n20 xargs -P4 -0 -n1 -I{} sh -c 'filename="{}"; title="${filename##*/}"; title="${title%.*}"; pandoc --from "html"  --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";';

_(See this on explainshell.com - doesn't include the nice -n20 due to a bug on their end)_

This looks complicated, but it really isn't. Let's break it down a bit:

find pages/ -type f -name "*.html" -print0

This finds all the HTML files that we want to convert to Markdown, and delimits the output with a NUL byte - i.e. 0x0`. This makes the next step easier:

... | nice -n20 xargs -P4 -0 -n1 -I{} sh -c '....'

This pushes a new xargs instance into the background, which will execute 4 commands at a time. xargs executes a command for each line of input it receives. In our case, we're delimiting with NUL 0x0 instead though. We explicitly specify that we can 1 command per line of input though, as xargs tries to optimise and do command file1 file2 file3 instead.

The sh -c bit is starting a subshell, in which we execute a small wrapper script that then calls pandoc. This is of course inefficient, but I couldn't find any way around spawning a subshell in this instance.

filename="{}";
title="${filename##*/}";
title="${title%.*}";
pandoc --from "html" --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md";
echo "${title}";

I've broken the sh -c subshell script down into multiple liens for readability. Simply put, it extracts the page title from the filename, converts the HTML to Markdown, and saves it to a new file in a different directory with the .md replacing the original .html extension.

When you put all these components together, you get a script that converts a folder full of HTML files to Markdown. Just like with the markov chains extraction I mentioned at the beginning of this post, Bash and shell scripting really is all about lego bricks. This is due in part to the Unix philosophy:

Make each program do one thing well.

There is more to it, but this is the most important point to remember. Many of the core utilities you'll find on the terminal follow this way of thinking.

There's 1 last thing we need to take care of before we have them in the right format though - we need to convert the [display text](page name) markdown-format links back into the Wikipedia [[internal link]] format that Pepperminty Wiki also uses.

Thankfully, another command-line tool I know of called repren is well-suited to this:

repren --from '\[([^\]]+)\]\(([^):]+)\)' --to '[[\1]]' pages_md/*.md

It took some fiddling, but I got all the escaping figured out and the above converts back into the [[internal link]] format well enough.

Now that we've got our folder full of markdown files, we need to extract a random portion of them to act as a test for Pepperminty Wiki - as the whole lot might be a bit much for it to handle (though if Pepperminty Wiki was capable of handling it all eventually that'd be awesome :D). Let's try 500 pages to start:

find path/to/wikipages/ -type f -name "*.md" -print0 | shuf --zero-terminated | head -n500 --zero-terminated | xargs -0 -n1 -I{} cp "{}" .

(See this on explainshell.com)

This is another lego-brick style command. Let's break it down too:

find path/to/wikipages/ -type f -name "*.md" -print0

This lists all the .md files in a directory, delimiting them with a NUL character, as before. It's better to do this than use ls, as find is explicitly designed to be machine-readable.

.... | shuf --zero-terminated

The shuf command randomly shuffles the input lines. In this case, we're telling it that the input is delimited by the NUL byte.

.... | head -n500 --zero-terminated

Similar deal here. head takes the top N lines of input, and discards the rest.

.... | xargs -0 -n1 -I{} cp "{}" .

Finally, xargs calls cp to copy the selected files to the current directory - which is, in this case, the root directory of my test Pepperminty Wiki instance.

Since I'm curious, let's now find out roughly how many words we're dealing with here:

cat data_test/*.md | wc --words
1593190

1.5 million words! That's a lot. I wonder how quickly we can search that?

24.8ms? Awesome! That's so much better than before. If you're wondering about new coat of paint in the screenshot - Pepperminty Wiki is getting dark theme, thanks to prefers-color-scheme :D

I wonder what happens if we push it to 2K pages?

This time we get ~120ms for 5.9M total words - wow! I wasn't expecting it to perform so well. At this scale, rebuilding the entire index is particularly costly - so if I was to push it even further it would make sense to implement an incremental approach that spreads the work over multiple requests, assuming I can't squeeze any more performance out the system as-is.

The last thing I want to do here is make a rough estimate about time time complexity the search system as-is, given the data we have so far. This isn't particularly difficult to do.

Given the results above, we can calculate that at 1.5M total words, an increase of ~60K total words results in an increase of 1ms of execution time. At 5.9M words, it's only ~49K words / ms of execution time - a drop of ~11K words / ms of execution time.

From this, we can speculate that for every million words total added to a wiki, we can expect a ~2.5K words / ms of execution time drop - not bad! We'd need more data points to make any reasonable guess as to the Big-O complexity function that it conforms to. My guess would be something like $O(xN^2)$, where x is a constant between ~0.2 and 2.

Maybe at some point I'll go to the trouble of running enough tests to calculate it, but with all the variables that affect the execution time (number of pages, distribution of words across pages, etc.), I'm not in any hurry to calculate it. If you'd like to do so, go ahead and comment below!

Next time, I'll unveil the inner working of the STAS: my new search-term analysis system.

Found this interesting? Got your own story about some cool code you've written to tell? Comment below!

Next Gen Search, Part 1: Backend Storage

I've got a bit of a thing about full-text search engines. I've talked about one in particular before for Pepperminty Wiki, and I was thinking about it again the other day - and how I could optimise it further.

If you haven't already, I do recommend reading my previous post on the curious question - as a number of things in this post might not make sense otherwise.

Between the time I wrote that last post and now, I've done quite a bit more digging into the root causes of that ~450ms search time, and I eventually determined that most of it was actually being spent deserialising the inverted index from the JSON file it's stored in back into memory.

This is horribly inefficient and is actually taking up 90% of that query time, so I got to thinking about what I could do about it.

My solution was multi-faceted, as I also (separately) developed a new search-term analysis system (I abbreviated to STAS, because it sounds cool :D) to add advanced query syntax such as -dog to exclude pages that contain the word dog and the like - which I also folded in at the same time as fixing the existing bottleneck.

As of the time of typing, the work on STAS is still ongoing. This doesn't mean that I can't blog about the rest of it though! I've recently-ish come into contact with key-value data stores, such as Redis and LevelDB (now RocksDB). They work rather like a C♯ dictionary or a browser's localStorage, in that they store values that are associated with unique keys (Redis is a bit of a special case though, for reasons that I won't get into here).

Such a data store would suit an inverted index surprisingly well. I devised a key system to represent an inverted index:

The first key, |termindex|, is used to store a master list of words that have entries in the page index.
The second key, term is simply the word itself (e.g. cat, chicken, rocket, etc.), and stores a list of ids of the pages that contain that word.
The third and final key, term|pageid, is a word followed by a separator and finally the id of a page (e.g. bill|1, cat|45, boosters|69, etc.). These keys store the locations that a particular word appears at in a given document in the index. A separate id index is needed to convert between the page id and it's associated name - Pepperminty Wiki provides this functionality out-of-the-box.

The more I thought about it, the more I liked it. If I use a key-value data store in this manner, I can store the values as JSON objects - and then I only have to deserialise the parts of the index that I actually use. Furthermore, adding this extra layer of abstraction allows for some clever trickery to speed things up even more.

The problem here is that Pepperminty Wiki is supposed to be portable, so I try not to use any heavy external libraries or depend on odd PHP modules that not everyone will have installed. While a LevelDB extension for PHP does exist, it's not installed by default and it's a PECL module, which are awkward to install.

All isn't lost though, because it turns out that SQLite functions surprisingly well as a key-value data store:

CREATE TABLE store (
    key TEXT UNIQUE NOT NULL,
    value TEXT
);

Yes - it really is that simple! Now all we need is some code to create and interface with such a database. Some simple getters and setters should suffice!

(Can't see the above? Try a direct link.)

While this works, I quickly ran into several major problems:

Every time I read from the data store I'm decoding JSON, which is expensive
Every time I'm saving to the data store, I'm encoding to JSON, which is also expensive
I'm reading and writing the same thing multiple times, which is very expensive
Writing anything to the data store takes a long time

The (partial) solution here was to refactor it such that the json encoding is handled by the storage provider, and to implement a cache.

Such a cache could just be an associative array:

private $cache = [];

Then, to fetch a value, we can do this:

// If it's not in the cache, insert it
if(!isset($this->cache[$key])) {
    $this->cache[$key] = [ "modified" => false, "value" => json_decode($this->query(
        "SELECT value FROM store WHERE key = :key;",
        [ "key" => $key ]
    )->fetchColumn()) ];
}
return $this->cache[$key]["value"];

Notice how each item in the cache is also an associative array. This is so that we can flag items that have been modified in memory, such that when we next sync to disk we can write multiple changes all at once in a single batch. That turns the setter into something like this:

if(!isset($this->cache[$key])) $this->cache[$key] = [];
$this->cache[$key]["value"] = $value;
$this->cache[$key]["modified"] = true;

Very cool! Now all we need is a function to batch-write all the changes to disk. This isn't hard either:

foreach($this->cache as $key => $value_data) {
    // If it wasn't modified, there's no point in saving it, is there?
    if(!$value_data["modified"])
        continue;

    $this->query(
        "INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)",
        [
            "key" => $key,
            "value" => json_encode($value_data["value"])
        ]
    );
}

I'll get onto the cogs and wheels behind that query() function a bit later in this post. It's one of those utility functions that are great to have around that I keep copying from one project to the next.

Much better, but still not great. Why does it still take ages to write all the changes to disk?

Well, it turns out that by default SQLite wraps every INSERT in it's own transaction. If we wrap our code in an explicit transaction, we can seriously boost the speed even further:

$this->db->beginTransaction();

// Do batch writing here

$this->db->commit();

Excellent (boomed the wizard).

But wait, there's more! The PHP PDO database driver supports prepared statements, which is a fancy way of caching SQL statements and substituting values into them. We use these already, but since we only use a very limited number of SQL queries, we can squeak some additional performance out by caching them in their prepared forms (preparing them is relatively computationally expensive, after all).

This is really easy to do. Let's create another associative array to store them in:

private $query_cache = [];

Then, we can alter the query() function to look like this:

/**
 * Makes a query against the database.
 * @param   string  $sql        The (potentially parametised) query to make.
 * @param   array   $variables  Optional. The variables to substitute into the SQL query.
 * @return  \PDOStatement       The result of the query, as a PDOStatement.
 */
private function query(string $sql, array $variables = []) {
    // Add to the query cache if it doesn't exist
    if(!isset($this->query_cache[$sql]))
        $this->query_cache[$sql] = $this->db->prepare($sql);
    $this->query_cache[$sql]->execute($variables);
    return $this->query_cache[$sql]; // fetchColumn(), fetchAll(), etc. are defined on the statement, not the return value of execute()
}

If a prepared statement for the given SQL exists in the query cache already, we re-use that again instead of preparing a brand-new one. This works especially well, since we perform a large number of queries with the same small set of SQL queries. Get operations all use the same SQL query, so why not cache it?

This completes our work on the backend storage for the next-generation search system I've built for Pepperminty Wiki. Not only will it boost the speed of Pepperminty Wiki's search engine, but it's also a cool reusable component, which I can apply to other applications to give them a boost, too!

Next time, we'll take a look at generating the test data required to stress-test the system.

I also want to give an overview of the new search-term analysis system and associated changes to the page ranking system I implemented (and aspire to implement) too, but those may have to wait until another post (teaser: PageRank isn't really what I'm after).

(Can't see the above? Try a direct link.)

Next

The infrastructure behind Air Quality Web

For a while now, I've been working on Air-Quality-Web, a web interface that displays air quality information. While I haven't blogged about it directly before, a number of posts (a, b, c, d) I've made here have been indirectly related.

Since the air quality data has to come from somewhere, I thought I'd blog a little about the wider infrastructure behind the air quality web interface. My web interface is actually just 1 small part of a much wider stack of software that's being developed as a group by Connected Humber.

Said stack is actually quite distributed, so let's start with a diagram:

From left to right:

As a group we've designed a PCB (mainly thanks to @BNNorman) that acts as the base for sensor nodes themselves - though a number of people have built their own hardware.
Multiple different pieces of software run on top of the various pieces of hardware we've developed - some people use ESP Easy, and others use custom firmware they've implemented themselves.
Embedded devices send the data over WiFi to our MQTT broker (LoRaWAN via The Things Network is currently under development), which currently runs in a Debian Virtual Machine rented from a cloud infrastructure provider.
Another Debian VM hosts a database loading script, which listens for MQTT messages sent to the broker. It adds the data contained within into a database, which runs on the same box.
A final box hosts the web server, which simultaneously hosts the PHP-based HTTP API and the client-side web interface. Both of these are currently located in this repository, but later down the line I'd like to figure out how to decouple them into their own separate repositories.

We can represent the flow of data here in a flowchart, to get a better idea as to how it all fits together:

As you can see, there are many areas of the project that can be worked on independently of each other - depending on what people feel most comfortable working on. Personally, I stick mainly to the HTTP API and the main web interface with a hand in advising on database design, but there are lots of other ways to get involved if you so choose!

Sensors always need building, designing, and programming, and the data generated is available via the public HTTP API (the docs for which can be found here) - so anyone can write their own application on top of the data collected by our sensors. Want a light on your desk (or even your hat) that changes colour depending on your local air quality? Go ahead!

Found this interesting? Comment below!

Summer Project Part 5: When is a function not a function?

Another post! Looks like I'm on a roll in this series :P

In the last post, I looked at the box I designed that was ready for 3D printing. That process has now been completed, and I'm now in possession of an (almost) luminous orange and pink box that could almost glow in the dark.......

I also looked at the libraries that I'll be using and how to manage the (rather limited) amount of memory available in the AVR microprocessor.

Since last time, I've somehow managed to shave a further 6% program space off (though I'm not sure how I've done it), so most recently I've been implementing 2 additional features:

An additional layer of AES encryption, to prevent The Things Network for having access to the decrypted data
GPS delta checking (as I'm calling it), to avoid sending multiple messages when the device hasn't moved

After all was said and done, I'm now at 97% program space and 47% global variable RAM usage.

To implement the additional AES encryption layer, I abused LMiC's IDEETRON AES-128 (ECB mode) implementation, which is stored in src/aes/ideetron/AES-128_V10.cpp.

It's worth noting here that if you're doing crypto yourself, it's seriously not recommended that you use ECB mode. Please don't. The only reason that I used it here is because I already had an implementation to hand that was being compiled into my program, I didn't have the program space to add another one, and my messages all start with a random 32-bit unsigned integer that will provide a measure of protection against collision attacks and other nastiness.

Specifically, it's the method with this signature:

void lmic_aes_encrypt(unsigned char *Data, unsigned char *Key);

Since this is an internal LMiC function declared in a .cpp source file with no obvious header file twin, I needed to declare the prototype in my source code as above - as the method will only be discovered by the compiler when linking the object files together (see this page for more information about the C++ compilation process. While it's for regular Linux executable binaries, it still applies here since the Arduino toolchain spits out a very similar binary that's uploaded to the microprocessor via a programmer).

However, once I'd sorted out all the typing issues, I slammed into this error:

/tmp/ccOLIbBm.ltrans0.ltrans.o: In function `transmit_send':
sketch/transmission.cpp:89: undefined reference to `lmic_aes_encrypt(unsigned char*, unsigned char*)'
collect2: error: ld returned 1 exit status

Very strange. What's going on here? I declared that method via a prototype, didn't I?

Of course, it's not quite that simple. The thing is, the file I mentioned above isn't the first place that a prototype for that method is defined in LMiC. It's actually in other.c, line 35 as a C function. Since C and C++ (for all their similarities) are decidedly different, apparently to call a C function in C++ code you need to declare the function prototype as extern "C", like this:

extern "C" void lmic_aes_encrypt(unsigned char *Data, unsigned char *Key);

This cleaned the error right up. Turns out that even if a function body is defined in C++, what matters is where the original prototype is declared.

I'm hoping to release the source code, but I need to have a discussion with my supervisor about that at the end of the project.

Found this interesting? Come across some equally nasty bugs? Comment below!

Summer Project Part 4: Threading the needle and compacting it down

In the last part, I put the circuit for the IoT device together and designed a box for said circuit to be housed inside of.

In this post, I'm going to talk a little bit about 3D printing, but I'm mostly going to discuss the software aspect of the firmware I'm writing for the Arduino Uno that's going to be control the whole operation out in the field.

Since last time, I've completed the design for the housing and sent it off to my University for 3D printing. They had some great suggestions for improving the design like making the walls slightly thicker (moving from 2mm to 4mm), and including an extra lip on the lid to keep it from shifting around. Here are some pictures:

(Left: The housing itself. Right: The lid. On the opposite side (not shown), the screw holes are indented.)

At the same time as handling sending the housing off to be 3D printed, I've also been busily iterating on the software that the Arduino will be running - and this is what I'd like to spend the majority of this post talking about.

I've been taking an iterative approach to writing it - adding a library, interfacing with it and getting it to do what I want on it's own, then integrating it into the main program.... and then compacting the whole thing down so that it'll fit inside the Arduino Uno. The thing is, the Uno is powered by an ATmega328P (datasheet). Which has 32K of program space and just 2K of RAM. Not much. At all.

The codebase I've built for the Uno is based on the following libraries.

LMiC (the matthijskooijman fork) - the (rather heavy and needlessly complicated) LoRaWAN implementation
Entropy for generating random numbers as explained in part 2
TinyGPS, for decoding NMEA messages from the NEO-6M
SdFat, for interfacing with microSD cards over SPI

Memory Management

Packing the whole program into a 32K + 2K box is not exactly an easy challenge, I discovered. I chose to first deal with the RAM issue. This was greatly aided by the FreeMemory library, which tells you how much RAM you've got left at a given point in the execution of your program. While it's a bit outdated, it's still a useful tool. It works a bit like this:

#include <MemoryFree.h>;

void setup() {
    Serial.begin(115200);
    Serial.println(freeMemory, DEC);
    char test[] = "Bobs Rockets";
    Serial.println(freeMemory, DEC); // Should be lower than the above call
}

void loop() {
    // Nothing here
}

It's worth taking a moment to revise the way stacks and heaps work - and the differences between how they work in the Arduino environment and on your desktop. This is going to get rather complicated quite quickly - so I'd advise reading this stack overflow answer first before continuing.

First, let's look at the locations in RAM for different types of allocation:

Things on the stack
Things on the heap
Global variables

Unlike on the device you're reading this on, the Arduino does not support multiple processes - and therefore the entirety of the RAM available is allocated to your program.

Since I wasn't sure about preecisly how the Arduino does it (it's processor architecture-specific), I wrote a simple test program to tell me:

#include <Arduino.h>

struct Test {
    uint32_t a;
    char b;
};

Test global_var;

void setup() {
    Serial.begin(115200);

    Test stack;
    Test* heap = new Test();

    Serial.print(F("Stack location: "));
    Serial.println((uint32_t)(&stack), DEC);

    Serial.print(F("Heap location: "));
    Serial.println((uint32_t)heap, DEC);

    Serial.print(F("Global location: "));
    Serial.println((uint32_t)&global_var, DEC);
}

void loop() {
    // Nothing here
}

This prints the following for me:

Stack location: 2295
Heap location: 461
Global location: 284

From this we can deduce that global variables are located at the beginning of the RAM space, heap allocations go on top of globals, and the stack grows down starting from the end of RAM space. It's best explained with a diagram:

Now for the differences. On a normal machine running an operating system, there's an extra layer of abstraction between where things are actually located in RAM and where the operating system tells you they are located. This is known as virtual memory address translation (see also virtual memory, virtual address space).

It's a system whereby the operating system maintains a series of tables that map physical RAM to a virtual address space that the running processes actually use. Usually each process running on a system will have it's own table (but this doesn't mean that it will have it's own physical memory - see also shared memory, but this is a topic for another time). When a process accesses an area of memory with a virtual address, the operating system will transparently translate the address using the table to the actual location in RAM (or elsewhere) that the process wants to access.

This is important (and not only for security), because under normal operation a process will probably allocate and deallocate a bunch of different lumps of memory at different times. With a virtual address space, the operating system can defragment the physical RAM space in the background and move stuff around without disturbing currently running processes. Keeping the free memory contiguous speeds up future allocations, and ensures that if a process asks for a large block of contiguous memory the operating system will be able to allocate it without issue.

As I mentioned before though, the Arduino doesn't have a virtual memory system - partly because it doesn't support multiple processes (it would need an operating system for that). The side-effect here is that it doesn't defragment the physical RAM. Since C/C++ isn't a managed language, we don't get _heap compaction_ either like in .NET environments such as Mono.

All this leads us to an environment in which heap allocation needs to be done very carefully, in order to avoid fragmenting the heap and causing a stack crash. If an object somewhere in the middle of the heap is deallocated, the heap will not shrink until everything to the right of it is also deallocated. This post has a good explanation of the problem too.

Other things we need to consider are keeping global variables to a minimum, and trying to keep most things on the stack if we can help it (though this may slow the program down if it's copying things between stack frames all the time).

To this end, we need to choose the libraries we use with care - because they can easily break these guidelines that we've set for ourselves. For example, the inbuilt SD library is out, because it uses a global variable that eats over 50% of our available RAM - and it there's no way (that I can see at least) to reclaim that RAM once we're finished with it.

This is why I chose SdFat instead, because it's at least a little better at allowing us to reclaim most of the RAM it used once we're finished with it by letting the instance fall out of scope (though in my testing I never managed to reclaim all of the RAM it used afterwards).

Alternatives like µ-Fat do exist and are even lighter, but they have restrictions such as no appending to files for example - which would make the whole thing much more complicated since we'd have to pre-allocate the space for the file (which would get rather messy).

The other major tactic you can do to save RAM is to use the F() trick. Consider the following sketch:

#include 

void setup() {
    Serial.begin(115200);
    Serial.println("Bills boosters controller, version 1");
}

void loop() {
    // Nothing here
}

On the highlighted line we've got an innocent Serial.println() call. What's not obvious here is that the string literal here is actually copied to RAM before being passed to Serial.println() - using up a huge amount of our precious working memory. Wrapping it in the F() macro forces it to stay in your program's storage space:

Serial.println(F("Bills boosters controller, version 1"));

Saving storage

With the RAM issue mostly dealt with, I then had to deal with the thorny issue of program space. Unfortunately, this is not as easy as saving RAM because we can't just 'unload' something when it's not needed.

My approach to reducing program storage space was twofold:

Picking lightweight alternatives to libraries I needed
Messing with flags of said libraries to avoid compiling parts of libraries I don't need

It is for these reasons that I ultimately went with TinyGPS instead of TinyGPS++, as it saved 1% or so of the program storage space.

It's also for this reason that I've disabled as much of LMiC as possible:

#define DISABLE_JOIN
#define DISABLE_PING
#define DISABLE_BEACONS
#define DISABLE_MCMD_DCAP_REQ
#define DISABLE_MCMD_DN2P_SET

This disables OTAA, Class B functionality (which I don't need anyway), receiving messaages, the duty cycle cap system (which I'm not sure works between reboots), and a bunch of other stuff that I'd probably find rather useful.

In the future, I'll probably dedicate an entire microcontroller to handling LoRaWAN functionality - so that I can still use the features I've had to disable here.

Even doing all this, I still had to trim down my Serial.println() calls and remove any non-essential code to bring it under the 32K limit. As of the time of typing, I've got jut 26 bytes to spare!

Next time, after tuning the TPL5110 down to the right value, we're probably going to switch gears and look at the server-side of things - and how I'm going to be storing the data I receive from the Arudino-based device I've built.

Found this interesting? Got a suggestion? Comment below!

Summer Project Part 3: Putting it together

In the first post in this series, I outlined my plans for my Msc summer project and what I'm going to be doing. In the second post, I talked about random number generation for the data collection.

In this post, I'm going to give a general progress update - which will mostly centre around the Internet of Things device I'm building to collect the signal strength data.

Since the last post, I've got nearly all the parts I need for the project, except the TPL5111 power manager and 4 rechargeable AA batteries (which should be easy to come by - I'm sure I've got some lying around somewhere).

I've also wired the thing up, with a cable standing in for the TPL5111.

The IoT device all wired up. It basically consists of an Arduino Uno with a red Dragino LoRa shield on top, with a pair of small breadboards containing the peripherals and black power management boards respectively.

The power management board there technically doesn't need a breadboard, but it makes mounting it in the box easier.

I still need to splice the connector onto the battery box I had lying around with some soldering and electrical tape - I'll do that later this week.

The wiring there is kind of messy, but I've tested each device individually and they all appear to work as intended. Here's a clearer diagram of what's going on that drew up in Fritzing (sudo apt install fritzing for Linux users):

Speaking of mounting things in the box, I've discovered OpenSCAD thanks to help from a friend and have been busily working away at designing a box to put everything in that can be 3D printed:

I've just got the lid to do next (which I'm going to do after writing this blog post), and then I'm going to get it printed.

With this all done, it's time to start working on the transport for the messages - namely using LMIC to connection to the network and send the GPS location to the application server, which is also unfinished.

The lovely people at the hardware meetup have lent me a full 8-channel LoRaWAN gateway that's connected to The Things Network for my project, which will make this process a lot easier.

Next time, I'll likely talk about 3D printing and how I've been 'threading the needle', so to speak.

Stardust Blog

Tag Cloud