Measuring maximum RAM usage with Bash on Linux

While running a simulation on my University's Viper HPC, I found that I needed to measure the maximum RAM usage of a simulation that I was running. Since the solution wasn't particularly easy to find, I thought I'd quickly blog about it here.

Doing this isn't actually as easy as you might think. In the end, I used this:

/usr/bin/time --format 'Max RAM working set size: %Mk' command_here --foo bar

....replacing command_here --foo bar with the command you want to measure.

/usr/bin/time is a program that measures how long a command takes to execute, but it's evident that it measures a bunch of other different things as well. While the output format leaves something to be desired (hence the --format in the above), it does the job pretty well.

Note that /usr/bin/time is distinct from the time built-in you get in Bash. Depending on your shell, you may need to explicitly specify the full path to the time binary. In addition, if your system doesn't have the command (like Viper), you may need to copy it from another system that does - provided it has the same CPU architecture.

I forget where it was that I found this solution, but if you comment below then I'll add the credit to this post if your post looks familiar.

As a quick extra, you can limit the time a command can execute for like this:

timeout 60 command_here

That will limit command_here to executing for only 60 seconds.
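The two compose as well, though I haven't tested this exact combination - the outermost program wraps the other, so something like this should (I believe) report the peak RAM of the run while also enforcing the time limit:

/usr/bin/time --format 'Max RAM working set size: %Mk' timeout 60 command_here --foo bar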

Found this interesting? Comment below!

How to hash and sign files with GPG and a bit of Bash

When making a release of your software or sending some important documents, it's pretty common practice (especially amongst larger projects) to distribute hashes and GPG signatures along with the release binaries themselves. Example projects that do this include HashiCorp's Nomad, which we'll use as the example for verification below.

This is great practice, since it allows downloaders to verify that their download has not been corrupted, and that it was you who released them and not some imposter.

In this post, I'm going to outline how you can do this too.

I've recently both verified a number of signatures and generated some of my own, so I thought I'd write up the process here to show others how it's done.

Verification

Before we get into generating hashes and signatures, we should first talk about verifying them. I've already mentioned that this is good practice, so it makes sense to cover verification before anything else. Let's download Nomad version 0.10.5 - you can grab the files here. Download the following files:

  • nomad_0.10.5_SHA256SUMS - The hashes themselves
  • nomad_0.10.5_SHA256SUMS.sig - The GPG signature of the above file
  • nomad_0.10.5_linux_arm.zip - Nomad itself, for the ARM architecture (feel free to pick whichever one you like and adapt these instructions accordingly)

Verifying the hashes is easier, so let's do that first. We can see from the filenames that we have SHA 256 hashes, so we'll want the sha256sum command. Windows users will need to use the Windows Subsystem for Linux, or set up an msys environment:

sha256sum --ignore-missing --check nomad_0.10.5_SHA256SUMS

It should output an OK message and return an exit code zero (to check this in a script, you can do echo "$?" directly after running it to check the exit code). 99% of the time this check will succeed, but you'll be glad that you checked the 1% of the time it fails.
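If you're doing this in a script, checking the result with an if statement reads a little more cleanly than inspecting $? by hand - a quick sketch:

if sha256sum --ignore-missing --check nomad_0.10.5_SHA256SUMS; then
    echo "Hashes OK";
else
    echo "Hash check FAILED - do not use this download!" >&2;
    exit 1;
fi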

Checking the hashes here is the bit that ensures that the files haven't been corrupted. Next, we'll verify the GPG signature. This is the bit that ensures that the files we've downloaded were actually originally released by who we think they were. Do that like this:

gpg --verify nomad_0.10.5_SHA256SUMS.sig

Doing this, you may get an error message telling you that it can't verify the signature because you haven't got the public key of the signer imported into your keyring. To remedy this, look for the bit that tells you the key id (e.g. using RSA key 51852D87348FFC4C). Copy it, and then do this:

gpg --recv-keys 51852D87348FFC4C

This will download it and import the key id into your local GPG keyring. Then re-run the gpg --verify command above, and it should work.

Generation

Now that we know how to verify a signature, let's generate our own. First, put the files you want to hash into a directory and cd into it in your terminal. Then, let's generate the hash file:

# Hash files
find . -type f -not -name "*.SHA256*" -print0 | xargs -0 -I{} -P"$(nproc)" sha256sum -b "{}" >HASHES.SHA256

We use find here to locate all the files (other than the hash file itself), and then pass them to xargs, which calls sha256sum to hash the files in question. Finally, we write the hashes to HASHES.SHA256.

Next, let's generate a GPG signature for the hash file. For this, you'll need a GPG key. That's out of scope of this post really, but this tutorial looks like it will show you how to do it. Note that in order for other people to verify the GPG signature you create, you'll probably need to upload your GPG public key to a keyserver (the article I link to shows you how to do this too).

Once done, generate the signature like this:

# Sign the hashes
gpg --sign --detach-sign --armor HASHES.SHA256

Specifically, we generate a detached signature here - meaning that it's in a separate file to the file that is being signed. The --armor there wraps the signature in an ASCII-armoured format (base64-encoded data with a plain-text header and footer), so that it's not a raw binary file that might confuse the uninitiated.

Finally, let's verify the signature we just created - just in case:

# Verify the signature, and check we used the right key
gpg --verify HASHES.SHA256.asc

If all's good, it should tell you that the signature is ok. If you've got multiple keys, ensure that you signed it with the correct key here. GPG will sign things with the key you have marked as the default key.
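If you do need to sign with a key other than your default one, gpg's --local-user option lets you pick it explicitly (replace the placeholder with the id of the key you want to use):

gpg --sign --detach-sign --armor --local-user YOUR_KEY_ID_HERE HASHES.SHA256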

Edit 27th October 2021: Fixed issues with HASHES.SHA256 generation with find

Found this interesting? Having trouble? Want to say hi? Comment below!

I've got an apt repository, and you can too

Hey there!

In this post, I want to talk about my apt repository. I've had it for a while, but since it's been working well for me I thought I'd announce it to the wider world here.

For those not in the know, an apt repository is a repository of software in a particular format that the apt package manager (found on Debian-based distributions such as Ubuntu) uses to keep software on a machine up-to-date.

The apt package manager queries all repositories it has configured to find out what versions of which packages they have available, and then compares this with those locally installed. Any packages out of date then get upgraded, usually after prompting you to install the updates.

Linux distributions based on Debian come with a large repository of software, but it doesn't have everything. For this reason, extra repositories are often used to deliver updates to software automatically from third parties.
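Such extra repositories are normally configured with a 1-line entry in /etc/apt/sources.list (or a file in /etc/apt/sources.list.d/). Purely as an illustration - the domain here is made up - an entry looks something like this:

deb https://apt.example.com/ubuntu focal main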

In my case, I've been finding increasingly that I want to deliver updates for software that isn't packaged for installation with apt to a number of different machines. Every time I got around to installing an update, it felt like it was time to install another - so naturally I got frustrated enough that I decided to automate the problem away by scripting my own apt repository!

My apt repository can be found here: https://starbeamrainbowlabs.com/

It comes in 2 parts. Firstly, there's the repository itself - which is managed by a script that's based on my lantern build engine. It's this I'll be talking about in this post.

Secondly, I have a number of as-yet ad-hoc custom Laminar job scripts for automatically downloading various software projects from GitHub, such that all I have to do is run laminarc queue apt-softwarename and it'll automatically package the latest version and upload it to the repository itself, which has a cron job set to fold in all of the new packages at 2am every night. The specifics of this are best explained in another post.

Currently this process requires me to login and run the laminarc command manually, but I intend to automate this too in the future (I'm currently waiting for a new release of beehive to fix a nasty bug for this).

Anyway, currently I have the following software packaged in my repository:

  • Gossa - A simple HTTP file browser
  • The Tiled Map Editor - An amazing 2D tile-based graphical map editor. You should sponsor the developer via any of the means on the Tiled Map Editor's website before using my apt package.
  • tldr-missing-pages - A small utility script for finding tldr-pages to write
  • webhook - A flexible webhook system that calls binaries and shell scripts when a HTTP call is made
    • I've also got a pleaserun-based service file generator packaged for this too in the webhook-service package

Of course, more will be coming as and when I discover and start using cool software.

The repository itself is driven by a set of scripts. These scripts were inspired by a Stack Overflow post that I have since lost, but I've made a number of usability improvements and rewritten it to use my lantern build engine as I described above. I call this improved script aptosaurus, because it sounds cool.

To use it, first clone the repository:

git clone https://git.starbeamrainbowlabs.com/sbrl/aptosaurus.git

Then, create a new GPG key to sign your packages with:

gpg --full-generate-key

Next, we need to export the new keypair to disk so that we can use it in scripts. Do that like this:

# Identify the key's ID in the list this prints out
gpg --list-secret-keys
# Export the secret key
gpg --export-secret-keys --armor INSERT_KEY_ID_HERE >secret.gpg
chmod 0600 secret.gpg # Don't forget to lock down the permissions
# Export the public key
gpg --export --armor INSERT_KEY_ID_HERE >public.gpg

Then, run the setup script:

./aptosaurus.sh setup

It should warn you if anything's amiss.

With the setup complete, you can now put your .deb packages in the sources subdirectory. Once done, run the update command to fold them into the repository:

./aptosaurus.sh update

Now you've got your own repository! Your next step is to set up a static web server to serve the repo subdirectory (which contains the repo itself) to the world! Personally, I use Nginx with the following config:

server {
    listen  80;
    listen  [::]:80;
    listen  443 ssl http2;
    listen  [::]:443 ssl http2;

    server_name apt.starbeamrainbowlabs.com;
    ssl_certificate     /etc/letsencrypt/live/$server_name/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$server_name/privkey.pem;

    #add_header strict-transport-security "max-age=31536000;";
    add_header x-xss-protection "1; mode=block";
    add_header x-frame-options  "sameorigin";
    add_header link '<https://starbeamrainbowlabs.com$request_uri>; rel="canonical"';

    index   index.html;
    root    /srv/aptosaurus/repo;

    include /etc/nginx/snippets/letsencrypt.conf;

    autoindex   off;
    fancyindex  on;
    fancyindex_exact_size   off;
    fancyindex_header   header.html;

    #location ~ /.well-known {
    #   root    /srv/letsencrypt;
    #}

}

This requires the fancyindex module for Nginx, which can be installed with sudo apt install libnginx-mod-http-fancyindex on Ubuntu-based systems.

To add your new apt repository to a machine, simply follow the instructions for my repository, replacing the domain name and the key ids with yours.
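For illustration, adding a simple flat repository like this one boils down to importing the signing key and adding a 1-line entry. The URLs and filenames below are placeholders only, so substitute your own:

wget -qO- https://apt.example.com/public.gpg | sudo apt-key add -
echo "deb https://apt.example.com/ ./" | sudo tee /etc/apt/sources.list.d/example.list
sudo apt update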

Hopefully this release announcement-turned-guide has been either interesting, helpful, or both! Do let me know in the comments if you encounter any issues. If there's enough interest I'll migrate the code to GitHub from my personal Git server if people want to make contributions (express said interest in the comments below).

It's worth noting that this is only a very simple apt repository. Larger apt repositories are sectioned off into multiple categories by distribution and release status (e.g. the Ubuntu repositories have xenial, bionic, eoan, etc. for the version of Ubuntu, and main, universe, multiverse, restricted, etc. for the different categories of software).

If you setup your own simple apt repository using this guide, I'd love it if you could let me know with a comment below too.

Own Your Code Series List

Hey there! It's time for another series list. This time it's for my Own Your Code series, where I take a look into Gitea and Laminar CI.

Following this series, I plan to also post about my apt repository, which is hosting a growing list of software - including the tiled map editor (support them with a donation if you can), gossa (a minimalist file browser interface), and webhook - if you find any issues, you can always get in touch.

Anyway, here's the full list of posts in the Own Your Code series:

In the unlikely event I post another entry in this series, I'll come back and update this list. Most likely though I'll be posting related things standalone, rather than part of this series - so subscribe for updates with your favourite method if you'd like to stay up-to-date with my latest blog posts (Atom/RSS, Email, Twitter, Reddit, and Facebook are all supported - just ask if there's something missing).

Exporting an SQLite3 database to a directory of CSV files

Recently I was working with a dataset I acquired for my PhD, and to pre-process said dataset into something more sensible I imported it into an SQLite3 database. Once I was finished processing it, I then needed to export it again into regular CSV files so that I could do other things, such as plot it with GNUPlot, or import it into InfluxDB (more on InfluxDB in a later post).

With the help of Stack Overflow and the SQLite3 man page, this didn't prove to be too difficult. To export a single SQLite3 table to a CSV file, you do this:

sqlite3 -bail -header -csv "bobsrockets.sqlite3" "SELECT * FROM 'table_name';" >"path/to/output_file.csv";

This is great for a single table, but what if we want to export all the tables? Well, we can iterate over all the tables in an SQLite3 database like so:

while read table_name; do
    echo "Exporting ${table_name}";

    # Do stuff
done < <(sqlite3 "bobsrockets.sqlite3" ".tables");

If we combine this with the previous snippet, we can export all the tables like so:

while read table_name; do
    log "Exporting ${table_name}";

    sqlite3 -bail -header -csv "bobsrockets.sqlite3" "SELECT * FROM '${table_name}';" >"${table_name}.csv"; 
done < <(sqlite3 "bobsrockets.sqlite3" ".tables");

Cool! We can make it even better with some simple improvements though:

  1. It's a pain to have to edit the script every time we want to change the database we're exporting
  2. It would be nice to be able to specify the output directory without editing the script too

Satisfying both of these points isn't particularly challenging. 10 minutes of fiddling got me this final completed script:

#!/usr/bin/env bash
set -e; # Don't allow errors

show_usage() {
    echo -e "Usage:";
    echo -e "\t./sqlite2csv.sh {db_filename} {output_dir}";
}

log() {
    echo -e "[ $(date +"%F %T") ] ${@}";
}

###############################################################################

db_filename="${1}";
output_dir="${2}";

if [ -z "${db_filename}" ]; then
    echo "Error: No database filename specified.";
    show_usage; exit 1;
fi
if [ -z "${output_dir}" ]; then
    echo "Error: No output directory specified.";
    show_usage; exit 1;
fi

if [ ! -d "${output_dir}" ]; then
    mkdir -p "${output_dir}"; 
fi

log "Output directory is ${output_dir}";

while read table_name; do
    log "Exporting ${table_name}";

    sqlite3 -bail -header -csv "${db_filename}" "SELECT * FROM '${table_name}';" >"${output_dir}/${table_name}.csv";    
done < <(sqlite3 "${db_filename}" ".tables");

log "Complete!";

Found this useful? Comment below!

Own your code, part 6: The Lantern Build Engine

It's time again for another installment in the own your code series! In the last post, we looked at the git post-receive hook that calls the main git-repo Laminar CI task, which is the core of our Continuous Integration system (which we discussed in the post before that). You can see all the posts in the series so far here.

In this post we're going to travel in the other direction, and look at the build script / task automation engine that I've developed that goes hand-in-hand with the Laminar CI system - though it can and does stand on its own too.

Introducing the Lantern Build Engine! Finally, after far too long I'm going to formally post here about it.

Originally developed out of a need to automate the boring and repetitive parts of building and packing my assessed coursework (ACWs) at University, the lantern build engine is my personal task automation system. It's written in 100% Bash, and allows tasks to be easily defined like so:

task_dostuff() {
    task_begin "Doing a thing";
    do_work;
    task_end "$?" "Oops, do_work failed!";

    task_begin "Doing another thing";
    do_hard_work;
    task_end "$?" "Yikes! do_hard_work failed.";
}

When the above task is run, Lantern will automatically detect the dostuff task, since it's a bash function that's prefixed with task_. The task_begin and task_end calls there are 2 other bash functions, which generate pretty output to inform the user that a task is starting or ending. The $? there grabs the exit code from the last command - and if it fails, task_end will automatically display the provided error message.

Tasks are defined in a build.sh file, for which Lantern provides a template. Currently, the template file contains some additional logic such as the help text output if no tasks were specified - which is left-over from the time when Lantern was small enough to fit in the same file as the build tasks themselves.

I'm in the process of adding support for all the logic in the template file, so that I can cut down on the extra boilerplate there even further. After defining your tasks in a copy of the template build file, it's really easy to call them:

./build dostuff

Of course, don't forget to mark the copy of the template file executable with chmod +x ./build.

The above initial example only scratches the surface of what Lantern can do though. It can easily check to see if a given command is installed with check_command:

task_go-to-the-moon() {
    task_begin "Checking requirements";
    check_command git true;
    check_command node true;
    check_command npm true;
    task_end 0;
}

If any of the check_command calls fail, then an error message is printed and the build terminated.

Work that needs doing in Lantern can be expressed with 3 levels of logical separation: stages, tasks, and subtasks:

task_build-rocket() {
    stage_begin "Preparation";

    task_begin "Gathering resources";
    gather_resources;
    task_end "$?" "Failed to gather resources";

    task_begin "Hiring engineers";
    hire_engineers;
    task_end "$?" "Failed to hire engineers";

    stage_end "$?";

    stage_begin "Building Rocket";
    build_rocket --size big --boosters 99;
    stage_end "$?";

    stage_begin "Launching rocket";
    task_begin "Preflight checks";
    subtask_begin "Checking fuel";
    check_fuel --level full;
    subtask_end "$?" "Error: The fuel tank isn't full!";
    subtask_begin "Loading snacks";
    load_items --type snacks --from warehouse;
    subtask_end "$?" "Error: Failed to load snacks!";
    task_end "$?";

    task_begin "Launching!";
    launch --countdown 10;
    task_end "$?";

    stage_end "$?";
}

Come to think about it, I should probably rename the function prefix from task to job. Stages, tasks, and subtasks each look different in the output - so it's down to personal preference as to which one you use and where. Subtasks in particular are best for commands that don't return any output.

Popular services such as Travis CI have a thing where the build transcript displays the versions of programs relevant to the build, like this:

$ uname -a
Linux MachineName 5.3.0-19-generic #20-Ubuntu SMP Fri Oct 18 09:04:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ node --version
v13.0.1
$ npm --version
6.12.1

Lantern provides support for this with the execute command. Prefixing commands with execute will cause them to be printed before being executed, just like the above:

task_prepare() {
    task_begin "Displaying environment details";
    execute uname -a;
    execute node --version;
    execute npm --version;
    task_end "$?";
}

As build tasks get more complicated, it's logical to split them up into multiple tasks that can be called independently and conditionally. Lantern makes this easy too:

task_build() {
    task_begin "Building";
    # Do build stuff here
    task_end "$?";
}
task_deploy() {
    task_begin "Deploying";
    # Do deploy stuff here
    task_end "$?";
}

task_all() {
    tasks_run build deploy;
}

The all task in the above runs both the build and deploy tasks. In fact, the template build script uses tasks_run at the very bottom to treat every argument passed to it as a task name, leading to the behaviour described above.
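If you're curious, I believe the line in question at the bottom of the template looks essentially like this - every command-line argument gets passed straight through as a task name:

tasks_run "$@";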

Lantern also provides an array of other useful functions to make expressing build sequences easy, concise, and readable - from custom colours to testing environment variables to see if they exist. It's all fully documented in the README of the project too.

As described 2 posts ago, the git-repo Laminar CI task (once it's spawned a hologram of itself) currently checks for the existence of a build or build.sh executable script in the root of the repository it is running on, and passes ci as the first and only argument.

This provides easy integration with Lantern, since Lantern build scripts can be called anything we like, and with a tasks_run call at the bottom as in the template file, we can simply define a ci Lantern task function that runs all our continuous integration jobs that we need to execute.
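For instance, such a ci task might look something like this (build and test here being hypothetical tasks that your own project would define):

task_ci() {
    tasks_run build test;
}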

If you're interested in trying out Lantern for yourself, check out the repository!

https://gitlab.com/sbrl/lantern-build-engine#lantern-build-engine

Personally, I use it for everything from CI to rapid development environment setup.

This concludes my (epic) series about my git hosting and continuous integration. We've looked at git hosting, and taken a deep dive into integrating it into a continuous integration system, which we've augmented with a bunch of scripts of our own design. The system we've ended up with, while a lot of work to set up, is extremely flexible, allowing for modifications at will (for example, I have a webhook script that's similar to the git post-receive hook, but is designed to receive notifications from GitHub instead of Gitea and queue the git-repo job just the same).

I'll post a series list post soon. After that, I might blog about my personal apt repository that I've setup, which is somewhat related to this.

Own your code, part 5: git post-receive hook

In the last post, I took a deep dive into the master git-repo job that powers my entire build system. In the next few posts, I'm going to take a look at the bits around the edges that interact with this Laminar job - starting with the git post-receive hook in this post.

When you push commits to a git repository, the remote server does a bunch of work to integrate your changes into the remote master copy of the repository. At various points in the process, git allows you to run scripts to augment your repository, and potentially alter the way git ultimately processes the push. You can send content back to the pushing user too - which is how you get those messages on the command-line occasionally when you push to a GitHub repository.

In our case, we want to queue a new Laminar CI job when new commits are pushed to a private Gitea server (like mine, for instance). Doing this isn't particularly difficult, but we do need to collect a bunch of information about the environment we're running in so that we can correctly inform the git-repo task where it needs to pull the repository from, who pushed the commits, and which commits need testing.

In addition, we want to write 1 universal git post-receive hook script that will work everywhere - regardless of the server the repository is hosted on. Of course, on GitHub you can't run a script directly, but if I ever come into contact with another supporting git server, I want to minimise the amount of extra work I've got to do to hook it up.

Let's jump into the script:

#!/usr/bin/env bash
if [ "${GIT_HOST}" == "" ]; then
    GIT_HOST="git.starbeamrainbowlabs.com";
fi

Fairly standard stuff. Here we set a shebang and specify the GIT_HOST variable if it's not set already. This is mainly just a placeholder for the future, as explained above.

Next, we determine the git repository's URL, because I'm not sure that Gitea (my git server, for which this script is intended) actually tells you this directly in a git post-receive hook. The post-receive hook script does actually support HTTPS, but this support isn't currently used and I'm unsure how the git-repo Laminar CI job would handle an HTTPS URL:

# The url of the repository in question. SSH is recommended, as then you can use a deploy key.
# SSH:
GIT_REPO_URL="git@${GIT_HOST}:${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";
# HTTPS:
# git_repo_url="https://git.starbeamrainbowlabs.com/${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";

With the repository url determined, next on the list is the identity of the pusher. At this stage it's a simple matter of grabbing the value of 1 variable and putting it in another as we're only supporting Gitea at the moment, but in the future we may have some logic here to intelligently determine this value.

GIT_AUTHOR="${GITEA_PUSHER_NAME}";

With the basics taken care of, we can start getting to the more interesting bits. Before we do that though, we should define a few common settings:

###### Internal Settings ######

version="0.2";

# The job name to queue.
job_name="git-repo";

###############################

job_name refers to the name of the Laminar CI job that we should queue to process new commits. version is a value that we can increment should we iterate on this script in the future, so that we can then tell which repositories have the new version of the post-receive hook and which ones don't.

Next, we need to calculate the virtual name of the repository. This is used by the git-repo job to generate a 'hologram' copy of itself that acts differently, as explained in the previous post. This is done through a series of Bash transformations on the repository URL:

# 1. Make lowercase
repo_name_auto="${GIT_REPO_URL,,}";
# 2. Trim git@ & .git from url
repo_name_auto="${repo_name_auto/git@}";
repo_name_auto="${repo_name_auto/.git}";
# 3. Replace unknown characters to make it 'safe'
repo_name_auto="$(echo -n "${repo_name_auto}" | tr -c '[:lower:]' '-')";

The result is quite like 'slugification'. For example, this URL:

git@git.starbeamrainbowlabs.com:sbrl/Linux-101.git

...will get turned into this:

git-starbeamrainbowlabs-com-sbrl-linux----

I actually forgot to allow digits in step #3, but it's a bit awkward to change it at this point :P Maybe at some later time when I'm feeling bored I'll update it and fiddle with Laminar's data structures on disk to move all the affected repositories over to the new naming scheme.

Now that we've got everything in place, we can start to process the commits that the user has pushed. The documentation on how this is done in a post-receive hook is a bit sparse, so it took some experimenting before I had it right. Turns out that the information we need is provided on the standard input, so a while-read loop is needed to process it:

while read next_line
do
    # .....
done

For each line on the standard input, 3 variables are provided:

  • The old commit reference (i.e. the commit before the one that was pushed)
  • The new commit reference (i.e. the one that was pushed)
  • The name of the reference (usually the branch that the commit being pushed is on)

Commits on multiple branches can be pushed at once, so the name of the branch each commit is being pushed to is kind of important.

Anyway, I pull these into variables like so:

oldref="$(echo "${next_line}" | cut -d' ' -f1)";
newref="$(echo "${next_line}" | cut -d' ' -f2)";
refname="$(echo "${next_line}" | cut -d' ' -f3)";

I think there's some clever Bash trick I've used elsewhere that allows you to pull them all in at once in a single line, but I believe I implemented this before I discovered that trick.
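For the curious, I believe the trick is simply to give read multiple variable names - it then splits each line on whitespace for you, something like this:

while read -r oldref newref refname; do
    # oldref, newref and refname are now populated directly
    # ...
done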

With that all in place, we can now (finally) queue the Laminar CI job. This is quite a monster, as it needs to pass a considerable number of variables to the git-repo job itself:

LAMINAR_HOST="127.0.0.1:3100" LAMINAR_REASON="Push from ${GIT_AUTHOR} to ${GIT_REPO_URL}" laminarc queue "${job_name}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${newref}" GIT_REF_NAME="${refname}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_NAME="${repo_name_auto}";

Laminar CI's management socket listens on the abstract unix socket laminar (IIRC). Since you can't yet forward abstract sockets over SSH with OpenSSH, I opt to use a TCP socket instead. To this end, the LAMINAR_HOST prefix there is needed to tell laminarc where to find the management socket that it can use to talk to the Laminar daemon, laminard - since Gitea and Laminar CI run on different servers.

The LAMINAR_REASON there is the message that is displayed in the Laminar CI web interface. Said interface is read-only (by design), but very useful for inspecting what's going on. Messages like this add context as to why a given job was triggered.

Lastly, we should send a message to the pushing user, to let them know that a job has been queued. This can be done with a simple echo, as the standard output is sent back to the client:

echo "[Laminar git hook ${version}] Queued Laminar CI build ("${job_name}" -> ${repo_name_auto}).";

Note that we display the version number of the post-receive hook here. This is how I tell whether I need to go into the Gitea settings to update the hook or not.

With that, the post-receive hook script is complete. It takes a bunch of information lying around, transforms it into a common universal format, and then passes the information on to my continuous integration system - which is then responsible for building the code itself.

Here's the completed script:

#!/usr/bin/env bash

##############################
########## Settings ##########
##############################

# Useful environment variables (gitea):
#   GITEA_REPO_NAME         Repository name
#   GITEA_REPO_USER_NAME    Repo owner username
#   GITEA_PUSHER_NAME       The username that pushed the commits

#   GIT_HOST                Domain name the repo is hosted on. Default: git.starbeamrainbowlabs.com

if [ "${GIT_HOST}" == "" ]; then
    GIT_HOST="git.starbeamrainbowlabs.com";
fi

# The url of the repository in question. SSH is recommended, as then you can use a deploy key.
# SSH:
GIT_REPO_URL="git@${GIT_HOST}:${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";
# HTTPS:
# git_repo_url="https://git.starbeamrainbowlabs.com/${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";

# The user that pushed the commits
GIT_AUTHOR="${GITEA_PUSHER_NAME}";

##############################

###### Internal Settings ######

version="0.2";

# The job name to queue.
job_name="git-repo";

###############################

# 1. Make lowercase
repo_name_auto="${GIT_REPO_URL,,}";
# 2. Trim git@ & .git from url
repo_name_auto="${repo_name_auto/git@}";
repo_name_auto="${repo_name_auto/.git}";
# 3. Replace unknown characters to make it 'safe'
repo_name_auto="$(echo -n "${repo_name_auto}" | tr -c '[:lower:]' '-')";

while read next_line
do
    oldref="$(echo "${next_line}" | cut -d' ' -f1)";
    newref="$(echo "${next_line}" | cut -d' ' -f2)";
    refname="$(echo "${next_line}" | cut -d' ' -f3)";
    # echo "********";
    # echo "oldref: ${oldref}";
    # echo "newref: ${newref}";
    # echo "refname: ${refname}";
    # echo "********";

    LAMINAR_HOST="127.0.0.1:3100" LAMINAR_REASON="Push from ${GIT_AUTHOR} to ${GIT_REPO_URL}" laminarc queue "${job_name}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${newref}" GIT_REF_NAME="${refname}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_NAME="${repo_name_auto}";
    # GIT_REF_NAME and GIT_AUTHOR are used for the LAMINAR_REASON when the git-repo task recursively calls itself
    # GIT_REPO_NAME is used to auto-name hologram copies of the git-repo.run task when recursing
    echo "[Laminar git hook ${version}] Queued Laminar CI build ("${job_name}" -> ${repo_name_auto}).";
done

#cat -;
# YAY what we're after is on the first line of stdin! :D
# The format appears to be documented here: https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks#_server_side_hooks
# Line format:
# oldref newref refname
# There may be multiple lines that all need handling.

In the next post, I want to finally introduce my very own home-brew build engine: lantern. I've used it in over half a dozen different projects by now, so it's high time I talked about it a bit more formally.

Found this interesting? Spotted a mistake? Got a suggestion to improve it? Comment below!

Next Gen Search, Part 2: Pushing the limits

In the last part, we looked at how I built a new backend for storing inverted indexes for Pepperminty Wiki, which allows for partial index deserialisation and other nice features that boost performance considerably.

Since the last post, I've completed work on the new search system - though there are a few bits around the edges that I still want to touch up and do some more work on.

In this post though, I want to talk about how I generated test data to give my full-text search engine something to chew on. I've done this before for my markov chain program I wrote a while back, and that was so much fun I did it again for my search engine here.

After scratching my head for a bit to think of a data source I could use, I came up with the perfect plan. Ages ago I downloaded a Wikipedia dump - just the content pages in Wikitext markup. Why not use that?

As it turns out, it was a rather good idea. Some processing of said dump was required to transform it into a format that Pepperminty Wiki can understand, though. Pepperminty Wiki stores pages on disk as flat text files in markdown, and indexes them in pageindex.json. If pageindex.json doesn't exist, then Pepperminty Wiki rebuilds it automagically by looking for content pages on disk.

This makes it easy to import batches of new pages into Pepperminty Wiki, so all we need to do is extract the wiki text, convert it to markdown, and import! This ended up requiring a number of separate steps though, so let's take it 1 step at a time.

First, we need a Wikipedia database dump in the XML format. These are available from dumps.wikimedia.org. There are many different ones available, but I suggest grabbing one that has a filename similar to enwiki-20180201-pages-articles.xml - i.e. just content pages - no revision history, user pages, or additional extras. I think the most recent one as of the time of posting is downloadable here - though I'd warn you that it's 15.3GiB in size! You can see a list of available dump dates for the English Wikipedia here.

Now that we've got our dump, let's extract the pages from it. This is nice and easy to do with wikiextractor on GitHub:

nice -n20 wikiextractor enwiki-20180201-pages-articles.xml --no_templates --html --keep_tables --lists --links --sections --json --output wikipages --compress --bytes 25M >progress.log 2>&1 

This will parse the dump and output a number of compressed files to the wikipages directory. These will have 1 JSON object per line, each containing information about a single page on Wikipedia - with page content pre-converted to HTML for us. The next step is to extract the page content and save it to a file with the correct name. This ended up being somewhat complicated, so I wrote a quick Node.js script to do the job:

#!/usr/bin/env node

const readline = require("readline");
const fs = require("fs");


if(!fs.existsSync("pages"))
        fs.mkdirSync("pages", { mode: 0o755 });

// From https://stackoverflow.com/a/44195856/1460422
function html_unentities(encodedString) {
        var translate_re = /&(nbsp|amp|quot|lt|gt);/g;
        var translate = {
                "nbsp":" ",
                "amp" : "&",
                "quot": "\"",
                "lt"  : "<",
                "gt"  : ">"
        };
        return encodedString.replace(translate_re, function(match, entity) {
                return translate[entity];
        }).replace(/&#(\d+);/gi, function(match, numStr) {
                var num = parseInt(numStr, 10);
                return String.fromCharCode(num);
        });
}

const interface = readline.createInterface({
        input: process.stdin,
        //output: process.stdout
});

interface.on("line", (text) => {
        const obj = JSON.parse(text);

        // Replace every forward slash in the title, so filenames can't escape the pages directory
        fs.writeFileSync(`pages/${obj.title.replace(/\//g, "-")}.html`, html_unentities(obj.text));
        console.log(`${obj.id}\t${obj.title}`);
});

This basically takes the stream of JSON objects on the standard input, parses them, and saves the relevant content to disk. We can invoke it like so:

bzcat path/to/*.bz2 | ./parse.js

Don't forget to chmod +x parse.js if you get an error here. The other important thing about the above script is that we have to unescape the HTML entities (e.g. &gt;), because otherwise we'll have issues later with the HTML conversion and page names will look odd. This is done by the html_unentities() function in the above script.

This should result in a directory containing a large number of files - 1 file per content page. This is much better, but we're still not quite there yet. Wikipedia uses wiki markup (which we converted to HTML with wikiextractor) and Pepperminty Wiki uses Markdown - the 2 of which are, despite all their similarities, inherently incompatible. Thankfully, pandoc is capable of converting from HTML to markdown.

Pandoc is great at this kind of thing - it uses an intermediate representation and allows you to convert almost any type of textual document format to any other format. Markdown to PDF, EPUB to plain text, ..... and HTML to markdown (just to name a few). It actually looks like it shares a number of features with traditional compilers like GCC.

Anyway, let's use it to convert our folder full of wikitext files to a folder full of markdown:

mkdir -p pages_md;
find pages/ -type f -name "*.html" -print0 | nice -n20 xargs -P4 -0 -n1 -I{} sh -c 'filename="{}"; title="${filename##*/}"; title="${title%.*}"; pandoc --from "html"  --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";';

_(See this on explainshell.com - doesn't include the nice -n20 due to a bug on their end)_

This looks complicated, but it really isn't. Let's break it down a bit:

find pages/ -type f -name "*.html" -print0

This finds all the HTML files that we want to convert to Markdown, and delimits the output with a NUL byte - i.e. 0x0. This makes the next step easier:

... | nice -n20 xargs -P4 -0 -n1 -I{} sh -c '....'

This pipes the file list into xargs, which will execute up to 4 commands at a time. xargs normally executes a command for each line of input it receives - in our case though, we're delimiting the input with the NUL byte 0x0 instead. We also explicitly specify that we want 1 command per item of input, as xargs otherwise tries to optimise and do command file1 file2 file3 instead.

The sh -c bit is starting a subshell, in which we execute a small wrapper script that then calls pandoc. This is of course inefficient, but I couldn't find any way around spawning a subshell in this instance.

filename="{}";
title="${filename##*/}";
title="${title%.*}";
pandoc --from "html" --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md";
echo "${title}";

I've broken the sh -c subshell script down into multiple lines for readability. Simply put, it extracts the page title from the filename, converts the HTML to Markdown, and saves it to a new file in a different directory with the .md extension replacing the original .html extension.

When you put all these components together, you get a script that converts a folder full of HTML files to Markdown. Just like with the markov chains extraction I mentioned at the beginning of this post, Bash and shell scripting really is all about lego bricks. This is due in part to the Unix philosophy:

Make each program do one thing well.

There is more to it, but this is the most important point to remember. Many of the core utilities you'll find on the terminal follow this way of thinking.

There's 1 last thing we need to take care of before we have them in the right format though - we need to convert the [display text](page name) markdown-format links back into the Wikipedia [[internal link]] format that Pepperminty Wiki also uses.

Thankfully, another command-line tool I know of called repren is well-suited to this:

repren --from '\[([^\]]+)\]\(([^):]+)\)' --to '[[\1]]' pages_md/*.md

It took some fiddling, but I got all the escaping figured out and the above converts back into the [[internal link]] format well enough.

Now that we've got our folder full of markdown files, we need to extract a random portion of them to act as a test for Pepperminty Wiki - as the whole lot might be a bit much for it to handle (though if Pepperminty Wiki was capable of handling it all eventually that'd be awesome :D). Let's try 500 pages to start:

find path/to/wikipages/ -type f -name "*.md" -print0 | shuf --zero-terminated | head -n500 --zero-terminated | xargs -0 -n1 -I{} cp "{}" .

(See this on explainshell.com)

This is another lego-brick style command. Let's break it down too:

find path/to/wikipages/ -type f -name "*.md" -print0

This lists all the .md files in a directory, delimiting them with a NUL character, as before. It's better to do this than use ls, as find is explicitly designed to be machine-readable.

.... | shuf --zero-terminated

The shuf command randomly shuffles the input lines. In this case, we're telling it that the input is delimited by the NUL byte.

.... | head -n500 --zero-terminated

Similar deal here. head takes the top N lines of input, and discards the rest.

.... | xargs -0 -n1 -I{} cp "{}" .

Finally, xargs calls cp to copy the selected files to the current directory - which is, in this case, the root directory of my test Pepperminty Wiki instance.

Since I'm curious, let's now find out roughly how many words we're dealing with here:

cat data_test/*.md | wc --words
1593190

1.5 million words! That's a lot. I wonder how quickly we can search that?

A screenshot of the Pepperminty Wiki search results on the test wiki for the word food, showing the new dark theme coming soon!

24.8ms? Awesome! That's so much better than before. If you're wondering about the new coat of paint in the screenshot - Pepperminty Wiki is getting a dark theme, thanks to prefers-color-scheme :D

I wonder what happens if we push it to 2K pages?

Another screenshot, the same as before

This time we get ~120ms for 5.9M total words - wow! I wasn't expecting it to perform so well. At this scale, rebuilding the entire index is particularly costly - so if I was to push it even further it would make sense to implement an incremental approach that spreads the work over multiple requests, assuming I can't squeeze any more performance out the system as-is.

The last thing I want to do here is make a rough estimate of the time complexity of the search system as-is, given the data we have so far. This isn't particularly difficult to do.

Given the results above, we can calculate that at 1.5M total words, an increase of ~60K total words results in an increase of 1ms of execution time. At 5.9M words, it's only ~49K words / ms of execution time - a drop of ~11K words / ms of execution time.

From this, we can speculate that for every million words total added to a wiki, we can expect a ~2.5K words / ms of execution time drop - not bad! We'd need more data points to make any reasonable guess as to the Big-O complexity function that it conforms to. My guess would be something like $O(xN^2)$, where x is a constant between ~0.2 and 2.

Maybe at some point I'll go to the trouble of running enough tests to calculate it, but with all the variables that affect the execution time (number of pages, distribution of words across pages, etc.), I'm not in any hurry to calculate it. If you'd like to do so, go ahead and comment below!

Next time, I'll unveil the inner working of the STAS: my new search-term analysis system.

Found this interesting? Got your own story about some cool code you've written to tell? Comment below!

Own your code, part 4: Laminar CI

In the last post, I talked at a high level about the infrastructure behind my continuous integration and deployment system. In this post, I'm going to dive into the details of the Laminar CI job that is the engine driving the whole system.

Laminar CI is based on a concept of jobs. The docs explain it quite well, but in short each job is a file in the jobs folder with the file extension run and a shebang. In my case, I'm using Bash - and I'll continue to do so throughout this series.

Unlike most other setups, the Laminar CI job that we'll be writing here won't actually do any of the actual CI tasks itself - it will simply act as a proxy script to setup & manage the execution of the actual build system - which, in this case, will be the lantern build engine, an engine I wrote to aid me with automating repetitive tasks when working on my University ACWs (Assessed CourseWork).

Every job has its own workspace, which acts as a common area to store and cache various files across all the runs of that job. Each run of a job also has its very own private area - which will be useful later on.

The first step in this proxy script is to extract the parameters of the run that we're supposed to be doing. For me, I store this in a number of environment variables, which are set when queuing the job run from the git post-receive (or web) hook:

  • GIT_REPO_NAME (e.g. git-starbeamrainbowlabs-com-sbrl-rhinoreminds) - The safe name of the repository that we're running against, with potentially troublesome characters removed.
  • GIT_REF_NAME (e.g. refs/heads/master) - Basically the branch that we're working on. Useful for logging purposes.
  • GIT_REPO_URL (e.g. git@git.starbeamrainbowlabs.com:sbrl/rhinoreminds.git) - The URL of the repository that we're running against.
  • GIT_COMMIT_REF (e.g. e23b2e0....) - The exact commit to check out and build.
  • GIT_AUTHOR - The friendly name of the author that pushed the commit. Useful for logging purposes.

Before we do anything else, we need to make sure that these variables are defined:

set -e; # Don't allow errors

# Check that all the right variables are present
if [ -z "${GIT_REPO_NAME}" ]; then echo -e "Error: The environment variable GIT_REPO_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REF_NAME}" ]; then echo -e "Error: The environment variable GIT_REF_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REPO_URL}" ]; then echo -e "Error: The environment variable GIT_REPO_URL isn't set." >&2; exit 1; fi
if [ -z "${GIT_COMMIT_REF}" ]; then echo -e "Error: The environment variable GIT_COMMIT_REF isn't set." >&2; exit 1; fi
if [ -z "${GIT_AUTHOR}" ]; then echo -e "Error: The environment variable GIT_AUTHOR isn't set." >&2; exit 1; fi

There are a bunch of other variables that I'm omitting here, since they are dynamically determined from the build variables above. I extract many of these additional variables using regular expressions. For example:

GIT_REF_TYPE="$(regex_match "${GIT_REF_NAME}" 'refs/([a-z]+)')";

GIT_REF_TYPE is the bit after the refs/ and before the actual branch or tag name. It basically tells us whether we're building against a branch or a tag. That regex_match function is a utility function that I found in the pure bash bible - which is an excellent resource on various tips and tricks to do common tasks without spawning subprocesses - and therefore obtaining superior performance and lower resource usage. Here it is:

# @source https://github.com/dylanaraps/pure-bash-bible#use-regex-on-a-string
# Usage: regex "string" "regex"
regex_match() {
    [[ $1 =~ $2 ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}
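As a quick illustration, calling it like this prints heads - the contents of the first capture group:

regex_match "refs/heads/master" 'refs/([a-z]+)'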

Very cool. For completeness, here are the remainder of the secondary environment variables. Many of them aren't actually used directly - instead they are used indirectly by other scripts and lantern build engine tasks that we call from the main Laminar CI job.

if [[ "${GIT_REF_TYPE}" == "tags" ]]; then
    GIT_TAG_NAME="$(regex_match "${GIT_REF_NAME}" 'refs/tags/(.*)$')";
fi

# NOTE: These only work with SSH urls.
GIT_REPO_OWNER="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=:)[^/]+(?=/)')";
GIT_REPO_NAME_SHORT="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=/)[^/]+(?=\.git$)')";
GIT_SERVER_DOMAIN="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=@)[^/]+(?=:)')";

GIT_TAG_NAME is the name of the tag that we're building against - but only if we've been passed a tag as the GIT_REF_TYPE.

The GIT_SERVER_DOMAIN is important for sending the status reports to the right place. Gitea supports a status API that we can hook into to report on how we're doing. You can see it in action here on my RhinoReminds repository. Those green ticks are the build status that was reported by the Laminar CI job that we're writing in this post. Unfortunately you won't be able to click on it to see the actual build output, as that is currently protected behind a username and password, since the Laminar CI web interface exposes all the git projects I've currently got set up on it - including a number of private ones that I can't share.

Anyway, with all our environment variables in order, it's time to do something with them. Before we do though, we should tell Gitea that we're starting the build process:

send-status-gitea "${GIT_COMMIT_REF}" "pending" "Executing build....";

I haven't yet implemented support for sending notifications to GitHub, but it's on my todo list. In theory it's pretty easy to do - this is why I've got that GIT_SERVER_DOMAIN variable above in anticipation of this.

That send-status-gitea function there is another helper script I've written that does what you'd expect - it sends a status message to Gitea. It does this by using the environment variables we deduced earlier (which are also exported - though I didn't include that in the above code snippet) and curl.

There's still a bunch of stuff to get through in this post, so I'm going to omit the source of that script from this post for brevity. I've got no particular issue with releasing it though - if you're interested, contact me using the details on my homepage.

Next, we need to set an exit trap. This is a function that will run when the Bash process exits - regardless of whether this was because we finished our work successfully, or otherwise. This can be very useful to make absolutely sure that your script cleans up after itself. In our case, we're only going to be using it to report the build status back to Gitea:

# Runs on exit, no matter what
cleanup() {
    original_exit_code="$?";

    status="success";
    description="Build ${RUN} succeeded in $(human-duration "${SECONDS}").";
    if [[ "${original_exit_code}" -ne "0" ]]; then
        status="failed";
        description="Build failed with exit code ${original_exit_code} after $(human-duration "${SECONDS}")";
    fi

    send-status-gitea "${GIT_COMMIT_REF}" "${status}" "${description}";
}

trap cleanup EXIT;

Very cool. The RUN variable there is provided by Laminar CI, and SECONDS is a bash built-in that tells us the number of seconds that the current Bash process has been running for. human-duration is yet another helper script because I like nice readable durations in my status messages - not something unreadable like Build 3 failed in 345 seconds. It's also somewhat verbose - I adapted it from this StackExchange answer.

With that all out of the way, the next item on the list is to work out what job name we're running under. I've chosen git-repo for the name of the master 'virtual' job - that is to say the one whose entire purpose is to queue the actual job. That's pretty easy, since Laminar gives us an environment variable:

if [ "${JOB}" == "git-repo" ]; then
    # ...
fi

If the job name is git-repo, then we need to queue the actual job name. Since I don't want to have to manually alter the system every time I'm setting up a new repo on my CI system, I've automated the process with symbolic links. The main git-repo job creates a symbolic link to itself in the name of the repository that it's supposed to be running against, and then queues a new job to run itself under the different job name. This segment takes place nested in the above if statement:

# If the job file doesn't exist, create it
# We create a symlink here because this is a 'smart' job - whose
# behaviour changes dynamically based on the job name.
if [ ! -e "${LAMINAR_HOME}/cfg/jobs/${repo_job_name}.run" ]; then
    pushd "${LAMINAR_HOME}/cfg/jobs";
    ln -s "git-repo.run" "${repo_job_name}.run";
    popd
fi

Once we're sure that the symbolic link is in place, we can queue the virtual copy:

# Queue our new hologram
LAMINAR_REASON="git push by ${GIT_AUTHOR} to ${GIT_REF_NAME}" laminarc queue "${repo_job_name}" GIT_REPO_NAME="${GIT_REPO_NAME}" GIT_REF_NAME="${GIT_REF_NAME}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${GIT_COMMIT_REF}" GIT_AUTHOR="${GIT_AUTHOR}";
# If we got to here, we queued the hologram successfully
# Clear the trap, because we know that the trap for the hologram will fire
# This avoids sending a 2nd status to Gitea, linking the user to the wrong place
trap - EXIT;

exit 0;

This also ensures that if we make any changes to the main job file, all the copies will get updated automatically too. After all, they are only pointers to the actual job on disk.

Notice that we also clear the trap there before exiting - that's important, since we're queuing a copy of ourselves, we don't want to report the completed status before we've actually finished.

At this point, we can now look at what happens if the job name isn't git-repo. In this case, we need to do a few things:

  1. Clone the git repository in question to the shared workspace (if it hasn't been done already)
  2. Fetch new commits on the shared repository copy
  3. Check out the right commit
  4. Copy it to the run-specific directory
  5. Execute the build script

Additionally, we need to ensure that points #1 to #4 are not done by multiple jobs that are running at the same time, since that would probably confuse things and induce weird and undesirable behaviour. This might happen if we push multiple commits at once, for example - since the git post-receive hook (which I'll be talking about in a future post) queues 1 run per commit.

We can make sure of this by using flock. It's a thin wrapper around advisory locking provided by the Linux kernel, which lets a process claim exclusive access to a resource on disk - so long as every other process co-operates by using flock too. Since each Laminar job has its own workspace as described above, we can abuse this by taking an exclusive flock on the workspace directory. This will ensure that only 1 run per job is accessing the workspace area at once:

# Acquire a lock for this repo
exec 9<"${WORKSPACE}";
flock --exclusive 9;

echo "[${SECONDS}] Lock acquired";

Nice. Next, we need to clone the repository into the shared workspace if we haven't already:

cd "${WORKSPACE}";

# If we haven't already, clone the repository
git_directory="$(echo "${GIT_REPO_URL}" | grep -oP '(?<=/)(.+)(?=.git$)')";
if [ ! -d "${git_directory}" ]; then
    echo "[${SECONDS}] Cloning repository";
    git clone "${GIT_REPO_URL}";
fi
cd "${git_directory}";

Then, we need to fetch any new commits:

# Pull down any updates that are available
echo "[${SECONDS}] Downloading commits";
git fetch origin;

....and check out the one we're supposed to be building:

# Checkout the commit we're interested in testing
echo "[${SECONDS}] Checking out ${GIT_COMMIT_REF}";
git checkout "${GIT_COMMIT_REF}";

Then, we need to copy the repo to the run-specific directory. This is important, since the run might create new files - and we don't want multiple runs running in the same directory at the same time.


echo "[${SECONDS}] Linking source to run directory";
# Hard-link the repo content to the run directory
# This is important because then we can allow multiple runs of the same repo at the same time without using extra disk space
# -r    Recursive mode
# -a    Preserve permissions
# -l    Hardlink instead of copy
cp -ral ./ "${run_directory}";
# Don't forget the .git directory, .gitattributes, .gitmodules, .gitignore, etc.
# This is required for submodules and other functionality, but likely won't be edited - hence we can hardlink here (I think).
# NOTE: If we see weirdness with multiple runs at a time, then we'll need to do something about this.
cp -ral ./.git* "${run_directory}/.git";

I'm using hard linking here for efficiency - I'm banking on the fact that the build script I call isn't going to modify any existing files. Thinking about it, I should probably do a git reset --hard there just in case - though then I'd run into all sorts of nasty timing issues.

So far, I haven't had any issues. If I do, then I'll just disable the hard linking and copy instead. This entire script assumes a trusted environment - i.e. it trusts that the code being executed is not malicious. To this end, it's only suitable for personal projects and the like.

For it to be useful in untrusted environments, it would need to avoid hard linking and execute the build script inside a container - e.g. using LXD or Docker.
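
To illustrate, swapping the final build step for something like the below would be a start (build-env:latest is a hypothetical image containing whatever toolchain the repository needs) - though proper sandboxing is a topic in its own right:

# Rough sketch only: run the build inside a throwaway Docker container instead
# of directly on the host. 'build-env:latest' is a hypothetical image.
# Drop --network none if the build needs to download dependencies.
docker run --rm \
    --volume "${run_directory}:/workspace" \
    --workdir /workspace \
    --network none \
    build-env:latest \
    ./build ci;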

Moving on, we next need to release that flock and return to the run-specific directory:

# Go back to the job-specific run directory
cd "${run_directory}";

# Release the lock
exec 9>&- # Close file descriptor 9 and release lock

echo "[${SECONDS}] Lock released";

At this point, we're all set up to run the build script. We need to find it first though. I've currently got 2 standards I'm using across my repositories: build and build.sh. This is easy to automate:

build_script="./build";
if [ ! -x "${build_script}" ]; then build_script="./build.sh"; fi
# FUTURE: Add Makefile support here?
if [ ! -x "${build_script}" ]; then
    echo "[${SECONDS}] Error: Couldn't find the build script, or it wasn't marked as executable." >&2;
    exit 1;
fi

Now that we know where it is, we can execute it. Before we do though, as a little extra I like to run shellcheck over it - since we assume that it's a shell script too (though it might call something that isn't a shell script):

echo "----------------------------------------------------------------";
echo "------------------ Shellcheck of build script ------------------";
set +e; # Allow shellcheck errors - we just warn about them
shellcheck "${build_script}";
set -e;
echo "----------------------------------------------------------------";

I can highly recommend shellcheck - it finds a number of potential issues in both style and syntax that might cause your shell scripts to behave in unexpected ways. I've learnt a bunch about shell scripting and really improved my skills from using it on a regular basis.
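
To give a flavour of the kind of thing it catches, unquoted variable expansions are a classic:

# shellcheck flags the unquoted expansion below (SC2086), because word
# splitting means this tries to delete 'my' and 'file.txt' rather than
# the single file 'my file.txt'.
files="my file.txt";
rm $files;
# The quoted version does what was intended:
rm "${files}";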

Finally, we can now actually execute the build script:

echo "[${SECONDS}] Executing '${build_script} ci'";

nice -n10 ${build_script} ci

I pass the argument ci here, since the lantern build engine takes task names as arguments on the command line. If it's not a lantern script, then it can be interpreted as a helpful hint as to the environment that it's running in.

I also nice it to lower its scheduling priority, since I actually have my Laminar CI server running on a Raspberry Pi and its resources are rather limited. Oddly, I found that I'd lose other essential services (e.g. SSH) if I didn't do this - presumably because build tasks are usually quite computationally expensive and were starving everything else of CPU time.

That completes the build script. Of course, when the above finishes executing the trap that we set earlier will trigger and the build status reported. I'll include the full script at the bottom of this post.

This was a long post! We've taken a deep dive into the engine that powers my build system. In the next few posts, I'd like to talk about the git post-receive hook I've been mentioning that triggers this job. I'd also like to talk formally about the lantern build engine - what it is, where it came from, and how it works.

Found this interesting? Spotted a mistake? Got a suggestion? Confused about something? Comment below!

#!/usr/bin/env bash
set -e; # Don't allow errors

# Check that all the right variables are present
if [ -z "${GIT_REPO_NAME}" ]; then echo -e "Error: The environment variable GIT_REPO_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REF_NAME}" ]; then echo -e "Error: The environment variable GIT_REF_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REPO_URL}" ]; then echo -e "Error: The environment variable GIT_REPO_URL isn't set." >&2; exit 1; fi
if [ -z "${GIT_COMMIT_REF}" ]; then echo -e "Error: The environment variable GIT_COMMIT_REF isn't set." >&2; exit 1; fi
if [ -z "${GIT_AUTHOR}" ]; then echo -e "Error: The environment variable GIT_AUTHOR isn't set." >&2; exit 1; fi

# It's checked directly anyway
# shellcheck disable=SC1091
source source_regex_match.sh;

GIT_REF_TYPE="$(regex_match "${GIT_REF_NAME}" 'refs/([a-z]+)')";

if [[ "${GIT_REF_TYPE}" == "tags" ]]; then
    GIT_TAG_NAME="$(regex_match "${GIT_REF_NAME}" 'refs/tags/(.*)$')";
fi

# NOTE: These only work with SSH urls.
GIT_REPO_OWNER="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=:)[^/]+(?=/)')";
GIT_REPO_NAME_SHORT="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=/)[^/]+(?=\.git$)')";
GIT_SERVER_DOMAIN="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=@)[^/]+(?=:)')";

export GIT_REPO_OWNER GIT_REPO_NAME_SHORT GIT_SERVER_DOMAIN GIT_REF_TYPE GIT_TAG_NAME;

###############################################################################

# Example URL: [email protected]:sbrl/rhinoreminds.git
# Environment variables:
#   GIT_REPO_NAME           git-starbeamrainbowlabs-com-sbrl-rhinoreminds
#   GIT_REF_NAME            refs/heads/master, refs/tags/v0.1.1-build7
#   GIT_REF_TYPE            heads, tags
#       Determined dynamically from GIT_REF_NAME.
#   GIT_TAG_NAME            v0.1.1-build7
#       Determined dynamically from GIT_REF_NAME, only set if GIT_REF_TYPE == "tags".
#   GIT_REPO_URL            [email protected]:sbrl/rhinoreminds.git
#   GIT_COMMIT_REF          e23b2e0f3c0b9f48effebca24db48d9a3f028a61
#   GIT_AUTHOR              bob
# Generated:
#   GIT_SERVER_DOMAIN       git.starbeamrainbowlabs.com
#   GIT_REPO_OWNER          sbrl
#   GIT_REPO_NAME_SHORT     rhinoreminds
#   GIT_RUN_SOURCE          github
#       Not always set. If not set then assume git.starbeamrainbowlabs.com


send-status-gitea "${GIT_COMMIT_REF}" "pending" "Executing build....";

# Runs on exit, no matter what
cleanup() {
    original_exit_code="$?";

    status="success";
    description="Build ${RUN} succeeded in $(human-duration "${SECONDS}").";
    if [[ "${original_exit_code}" -ne "0" ]]; then
        status="failed";
        description="Build failed with exit code ${original_exit_code} after $(human-duration "${SECONDS}")";
    fi

    send-status-gitea "${GIT_COMMIT_REF}" "${status}" "${description}";
}

trap cleanup EXIT;

###############################################################################


repo_job_name="$(echo "${GIT_REPO_NAME}" | tr '/' '--')";
if [ "${JOB}" == "git-repo" ]; then
    # If the job file doesn't exist, create it
    # We create a symlink here because this is a 'smart' job - whose
    # behaviour changes dynamically based on the job name.
    if [ ! -e "${LAMINAR_HOME}/cfg/jobs/${repo_job_name}.run" ]; then
        pushd "${LAMINAR_HOME}/cfg/jobs";
        ln -s "git-repo.run" "${repo_job_name}.run";
        popd
    fi

    # Queue our new hologram
    LAMINAR_REASON="git push by ${GIT_AUTHOR} to ${GIT_REF_NAME}" laminarc queue "${repo_job_name}" GIT_REPO_NAME="${GIT_REPO_NAME}" GIT_REF_NAME="${GIT_REF_NAME}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${GIT_COMMIT_REF}" GIT_AUTHOR="${GIT_AUTHOR}";
    # If we got to here, we queued the hologram successfully
    # Clear the trap, because we know that the trap for the hologram will fire
    # This avoids sending a 2nd status to Gitea, linking the user to the wrong place
    trap - EXIT;

    exit 0;
fi

# We're running in hologram mode!

# Remember the run directory - we'll need it later
run_directory="$(pwd)";

# Important directories:
# $WORKSPACE        Shared between all runs of a job
# $run_directory    The initial directory a run lands in. Empty and run-specific.
# $ARCHIVE          Also run-specific, but its contents are persisted after the run ends


# Acquire a lock for this repo
#laminarc lock "${JOB}-workspace";
exec 9<"${WORKSPACE}";
flock --exclusive 9;
###############################################################################
# No need to allow errors here, because the lock will automagically be released 
# if the process crashes, as that'll close the file descriptor anyway :P
echo "[${SECONDS}] Lock acquired";

cd "${WORKSPACE}";

# If we haven't already, clone the repository
git_directory="$(echo "${GIT_REPO_URL}" | grep -oP '(?<=/)(.+)(?=.git$)')";
if [ ! -d "${git_directory}" ]; then
    echo "[${SECONDS}] Cloning repository";
    git clone "${GIT_REPO_URL}";
fi
cd "${git_directory}";


# Pull down any updates that are available
echo "[${SECONDS}] Downloading commits";
git fetch origin;
# Checkout the commit we're interested in testing
echo "[${SECONDS}] Checking out ${GIT_COMMIT_REF}";
git checkout "${GIT_COMMIT_REF}";

echo "[${SECONDS}] Linking source to run directory";
# Hard-link the repo content to the run directory
# This is important because then we can allow multiple runs of the same repo at the same time without using extra disk space
# -r    Recursive mode
# -a    Preserve permissions
# -l    Hardlink instead of copy
cp -ral ./ "${run_directory}";
# Don't forget the .git directory, .gitattributes, .gitmodules, .gitignore, etc.
# This is required for submodules and other functionality, but likely won't be edited - hence we can hardlink here (I think).
# NOTE: If we see weirdness with multiple runs at a time, then we'll need to do something about this.
cp -ral ./.git* "${run_directory}/.git";
echo "[${SECONDS}] done";

# Go back to the job-specific run directory
cd "${run_directory}";

###############################################################################
# Release the lock
exec 9>&- # Close file descriptor 9 and release lock
#laminarc release "${JOB}-workspace";


echo "[${SECONDS}] Lock released";

echo "[${SECONDS}] Finding build script";

build_script="./build";
if [ ! -x "${build_script}" ]; then build_script="./build.sh"; fi
# FUTURE: Add Makefile support here?
if [ ! -x "${build_script}" ]; then
    echo "[${SECONDS}] Error: Couldn't find the build script, or it wasn't marked as executable." >&2;
    exit 1;
fi


echo "[${SECONDS}] Executing '${build_script} ci'";

echo "----------------------------------------------------------------";
echo "------------------ Shellcheck of build script ------------------";
set +e; # Allow shellcheck errors - we just warn about them
shellcheck "${build_script}";
set -e;
echo "----------------------------------------------------------------";


nice -n10 ${build_script} ci

Monitoring HTTP server response time with collectd and a bit of bash

In the spirit of the last few posts I've been making here (A and B), I'd like to talk a bit about collectd, which I use to monitor the status of my infrastructure. Currently this consists of the server you've connected to in order to view this webpage, and a Raspberry Pi that acts as a home file server.

I realised recently that monitoring the various services that I run (such as my personal git server for instance) would be a good idea, as I'd rather like to know when they go down or act abnormally.

As a first step towards this, I decided to configure my existing collectd setup to monitor the response time of the HTTP endpoints of these services. Later on, I can then configure some alerts to message me when something goes down.

My first thought was to check the plugin list to see if there was one that would do the trick. As you might have guessed by the title of this post, however, such an easy solution would have been too uninteresting to be worth a blog post.

Since such a plugin doesn't (yet?) exist, I turned to the exec plugin instead.

In short, it lets you write a program that writes to the standard output in the collectd plain text protocol, which collectd then interprets and adds to whichever data storage backend you have configured.
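
A single sample in the plain text protocol looks something like this (the identifier and value here are made up for illustration - we'll construct the real thing later on):

PUTVAL "examplehost/exec-demo/gauge-example" interval=60 N:42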

Since shebangs are a thing on Linux, I could technically choose any language I have an interpreter installed for, but to keep things (relatively) simple, I chose Bash, the language your local terminal probably speaks (unless it speaks zsh or fish instead).

My priorities were to write a script that is:

  1. Easy to reconfigure
  2. Ultra lightweight

Bash supports associative arrays, so I can cover point #1 pretty easily like this:

declare -A targets=(
    ["main_website"]="https://starbeamrainbowlabs.com/"
    ["git"]="https://git.starbeamrainbowlabs.com/"
    # .....
)

Excellent! Covering point #2 will be an on-going process that I'll need to keep in mind as I write this script. I found this GitHub repository a while back, which has served as a great reference point in the past. Here's hoping it'll be useful this time too!

It's important to note the structure of the script that we're trying to write. Collectd exec scripts have 2 main environment variables we need to take notice of:

  • COLLECTD_HOSTNAME - The hostname of the local machine
  • COLLECTD_INTERVAL - Interval at which we should collect data. Defined in collectd.conf.

The script should write to the standard output the values we've collected in the collectd plain text format every COLLECTD_INTERVAL. Collectd will automatically ensure that only 1 instance of our script is running at once, and will also automatically restart it if it crashes.

To run a command regularly at a set interval, we probably want a while loop like this:

while :; do
    # Do our stuff here

    sleep "${COLLECTD_INTERVAL}";
done

This is a great start, but it isn't really compliant with objective #2 we defined above. sleep is actually a separate command that spawns a new process. That's an expensive operation, since it has to allocate memory for a new stack and create a new entry in the process table.

We can avoid this by abusing the read command timeout, like this:

# Pure-bash alternative to sleep.
# Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever
snore() {
    local IFS;
    [[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:);
    read ${1:+-t "$1"} -u $_snore_fd || :;
}

Thanks to bolt for this.

Next, we need to iterate over the array of targets we defined above. We can do that with a for loop:

while :; do
    for target in "${!targets[@]}"; do
        check_target "${target}" "${targets[${target}]}"
    done

    snore "${COLLECTD_INTERVAL}";
done

Here we call a function check_target that will contain our main measurement logic. We've changed sleep to snore too - our new subprocess-less sleep alternative.

Note that we're calling check_target for each target one at a time. This is important for 2 reasons:

  • We don't want to potentially skew the results by taking multiple measurements at once (e.g. if we want to measure multiple PHP applications that sit in the same process pool, or measure more applications than we have CPUs)
  • It actually spawns a subprocess for each function invocation if we push them into the background with the & operator. As I've explained above, we want to try and avoid this to keep it lightweight.

Next, we need to figure out how to do the measuring. I'm going to do this with curl. First though, we need to setup the function and bring in the arguments:

# $1 - target name
# $2 - url
check_target() {
    local target_name="${1}"
    local url="${2}";

    # ......
}

Excellent. Now, let's use curl to do the measurement itself:

curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}" "${url}"

This looks complicated (and it probably is to some extent), but let's break it down with the help of explainshell.

  • -sS - Squashes all output except for errors and the bits we want. Great for scripts like ours.
  • --user-agent - Specifies the user agent string to use when making a request. All good internet citizens should specify a descriptive one (more on this later).
  • -o /dev/null - We're not interested in the content we download, so this sends it straight to the bin.
  • --max-time 5 - This sets a timeout of 5 seconds for the whole operation - after which curl will throw an error and return with exit code 28.
  • -w "%{http_code}\n%{time_total}" - This allows us to pull out metadata about the request we're interested in. There's actually a whole range available, but for now I'm interested in how long it took and the response code returned.
  • "${url}" - Specifies the URL to send the request to. curl does actually support making more than 1 request at once, but utilising this functionality is out-of-scope for now (and we'd get skewed results because it re-uses connections - which is normally really helpful & performance boosting).

To parse the output we get from curl, I found the readarray command after going a bit array mad at the beginning of this post. It pulls every line of input into a new slot in an array for us - and since we can control the delimiter between values with curl, it's perfect for parsing the output. Let's hook that up now:

readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}" "${url}");

The weird command < <(another_command); syntax is process substitution. It achieves much the same thing as the another_command | command syntax, with one crucial difference: each part of a pipeline runs in its own subshell. readarray (a bash builtin) sets its array variable in whichever shell it runs in, so with the a | b syntax we'd instantly lose access to that variable when the subshell exits - hence the weird process substitution, which keeps readarray running in the current shell.
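
A quick demonstration of the difference, safe to paste into an interactive shell:

unset lines;
# With a pipe, readarray runs in a subshell, so the array vanishes with it:
printf 'a\nb\n' | readarray -t lines;
echo "${#lines[@]}";    # prints 0 - the array was set in a subshell and is gone
# With process substitution, readarray runs in the current shell:
readarray -t lines < <(printf 'a\nb\n');
echo "${#lines[@]}";    # prints 2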

Now that we've got the output from curl parsed and ready to go, we need to handle failures next. This is a little on the nasty side, as by default bash won't give us the non-zero exit code from substituted processes. Hence, we need to tweak our already long arcane incantation a bit more:

readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}\n" "${url}"; echo "${PIPESTATUS[*]}");

Thanks to this answer on StackOverflow for ${PIPESTATUS}. Now we have an array called result with 3 elements in it:

  • 0 - The HTTP response code
  • 1 - The time taken in seconds
  • 2 - The exit code of curl

With this information, we can now detect errors and abort continuing if we detect one. We know there was an error if any of the following occur:

  • curl returned a non-zero exit code
  • The HTTP response code isn't 2XX or 3XX

Let's implement that in bash:

if [[ "${result[2]}" -ne 0 ]] || [[ "${result[0]}" -lt "200" ]] || [[ "${result[0]}" -gt "399" ]]; then
    return
fi

Again, let's break it down:

  • [[ "${result[2]}" -ne 0 ]] - Detect a non-zero exit code from curl
  • [[ "${result[0]}" -lt "200" ]] - Detect if the HTTP response code is less than 200
  • [[ "${result[0]}" -gt "399" ]] - Detect if the HTTP response code is greater than 399

In the future, we probably want to output a notification here of some sort instead of just simply silently returning, but for now it's fine.

Finally, we can now output the result in the right format for collectd to consume. Collectd operates on identifiers, values, and intervals. A bit of head-scratching and documentation reading later, and I determined the correct identifier format for the task. I wanted to have all the readings on the same graph so I could compare the different response times (just like the ping plugin does), so we want something like this:

bobsrockets.com/http_services/response_time-TARGET_NAME

....where we replace bobsrockets.com with ${COLLECTD_HOSTNAME}, and TARGET_NAME with the name of the target we're measuring (${target_name} from above).

We can do this like so:

echo "PUTVAL \"${COLLECTD_HOSTNAME}/http_services/response_time-${target_name}\" interval=${COLLECTD_I
NTERVAL} N:${result[1]}";

Here's an example of it in action:

PUTVAL "HOSTNAME_HERE/http_services/response_time-git" interval=300.000 N:0.118283
PUTVAL "HOSTNAME_HERE/http_services/response_time-main_website" interval=300.000 N:0.112073

It does seem to run through the items in the array in a rather strange order (Bash doesn't guarantee any particular iteration order for associative arrays), but so long as it does iterate the whole lot, I don't really care.

I'll include the full script at the bottom of this post, so all that's left to do is to point collectd at our new script like this in /etc/collectd.conf:

LoadPlugin  exec

# .....

<Plugin exec>
    Exec    "nobody:nogroup"        "/etc/collectd/http_response_times.sh"  "measure"
</Plugin>

I've added measure as an argument there for future-proofing, as it looks like we may have to run a separate instance of the script for sending notifications if I understand the documentation correctly (I need to do some research.....).
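
One way that argument could eventually be used is to dispatch on it at the top of the script - a hypothetical sketch, since only measuring is implemented at the moment:

# Hypothetical: branch on the mode passed as the first argument in collectd.conf.
mode="${1:-measure}";
case "${mode}" in
    measure)
        # Fall through to the measurement loop below
        ;;
    notify)
        echo "Error: notification mode isn't implemented yet." >&2;
        exit 1;
        ;;
    *)
        echo "Error: unknown mode '${mode}'." >&2;
        exit 1;
        ;;
esac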

Very cool. It's taken a few clever tricks, but we've managed to write an efficient script for measuring http response times. We've made it more efficient by exploiting read timeouts and other such things. While we won't gain a huge amount of speed from this (bash is pretty lightweight already - this script is weighing in at just ~3.64MiB of private RAM O.o), it will all add up over time - especially considering how often this will be running.

In the future, I'll definitely want to take a look at implementing some alerts to notify me if a service is down - but that will be a separate post, as this one is getting quite long :P

Found this interesting? Got another way of doing this? Curious about something? Comment below!


Full Script

#!/usr/bin/env bash
set -o pipefail;

# Variables:
#   COLLECTD_INTERVAL   Interval at which to collect data
#   COLLECTD_HOSTNAME   The hostname of the local machine

declare -A targets=(
    ["main_website"]="https://starbeamrainbowlabs.com/"
    ["webmail"]="https://mail.starbeamrainbowlabs.com/"
    ["git"]="https://git.starbeamrainbowlabs.com/"
    ["nextcloud"]="https://nextcloud.starbeamrainbowlabs.com/"
)
# These are only done once, so external commands are ok
version="0.1+$(date +%Y%m%d -r $(readlink -f "${0}"))";

user_agent="HttpResponseTimeMeasurer/${version} (Collectd Exec Plugin; $(uname -sm)) bash/${BASH_VERSION} curl/$(curl --version | head -n1 | cut -f2 -d' ')";

# echo "${user_agent}"

###############################################################################

# Pure-bash alternative to sleep.
# Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever
snore() {
    local IFS;
    [[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:);
    read ${1:+-t "$1"} -u $_snore_fd || :;
}

# Source: https://github.com/dylanaraps/pure-bash-bible#split-a-string-on-a-delimiter
split() {
    # Usage: split "string" "delimiter"
    IFS=$'\n' read -d "" -ra arr <<< "${1//$2/$'\n'}"
    printf '%s\n' "${arr[@]}"
}

# Source: https://github.com/dylanaraps/pure-bash-bible#get-the-number-of-lines-in-a-file
# Altered to operate on the standard input.
count_lines() {
    # Usage: lines <"file"
    mapfile -tn 0 lines
    printf '%s\n' "${#lines[@]}"
}

###############################################################################

# $1 - target name
# $2 - url
check_target() {
    local target_name="${1}"
    local url="${2}";

    readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}\n" "${url}"; echo "${PIPESTATUS[*]}");

    # 0 - http response code
    # 1 - time taken
    # 2 - curl exit code

    # Make sure the exit code is non-zero - this includes if curl hits a timeout error
    # Also ensure that the HTTP response code is valid - any 2xx or 3xx response code is ok
    if [[ "${result[2]}" -ne 0 ]] || [[ "${result[0]}" -lt "200" ]] || [[ "${result[0]}" -gt "399" ]]; then
        return
    fi

    echo "PUTVAL \"${COLLECTD_HOSTNAME}/http_services/response_time-${target_name}\" interval=${COLLECTD_INTERVAL} N:${result[1]}";
}

while :; do
    for target in "${!targets[@]}"; do
        # NOTE: We don't use concurrency here because that spawns additional subprocesses, which we want to try & avoid. Even though it looks slower, it's actually more efficient (and we don't potentially skew the results by measuring multiple things at once)
        check_target "${target}" "${targets[${target}]}"
    done

    snore "${COLLECTD_INTERVAL}";
done