Stardust | Starbeamrainbowlabs

How (not) to recover a consul cluster

Hello again! I'm still getting used to a new part-time position at University which I'm not quite ready to talk about yet, but in the meantime please bear with me as I shuffle my schedule around.

As I've explained previously on here, I have a consul cluster (superglue service discovery!) that forms the backbone of my infrastructure at home. Recently, I had a small powercut that knocked everything offline, and as the recovery process was quite interesting I thought I'd blog about it here.

The issue happened at about 5pm, but I only discovered it was a problem a few hours later when I got home. Essentially, a small powercut knocked everything offline. While my NAS rebooted automatically afterwards, my collection of Raspberry Pis weren't so lucky. I can only suspect that they were caught in some transient state or something. None of them responded when I pinged them, and later inspection of the logs on my collectd instance revealed that they were essentially non-functional until after they were rebooted manually.

A side effect of this was that my Consul cluster (and, by extension, my Nomad cluster) was knocked offline.

Anyway, at first I only rebooted the controller host (that has both a Consul and Nomad server running on it, but does not accept and run jobs). This rebooted just fine and came back online, so I then rebooted my monitoring box (that also runs continuous integration), which also came back online.

Due to the significantly awkward physical location I keep my cluster in with the rest of the Pis, I decided to flip the power switch on the extension to restart all my hosts at the same time.

While this worked..... it also caused my cluster controller node to reboot, which caused its raft epoch number to increment by 1... which broke the quorum (agreement) of my cluster, and required manual intervention to resolve.

Raft quorum

To understand the specific issue here, we need to look at the Raft consensus algorithm. Raft is, as the name suggests, a consensus algorithm. Such an algorithm is useful when you have a cluster of servers that need to work together in a redundant fault-tolerant fashion on some common task, such as in our case Consul (service discovery) and Nomad (task scheduling).

The purpose of a raft server is to maintain agreement amongst all nodes in a cluster as to the global state of an application. It does this using a distributed log that it replicates through a fancy but surprisingly simple algorithm.

At the core of this algorithm is the concept of a leader. The cluster leader is responsible for managing and committing updates to the global state, as well as sending out the global state to everyone else in the cluster. In the case of Consul, the Consul servers are the cluster (the clients simply connect back to whichever servers are available) - and I have 3 of them, since an odd number of servers is recommended for Raft (adding a 4th server wouldn't let the cluster survive any more failures than 3 does).

When the cluster first starts up or the leader develops a fault (e.g. someone sets off a fork bomb on it just for giggles), an election occurs to decide on a new leader. The election term number (or epoch number) is incremented by one, and everyone votes on who the new leader should be. The node with the most votes becomes the new leader, and quorum (agreement) is achieved across the entire cluster.
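As a rough illustration of why odd cluster sizes are the usual recommendation, here's a small Bash sketch of the majority arithmetic used by Raft in general: a candidate needs votes from more than half of the servers to win an election. The function names quorum_size and fault_tolerance are my own for illustration, and aren't part of Consul or Nomad.

```shell
# Prints the number of votes needed for a majority in a cluster of $1 servers.
quorum_size() {
    local servers="$1";
    echo "$(( servers / 2 + 1 ))";
}

# Prints how many servers can be lost while a majority can still be formed.
fault_tolerance() {
    local servers="$1";
    echo "$(( servers - (servers / 2 + 1) ))";
}

for servers in 3 4 5; do
    echo "servers=${servers} quorum=$(quorum_size "${servers}") tolerated_failures=$(fault_tolerance "${servers}")";
done
```

Going from 3 servers to 4 raises the quorum from 2 to 3 without tolerating any more failures, which is why 3 or 5 servers is the typical choice.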

Consul and Raft

In the case of Consul, everyone must cast a vote for the vote to be considered valid, otherwise the vote is considered invalid and the election process must begin again. Crucially, the election term number must also be the same across everyone voting.

In my case, because I started my cluster controller and then rebooted it before it had a chance to achieve quorum, it incremented its election term number an additional time compared to the rest of the cluster. This caused the cluster to fail to reach quorum, as the other 2 nodes in the Consul server cluster considered the controller node's vote to be invalid, yet they still demanded that all servers vote to elect a new leader.

The practical effect of this was that because the Consul cluster failed to agree on who the leader should be, the Nomad cluster (which hangs off the Consul cluster, using it to find each other) also failed to start and subsequently reach quorum, which knocked all my jobs offline.

The solution

Thankfully, the Hashicorp Consul documentation for this specific issue is fabulous:

https://developer.hashicorp.com/consul/tutorials/datacenter-operations/recovery-outage#failure-of-a-server-in-a-multi-server-cluster

To summarise:

Boot the cluster as normal if it isn't booted already
Stop the failed node
Create a special config file (raft/peers.json) that will cause the failed node to drop its state and accept the state of the incomplete cluster, allowing it to rejoin and the cluster to gain collective quorum once more.

The documentation to perform this recovery protocol is quite clear. While there is an option to recover a failed node if you still have a working cluster with a leader, in my case I didn't so I had to use the alternate route.
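For reference, here's a rough sketch of generating a raft/peers.json file with Bash. The JSON layout (an array of objects with id, address, and non_voter keys) is my reading of the linked documentation for Raft protocol version 3, so do check it against the docs for your Consul version, including which servers need the file and when to stop and start them. The make_peers_json function name, the node IDs, and the IP addresses below are all made up for illustration; the real node ID of each server can be found in the node-id file in its data directory, and 8300 is Consul's default server RPC port.

```shell
# Prints a peers.json document on the standard output.
# $1..$n    One "node_id,ip:port" pair per Consul server.
make_peers_json() {
    local first=true;
    local entry;
    echo "[";
    for entry in "$@"; do
        local id="${entry%%,*}";      # everything before the first comma
        local address="${entry#*,}";  # everything after the first comma
        if [ "${first}" = true ]; then first=false; else echo ","; fi
        printf '  { "id": "%s", "address": "%s", "non_voter": false }' "${id}" "${address}";
    done
    echo;
    echo "]";
}

# Hypothetical node IDs and addresses, purely for illustration
make_peers_json \
    "11111111-aaaa-bbbb-cccc-000000000001,192.168.0.10:8300" \
    "11111111-aaaa-bbbb-cccc-000000000002,192.168.0.11:8300" \
    "11111111-aaaa-bbbb-cccc-000000000003,192.168.0.12:8300";
```

The output would then be redirected into raft/peers.json inside the Consul data directory.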

Conclusion

I've talked briefly about an interesting issue that caused my Consul cluster to break quorum, which inadvertently brought my entire infrastructure down until I resolved the issue.

While Consul is normally really quite resilient, you can break it if you aren't careful. Having an understanding of the underlying consensus algorithm Raft is very helpful to diagnosing and resolving issues, though the error messages and documentation I looked through were generally clear and helpful.

Centralising logs with rsyslog

I manage quite a number of servers at this point, and something that's been on my mind for a while now is centralising all the log files generated by them. By this, specifically I mean that I want to automatically gather all logs generated by all the systems I manage into a single place in real time.

While there are enterprise-grade log management setups such as the ELK stack (elasticsearch, logstash, and kibana), as far as I'm aware they are all quite heavy and given my infrastructure is Raspberry Pi based (seriously, they use hardly any electricity at all compared to a regular desktop PC), with such a setup I would likely need multiple Pis to run it.

With this in mind, I'm opting for a different kind of log management system, which I'm basing on rsyslog (which is installed by default in most Linux distros) and lnav (which I've blogged about before: lnav basics tutorial). This runs much lighter, requiring only a fraction of a Raspberry Pi to operate, which is good since the Raspberry Pi I've dedicated to monitoring the rest of the infrastructure currently also handles:

Continuous Integration: Laminar (this will eventually be a Docker container on my Hashicorp Nomad cluster)
Collectd (Collectd is really easy to setup and runs so light, I love it)

I'm sure you might be asking yourself what the purpose of this is. My reasoning is fourfold:

Having all the logs in one place makes them easier to analyse all at once, without having to SSH into many different servers
If a box goes down, then I can read the logs from it before I start attempting to fix it, giving me a heads up as to what the problem is (this, in conjunction with my collectd monitoring system)
On the Raspberry Pis I manage, this prolongs the life of the microSD cards by reducing the number of writes thereto
I gain a little bit of security, in that if a box is compromised, then unless the attacker also gains access to my logging server, they can't erase their tracks as easily as they might otherwise have done

With all this in mind, I thought that it's about time I actually did something about this. I've found that while the solution is actually really quite simple, it's not particularly easy to find, so I thought I'd post about it here.

In my setup, I'm going to be using a Raspberry Pi 4 (4GB RAM) I've dubbed eldarion as the server upon which I centralise my logs. It's the successor to an earlier Raspberry Pi 3B+ I called elessar, which died some years prior. It has a 120GB SATA SSD attached in a case that used to house a WD PiDrive (they don't sell those anymore :-/) that I had lying around, which I've formatted with Btrfs.

Before we begin, let's outline the setup we're aiming for with a diagram to avoid confusion:

A diagram of the rsyslog setup we're aiming for. See explanation below.

eldarion will host the rsyslog server (which is essentially just a reconfiguration of the existing rsyslog server it is most likely already running), while other servers connect using the syslog protocol via a TCP connection, which is encrypted with TLS, using the GnuTLS engine (the default built into rsyslog). TLS here is important, since logs are naturally rather sensitive as I'm sure you can imagine.

To follow along here, you will need a valid Let's Encrypt certificate. It just so happens that I have a web server hosting my collectd graph panel interface, so I'm using that.

Of course, rsyslog can be configured in arbitrarily complex ways (such as having clients send logs to servers that they themselves forward to yet other servers), but at least for now I'm keeping it (relatively) simple.

Preparing the server

To start this process, we want to ensure the logs for the local system are stored in the right place. In my case, I have my SSD mounted to /mnt/eldarion-data2, so I want to put my logs in /mnt/eldarion-data2/syslog/localhost. There are 2 ways of accomplishing this:

Reconfigure rsyslog to save logs elsewhere
Be lazy, and bind mount the target location to /var/log

Since I'm feeling lazy today, I'm going to go with option 2 here. It's also a good idea if a program is badly written and decides it's a brilliant idea to write logs directly to /var/log itself instead of going through syslog.

If you're using DietPi, before you continue, do sudo dietpi-software and remove the existing logging system.

A bind mount is like a hard link of a directory, in that it makes a directory appear in multiple places at once. It acts as a separate "filesystem" though, which I assume is to avoid infinite loops. Bind mounts are also the tech behind volumes in Docker's backend containerd.

Open /etc/fstab for editing, and add something like this on a new line:

/mnt/eldarion-data2/syslog/localhost    /var/log    none    auto,defaults,bind  0   0

..where /mnt/eldarion-data2/syslog/localhost is the location we want the data to be stored, and /var/log is the location we want to bind mount it to. Save and close /etc/fstab, and then mount the bind mount like so. Make sure /var/log is empty before mounting!

sudo mount /var/log
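Since mounting over a non-empty directory hides whatever was in there, it can be worth checking first. Here's a minimal sketch of such a check; the dir_is_empty function name is just something I've made up for illustration:

```shell
# Succeeds (exit code 0) only if $1 is a directory containing no entries,
# hidden files included. Returns 2 if $1 is not a directory at all.
dir_is_empty() {
    local dir="$1";
    [ -d "${dir}" ] || return 2;
    [ -z "$(ls -A "${dir}")" ];
}

# Example usage: only mount if the mount point is empty.
# if dir_is_empty /var/log; then
#     sudo mount /var/log;
# else
#     echo "/var/log is not empty - move its contents aside first" >&2;
# fi
```

The same check applies to any other mount point you want to mount something over.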

Next, we need to install some dependencies:

sudo apt install rsyslog rsyslog-gnutls

For some strange reason, TLS support is in a separate package on Debian-based systems. You'll need to investigate package names and translate this command for your distribution, of course.

Configuring the server

Now we have that taken care of, we can actually configure our server. Open /etc/rsyslog.conf for editing, and at the top put this:

# The $Thing syntax is apparently 'legacy', but I can't find how else we're supposed to do this
$DefaultNetstreamDriver gtls
$DefaultNetstreamDriverCAFile   /etc/letsencrypt/live/mooncarrot.space/chain.pem
$DefaultNetstreamDriverCertFile /etc/letsencrypt/live/mooncarrot.space/cert.pem
$DefaultNetstreamDriverKeyFile  /etc/letsencrypt/live/mooncarrot.space/privkey.pem

# StreamDriver.Mode=1 means TLS-only mode
module(load="imtcp" MaxSessions="500" StreamDriver.Mode="1" StreamDriver.AuthMode="anon")
input(type="imtcp" port="514")

$template remote-incoming-logs,"/mnt/eldarion-data2/syslog/hosts/%HOSTNAME%/%PROGRAMNAME%.log"
*.* ?remote-incoming-logs

You'll need to edit these bits to match your own setup:

/etc/letsencrypt/live/mooncarrot.space/: Path to the live directory there that contains the symlinks to the certs your Let's Encrypt client obtained for you
/mnt/eldarion-data2/syslog/hosts: The path to the directory we want to store the logs in

Save and close this, and then restart your server like so:

sudo systemctl restart rsyslog.service

Then, check to see if there were any errors:

sudo systemctl status rsyslog.service

Lastly, I recommend assigning a DNS subdomain to the server hosting the logs, such as logs.mooncarrot.space in my case. A single server can have multiple domain names of course, and this just makes it convenient if we ever move the rsyslog server elsewhere - as we won't have to go around and edit like a dozen config files (which would be very annoying and tedious).

Configuring a client

Now that we have our rsyslog server setup, it should be relatively straightforward to configure a client box to send logs there. This is a 3 step process:

Configure the existing /var/log to be an in-memory tmpfs to avoid any potential writes to disk
Add a cron script to wipe /var/log every hour to avoid it getting full by accident
Reconfigure (and install, if necessary) rsyslog to send logs to our shiny new server rather than save them to disk

If you haven't already configured /var/log to be an in-memory tmpfs, it is relatively simple. If you're unsure whether it is or not, do df -h.
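If you'd rather check this from a script than read through the output of df -h by eye, something like the following sketch might do. It assumes a Linux system where /proc/mounts is available, and the fstype_of function name is hypothetical:

```shell
# Prints the filesystem type mounted at the mount point $1.
# $2    Optional: the mounts table to read. Defaults to /proc/mounts.
fstype_of() {
    local mount_point="$1";
    local mounts_file="${2:-/proc/mounts}";
    # Fields are: device, mount point, filesystem type, options, ...
    # If several filesystems are stacked on 1 mount point, the last one wins.
    awk -v mp="${mount_point}" \
        '$2 == mp { fstype = $3 } END { if (fstype != "") print fstype }' \
        "${mounts_file}";
}

# Example usage:
# if [ "$(fstype_of /var/log)" = "tmpfs" ]; then
#     echo "/var/log is already a tmpfs";
# fi
```

If nothing is mounted at the given path, then nothing is printed at all.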

First, open /etc/fstab for editing, and add the following line somewhere:

tmpfs /var/log tmpfs size=50M,noatime,lazytime,nodev,nosuid,noexec,mode=1777

Then, save + close it, and mount /var/log. Again, make sure /var/log is empty before mounting! Weird things happen if you don't.

sudo mount /var/log

Secondly, save the following to /etc/cron.hourly/clear-logs:

#!/usr/bin/env bash
rm -rf /var/log/*

Then, mark it executable:

sudo chmod +x /etc/cron.hourly/clear-logs
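One thing to be aware of is that by default the /var/log/* glob doesn't match hidden files (those whose names start with a dot). If that matters to you, here's a sketch of an alternative using find instead. This is not the script I'm actually running, and the clear_dir_contents function name is made up for illustration:

```shell
# Deletes everything inside the directory $1 (hidden files included),
# but leaves the directory itself in place.
clear_dir_contents() {
    local dir="$1";
    [ -d "${dir}" ] || return 1;
    # -mindepth 1 stops find from trying to delete the directory itself
    find "${dir}" -mindepth 1 -delete;
}

# Example usage (in /etc/cron.hourly/clear-logs):
# clear_dir_contents /var/log;
```

Because the directory itself is never removed, the tmpfs mount point stays intact.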

Lastly, we can reconfigure rsyslog. The specifics of how you do this varies depending on what you want to achieve, but for a host where I want to send all the logs to the rsyslog server and avoid saving them to the local in-memory tmpfs at all, I have a config file like this:

#################
#### MODULES ####
#################

module(load="imuxsock") # provides support for local system logging
module(load="imklog")   # provides kernel logging support
#module(load="immark")  # provides --MARK-- message capability

###########################
#### GLOBAL DIRECTIVES ####
###########################

$IncludeConfig /etc/rsyslog.d/*.conf

# Where to place spool and state files
$WorkDirectory /var/spool/rsyslog

###############
#### RULES ####
###############
$DefaultNetstreamDriverCAFile   /etc/ssl/isrg-root-x1-cross-signed.pem
$DefaultNetstreamDriver         gtls
$ActionSendStreamDriverMode     1       # Require TLS
$ActionSendStreamDriverAuthMode anon
*.* @@(o)logs.mooncarrot.space:514  # Forward everything to our rsyslog server

#
# Emergencies are sent to everybody logged in.
#
*.emerg             :omusrmsg:*

This config needs to be saved to /etc/rsyslog.conf. In this case, I replace the entire config file with the above, but you can pick and choose (e.g. on some hosts I want to save to the local disk and to the rsyslog server).

In the above you'll need to change the logs.mooncarrot.space bit - this should be the (sub)domain that you pointed at your rsyslog server earlier. The number after the colon (514) is the port number. The *.* tells it to send everything to the remote rsyslog server.

Before we're done here, we need to provide the rsyslog client with the CA certificate of the server (because, apparently, it isn't capable of ferreting around in /etc/ssl/certs like everyone else is). Since I'm using Let's Encrypt here, I downloaded their root certificate like this and it seemed to do the job:

sudo curl -sSL https://letsencrypt.org/certs/isrg-root-x1-cross-signed.pem -o /etc/ssl/isrg-root-x1-cross-signed.pem

Of course, one could generate their own CA and do mutual authentication for added security, but that's complicated, lots of effort, and probably unnecessary for my purposes as far as I can tell. I'll leave a link in the sources and further reading on how to do this if you're interested.

If you have a different setup, it's the $DefaultNetstreamDriverCAFile in the above you need to change to point at your actual CA certificate.

With that all configured, we can now restart the rsyslog client:

sudo systemctl restart rsyslog.service

...and, of course, check to see if there were any errors:

sudo systemctl status rsyslog.service

Finally, we also need to configure logrotate to rotate all these new log files. First, install logrotate if the logrotate command doesn't exist:

sudo apt install logrotate

Then, place the following in the file /etc/logrotate.d/centralisedlogging:

/mnt/eldarion-data2/syslog/hosts/*/*.log {
    rotate 12
    weekly
    missingok
    notifempty
    compress
    delaycompress
}

Of course, you'll want to replace /mnt/eldarion-data2/syslog/hosts/ with the directory you're storing the logs from the remote server in, and also customise the log rotation. For example, the 12 there is the number of old log files to keep, and weekly can be swapped for daily or even monthly if you like.

Conclusion

This has been a very quick whistle-stop tour of setting up an rsyslog server to centralise your logs. We've set up our rsyslog server to use a TLS encrypted connection to receive logs, which 1 or more clients can send logs to. We've also configured /var/log on both the server and the client to avoid awkward issues.

Moving forwards, I recommend reading my lnav basics tutorial blog post, which should be rather helpful in analysing the resulting log files.

lnav was not helpful however when I asked it to look at all the log files separately with sudo lnav */*.log, deciding to treat them as "generic logs" rather than "syslog logs", meaning that it didn't colour them properly, and also didn't allow for proper filtering. To this end, it may be beneficial to store all the logs in 1 file rather than in separate files. I'll keep an eye on this, and update this post if I figure out how to convince lnav to treat them properly.

Another slight snag with my approach here is that for some reason all the logs from elsewhere also end up in the generic /var/log/syslog file (hence how I found a 'workaround' for the above issue), resulting in duplicated logs. I have yet to find a solution to this issue, but I'm also not sure whether I want to keep the logs in 1 big file or in many smaller files yet.

These issues aside, I'm pretty satisfied with the results. Together with my existing collectd-based monitoring system (which I'll blog about how I've set that up if there's any interest - collectd is really easy to use), this is another step towards greater transparency into the infrastructure I manage.

In the future, I want to investigate generating notification alerts for issues in my infrastructure. These could come either from collectd, or from rsyslog, and I envision them going to a variety of places:

Email (a daily digest perhaps?)
XMPP (I've bridged to it from shell scripts before)

Given that my infrastructure is just something I run at home and I don't mind so much if it's down for a few hours, my focus here is not on notifying myself as soon as possible, but on notifying myself in a way that doesn't disturb me so I can check into it in my own time.

If you found this tutorial / guide useful, please do comment below! It's really cool and motivating to see that the stuff I post on here helps others out.

Sources and further reading

Encrypting Syslog Traffic with TLS (SSL) [short version] - in implementing centralised logging and writing this blog post, I largely followed this tutorial
Encrypting Syslog Traffic with TLS (SSL) [long version]
rsyslog docs
rsyslog list of modules
Let's Encrypt root certificates list
lnav basics tutorial
lnav
Collectd
Bridging the gap between XMPP and shell scripts
ELK stack installation tutorial on Digital Ocean (elasticsearch, logstash, and kibana)
Btrfs

Using whiptail for text-based user interfaces

One of my ongoing projects is to implement a Bash-based Raspberry Pi provisioning system for hosts in my Raspberry Pi cluster. This is particularly important given that Debian 11 bullseye was released a number of months ago, and while it is technically possible to upgrade a host in-place from Debian 10 buster to Debian 11 bullseye, this is a lot of work that I'd rather avoid.

In implementing a Bash-based provisioning system, I'll have a system that allows me to rapidly provision a brand-new DietPi (or potentially other OSes in the future, but that's out-of-scope of version 1) automatically. Once the provisioning process is complete, I need only reboot it and potentially set a static IP address on my router and I'll then have a fully functional cluster host that requires no additional intervention (except to update it regularly of course).

The difficulty here is I don't yet have enough hosts in my cluster that I can have a clear server / worker division, since my Hashicorp Nomad and Consul clusters both have 3 server nodes for redundancy rather than 1. It is for this reason I need a system in my provisioning system that can ask me what configuration I want the new host to have.

To do this, I rediscovered the whiptail command, which is installed by default on pretty much every system I've encountered so far. It allows you to develop surprisingly flexible text-based user interfaces with relatively little effort, so I wanted to share it here.

Unfortunately, while it's very cool and also relatively easy to use, it also has a lot of options and can result in command invocations like this:

whiptail --title "Some title" --inputbox "Enter a hostname:" 10 40 "default_value" 3>&1 1>&2 2>&3;

...and it only gets more complicated from here. In particular the 3>&1 1>&2 2>&3 bit there is a fancy way of flipping the standard output and standard error.
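To see the file descriptor swap in isolation, here's a small sketch that doesn't need whiptail at all. The fake_dialog function is a made-up stand-in that behaves the way whiptail does in this respect: it draws its interface on the standard output, and writes the user's answer to the standard error.

```shell
# Hypothetical stand-in for whiptail.
fake_dialog() {
    echo "[pretend this is the dialog box]";  # the "interface", on stdout
    echo "the answer" >&2;                    # the result, on stderr
}

# Inside $(...), stdout is the pipe that captures the command's output.
# 3>&1  file descriptor 3 now points at that capture pipe
# 1>&2  stdout now points at the original stderr (usually the terminal)
# 2>&3  stderr now points at the capture pipe
answer="$(fake_dialog 3>&1 1>&2 2>&3)";

echo "Captured: ${answer}";
```

Without the swap, the variable would end up holding the interface text instead, and the actual answer would be printed to the terminal.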

I thought to myself that surely there must be a way that I can simplify this down to make it easier to use, so I implemented a number of wrapper functions:

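# Asks the user a yes/no question.
# $1    The question to ask.
# Uses the global variables step_current and step_max in the window title.
# Returns an exit code of 0 for yes, and non-zero for no.
ask_yesno() {
    local question="$1";

    # whiptail expects the height first, and then the width
    whiptail --title "Step ${step_current} / ${step_max}" --yesno "${question}" 8 40;
    return "$?"; # Not actually needed, but best to be explicit
}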

This first one asks a simple yes/no question. Use it like this:

if ask_yesno "Some question here"; then
    echo "Yep!";
else
    echo "Nope :-/";
fi

Next up, to ask the user for a string of text:

# Asks the user for a string of text.
# $1    The window title.
# $2    The question to ask.
# $3    The default text value.
# Returns the answer as a string on the standard output.
ask_text() {
    local title="$1";
    local question="$2";
    local default_text="$3";
    whiptail --title "${title}" --inputbox "${question}" 10 40 "${default_text}" 3>&1 1>&2 2>&3;
    return "$?"; # Not actually needed, but best to be explicit
}

# Asks the user for a password.
# $1    The window title.
# $2    The question to ask.
# $3    The default text value.
# Returns the answer as a string on the standard output.
ask_password() {
    local title="$1";
    local question="$2";
    local default_text="$3";
    whiptail --title "${title}" --passwordbox "${question}" 10 40 "${default_text}" 3>&1 1>&2 2>&3;
    return "$?"; # Not actually needed, but best to be explicit
}

These both work in the same way - it's just that with ask_password it uses asterisks instead of the actual characters the user is typing to hide what they are typing. Use them like this:

new_hostname="$(ask_text "Provisioning step 1 / 4" "Enter a hostname:" "${HOSTNAME}")";
sekret="$(ask_password "Provisioning step 2 / 4" "Enter a sekret:")";

The default value there is of course optional, since in Bash if a variable does not hold a value it is simply considered to be empty.

Finally, I needed a mechanism to ask the user to choose at most 1 value from a predefined list:

# Asks the user to choose at most 1 item from a list of items.
# $1        The window title.
# $2..$n    The items that the user must choose between.
# Returns the chosen item as a string on the standard output.
ask_multichoice() {
    local title="$1"; shift;
    local args=();
    while [[ "$#" -gt 0 ]]; do
        args+=("$1");
        args+=("$1");
        shift;
    done
    whiptail --nocancel --notags --menu "$title" 15 40 5 "${args[@]}" 3>&1 1>&2 2>&3;
    return "$?"; # Not actually needed, but best to be explicit
}

This one is a bit special, as it stores the items in an array before passing it to whiptail. This works because of word splitting, which is when the shell will substitute a variable with its contents before splitting the arguments up. Here's how you'd use it:

choice="$(ask_multichoice "How should I install Consul?" "Don't install" "Client mode" "Server mode")";
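The reason each item is added to the array twice is that, as far as I understand it, whiptail's --menu expects its entries as pairs of a tag (which is what gets printed when an entry is chosen) and a description (which is what gets displayed), and --notags then hides the tag column. Here's the array-building part on its own as a sketch, using a hypothetical build_menu_args function that prints 1 argument per line instead of calling whiptail:

```shell
# Prints the tag / description pairs that would be passed to whiptail,
# 1 argument per line.
# $1..$n    The items to build menu entries for.
build_menu_args() {
    local args=();
    while [[ "$#" -gt 0 ]]; do
        args+=("$1" "$1");  # the tag, and then the description
        shift;
    done
    printf '%s\n' "${args[@]}";
}

build_menu_args "Don't install" "Client mode" "Server mode";
```

Each item appears on 2 consecutive lines with its spaces intact, as every array element is passed as exactly 1 argument.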

As an aside, the underlying mechanics as to why this works is best explained by example. Consider the following:

oops="a value with spaces";

node src/index.mjs --text $oops;

Here, we store the value we want to pass to the --text argument in a variable. Unfortunately, we didn't quote $oops when we passed it to our fictional Node.js script, so the shell actually interprets that Node.js call like this:

node src/index.mjs --text a value with spaces;

That's not right at all! Without the quotes around a value with spaces there, process.argv will actually look like this:

[
    '/usr/local/lib/node/bin/node',
    '/tmp/test/src/index.mjs',
    '--text',
    'a',
    'value',
    'with',
    'spaces'
]

The a value with spaces there has been considered by the Node.js subprocess as 4 different values!

Now, if we include the quotes there instead like so:

oops="a value with spaces";

node src/index.mjs --text "$oops";

...the shell will correctly expand it to look like this:

node src/index.mjs --text "a value with spaces";

... which then looks like this to our Node.js subprocess:

[
    '/usr/local/lib/node/bin/node',
    '/tmp/test/src/index.mjs',
    '--text',
    'a value with spaces'
]

Much better! This is important to understand, as when we start talking about arrays in Bash things start to work a little differently. Consider this example:

items=("an apple" "a banana" "an orange")

/tmp/test.mjs --text "${items[@]}"

Can you guess what process.argv will look like? The result might surprise you:

[
    '/usr/local/lib/node/bin/node',
    '/tmp/test.mjs',
    '--text',
    'an apple',
    'a banana',
    'an orange'
]

Each element of the Bash array has been turned into a separate item - even when we quoted it and the items themselves contain spaces! What's going on here?

In this case, we used [@] when addressing our items Bash array, which causes Bash to expand it like this:

/tmp/test.mjs --text "an apple" "a banana" "an orange"

....so it quotes each item in the array separately. If we forgot the quotes instead like this:

/tmp/test.mjs --text ${items[@]}

...we would get this in process.argv:

[
    '/usr/local/lib/node/bin/node',
    '/tmp/test.mjs',
    '--text',
    'an',
    'apple',
    'a',
    'banana',
    'an',
    'orange'
]

Here, Bash still expands each element separately, but does not quote each item. Because each item isn't quoted, when the command is actually executed, it splits everything a second time!

As a side note, if you want all the items in a Bash array in a single quoted item, you need to use an asterisk * instead of an at-sign @ like so:

/tmp/test.mjs --text "${items[*]}";

....which would yield the following process.argv:

[
    '/usr/local/lib/node/bin/node',
    '/tmp/test.mjs',
    '--text',
    'an apple a banana an orange'
]
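If you don't have Node.js to hand, the same 3 behaviours can be checked in pure Bash. This is just a sketch: count_args is a throwaway helper of my own that prints how many arguments it received, and it assumes IFS has its default value.

```shell
# Prints the number of arguments it was called with.
count_args() {
    echo "$#";
}

items=("an apple" "a banana" "an orange");

count_args "${items[@]}";   # prints 3: 1 argument per array element
count_args ${items[@]};     # prints 6: every element is split on spaces
count_args "${items[*]}";   # prints 1: all elements joined into 1 argument
```

The quoted [@] form is the one you want for passing a list of values on to another command.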

With that, we have a set of functions that make whiptail much easier to use. Once it's finished, I'll write a post on my Bash-based cluster host provisioning script and explain my design philosophy behind it and how it works.

Cluster, Part 12: TLS for Breakfast | Configuring Fabio for HTTPS

Hey there, and happy new year 2022! It's been a little while, but I'm back now with another blog post in my cluster series. In this shorter post, I'm going to show you how I've configured my Fabio load balancer to serve HTTPS.

Before we get started though, I can recommend visiting the series list to check out all the previous parts in this series, as a number of them give useful context for this post.

In the last post, I showed you how to setup certbot / let's encrypt in a Docker container. Building on this, we can now reconfigure Fabio (which we setup in part 9) to take in the TLS certificates we are now generating. I'll be assuming that the certificates are stored on your NFS share you've got setup (see part 8) for this post. In the future I'd love to use Hashicorp Vault for storing these certificates, but as of now I've found Hashicorp Vault to be far too complicated to setup, so I'll be using the filesystem instead.

Configuring Fabio to use HTTPS is actually really quite simple. Open /etc/fabio/fabio.properties for editing, and at the beginning insert a line like this:

proxy.cs = cs=some_name_here;type=file;cert=/absolute/path/to/fullchain.pem;key=/absolute/path/to/privkey.pem

cs stands for certificate store, and this tells Fabio about where your certificates are located. some_name_here is a name you'd like to assign to your certificate store - this is used to reference it elsewhere in the configuration file. /absolute/path/to/fullchain.pem and /absolute/path/to/privkey.pem are the absolute paths to the fullchain.pem and privkey.pem files from Let's Encrypt. These can be found in the live directory in the Let's Encrypt configuration directory in the subdirectory for the domain in question.

Now that Fabio knows about your new certificates, find the line that starts with proxy.addr. In the last tutorial, we configured this to have a value of :80;proto=http. proxy.addr can take a comma-separated list of ports to listen on, so append the following to the existing value:

:443;proto=https;cs=some_name_here;tlsmin=tls12

This tells Fabio to listen on TCP port 443 for HTTPS requests, and also tells it which certificate store to use for encryption. We also set the minimum TLS version supported to TLS 1.2 - but you should set this value to 1 version behind the current latest version (check this page for that). For those who want extra security, you can also add the tlsciphers="CIPHER,LIST" argument too (see the official documentation for more information - cross referencing it with the ssl-config.mozilla.org is a good idea).
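Putting the 2 listeners together, the finished proxy.addr line is just the old value and the new value joined with a comma. As a sketch, here's that joining done in Bash (the listeners and proxy_addr variable names are only for illustration):

```shell
listeners=(
    ":80;proto=http"
    ":443;proto=https;cs=some_name_here;tlsmin=tls12"
);

# "${listeners[*]}" joins the array using the first character of IFS.
# The command substitution runs in a subshell, so IFS is unchanged afterwards.
proxy_addr="$(IFS=,; echo "${listeners[*]}")";

echo "proxy.addr = ${proxy_addr}";
# prints: proxy.addr = :80;proto=http,:443;proto=https;cs=some_name_here;tlsmin=tls12
```

The printed line is what should end up in /etc/fabio/fabio.properties, with some_name_here swapped for the name of your certificate store.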

Now that we have this configured, this should be all you need to enable HTTPS! That was easy, right?

We still have a little more work to do though to make HTTPS the default and to redirect all HTTP requests to HTTPS. We can do this by adding a route to the Consul key-value store under the path fabio/config. You can do this by creating a new key under fabio/config in the web interface, pasting the following in, and saving it:

route add route_name_here example.com:80 https://example.com$path opts "redirect=308"

Alternatively, through the command line:

consul kv put fabio/config/some_name_here 'route add some_name_here example.com:80 https://example.com$path opts "redirect=308"'

No need to restart fabio - it should pick routes up automatically. I have found however that I do need to restart it occasionally if it doesn't pick up some changed routes as fast as I'd like though.

With this, we now have automatic HTTPS setup and configured! Coming up in this series:

Using Caddy as an entrypoint for port forwarding on my router (status: implemented; there's an awesome plugin for single sign-on, and it's amazing in other ways too) - this replaces the role HAProxy was going to play that I mentioned in part 11
Password protecting Docker, Nomad, and Consul (status: on the todo list)
Semi-automatic docker image rebuilding with Laminar CI (status: implemented)


Cluster, Part 11: Lock and Key | Let's Encrypt DNS-01 for wildcard TLS certificates

Welcome one and all to another cluster blog post! Cluster blog posts always take a while to write, so sorry for the delay. As is customary, let's start this post off with a list of all the parts in the series so far:

With that out of the way, in this post we're going to look at obtaining a wildcard TLS certificate using the Let's Encrypt DNS-01 challenge. We want this because you need a TLS certificate to serve HTTPS without lighting everyone's browsers up with warnings like a Christmas tree.

The DNS-01 challenge is an alternate challenge to the default HTTP-01 challenge you may already be familiar with.

Unlike the HTTP-01 challenge, which proves you have access to a single domain by automatically placing a file on your web server, the DNS-01 challenge proves you have control over an entire domain - thus allowing you to obtain a wildcard certificate - which is valid for not only your domain, but all possible subdomains! This should save a lot of hassle - but it's important we keep it secure too.

As with regular Let's Encrypt certificates, we'll also need to ensure that our wildcard certificate we obtain will be auto-renewed, so we'll be setting up a periodic task on our Nomad cluster to do this for us.

If you don't have a Nomad cluster, don't worry. It's not required, and I'll be showing you how to do it without one too. But if you'd like to set one up, I recommend part 7 of this series.

In order to complete the DNS-01 challenge successfully, we need to automatically place a DNS record in our domain. This can be done via an API, if your DNS provider has one and it's supported. Personally, I have the domain name I'm using for my cluster (mooncarrot.space.) with Gandi. We'll be using certbot to perform the DNS-01 challenge, which has a plugin system for different DNS API providers.

We'll be installing the challenge provider we need with pip3 (a Python 3 package manager, as certbot is written in Python), so you can find an up-to-date list of challenge providers over on PyPi here: https://pypi.org/search/?q=certbot-dns

If you don't see a plugin for your provider, don't worry. I couldn't find one for Gandi, so I added my domain name to Cloudflare and followed the setup to change the name servers for my domain name to point at them. After doing this, I can now use the Cloudflare API through the certbot-dns-cloudflare plugin.

With that sorted, we can look at obtaining that TLS certificate. I opt to put certbot in a Docker container here so that I can run it through a Nomad periodic task. This proved to be a useful tool to test the process out though, as I hit a number of snags with the process that made things interesting.

The first order of business is to install certbot and the associated plugins. You'd think that simply doing a sudo apt install certbot certbot-dns-cloudflare would do the job, but you'd be wrong.

As it turns out, it does install that way, but it installs an older version of the certbot-dns-cloudflare plugin that requires you give it your Global API Key from your Cloudflare account, which has permission to do anything on your account!

That's no good at all, because if the key gets compromised an attacker could edit any of the domain names on our account they like, which would quickly turn into a disaster!

Instead, we want to install the latest version of certbot and the associated Cloudflare DNS plugin, which support regular Cloudflare API Tokens, upon which we can set restrictive permissions to only allow it to edit the one domain name we want to obtain a TLS certificate for.

I tried multiple different ways of installing certbot in order to get a version recent enough to get it to take an API token. The way that worked for me was a script called certbot-auto, which you can download from here: https://dl.eff.org/certbot-auto.

Now we have a way to install certbot, we also need the Cloudflare DNS plugin. As I mentioned above, we can do this using pip3, a Python package manager. In our case, the pip3 package we want is certbot-dns-cloudflare - incidentally it has the same name as the outdated apt package that would have made life so much simpler if it had supported API tokens.

Now we have a plan, let's start to draft out the commands we'll need to execute to get certbot up and running. If you're planning on following this tutorial on bare metal (i.e. without Docker), go ahead and execute these directly on your target machine. If you're following along with Docker though, hang on because we'll be wrapping these up into a Dockerfile shortly.

First, let's install certbot:

sudo apt install curl ca-certificates
cd some_permanent_directory;
curl -sS https://dl.eff.org/certbot-auto -o certbot-auto
chmod +x certbot-auto
sudo ./certbot-auto --debug --noninteractive --install-only

Installation with certbot-auto comprises downloading a script and executing it with a bunch of flags. Next up, we need to shoe-horn our certbot-dns-cloudflare plugin into the certbot-auto installation. This requires some interesting trickery, because certbot-auto uses something called virtualenv to install itself and all its dependencies locally into a single directory.

sudo apt install python3-pip
cd /opt/eff.org/certbot/venv
source bin/activate
pip install certbot-dns-cloudflare
deactivate

In short, we cd into the certbot-auto installation, activate the virtualenv local environment, install our dns plugin package, and then exit out of the virtual environment again.

With that done, we can finally add a convenience symlink so that the certbot command is in our PATH:

sudo ln -s /opt/eff.org/certbot/venv/bin/certbot /usr/bin/certbot

That completes the certbot installation process. Then, to use certbot to create the TLS certificate, we'll need an API token as mentioned earlier. Navigate to the API Tokens part of your Cloudflare profile and create one, and then create an INI file in the following format:

# Cloudflare API token used by Certbot
dns_cloudflare_api_token = "YOUR_API_TOKEN_HERE"

...replacing YOUR_API_TOKEN_HERE with your API token of course.
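Since that file holds a secret, it's worth making sure nobody else on the machine can read it (as far as I'm aware, certbot's DNS plugins also grumble about credentials files with loose permissions). Here's a minimal sketch of one way to create it. The function name write_cloudflare_credentials is hypothetical, and the path and token are placeholders:

```shell
# Write a Cloudflare credentials file that only its owner can read.
# $1 = destination path, $2 = API token (both placeholders).
write_cloudflare_credentials() {
    (
        umask 077    # files created inside this subshell get mode 0600
        printf '# Cloudflare API token used by Certbot\ndns_cloudflare_api_token = "%s"\n' "$2" > "$1"
    )
    chmod 600 "$1"   # also tighten a file that already existed
}

# Usage sketch:
#   write_cloudflare_credentials path/to/credentials.ini "YOUR_API_TOKEN_HERE"
```

Bear in mind that passing the token as an argument leaves it in your shell history, so you may prefer to paste it into the file with a text editor and just run the chmod.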

Finally, with all that in place, we can create our wildcard certificate! Do that like this:

sudo certbot certonly --dns-cloudflare --dns-cloudflare-credentials path/to/credentials.ini -d 'bobsrockets.io,*.bobsrockets.io' --preferred-challenges dns-01

It'll ask you a bunch of interactive questions the first time you do this, but follow it through and it should issue you a TLS certificate (and tell you where it stored it). Actually utilising it is beyond the scope of this post - we'll be tackling that in a future post in this series.

For those following along on bare metal, this is where you'll want to skip to the end of the post. Before you do, I'll leave you with a quick note about auto-renewing your TLS certificates. Do this:

sudo certbot renew
sudo systemctl reload nginx postfix

....on a regular basis, replacing nginx postfix with a space-separated list of services that need reloading after you've renewed your certificates. A great way to do this is to set up a cron job.
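To make the cron job idea a bit more concrete, here's a rough sketch of a wrapper you could drop into a script. It assumes it runs as root (so no sudo), that certbot is on the PATH via the symlink from earlier, and that your services are managed by systemd. The function name, the script path, and the schedule in the comment are all made up for illustration:

```shell
# Hypothetical renewal wrapper: run as root, e.g. from root's crontab.
# Arguments are the services to reload afterwards (nginx and postfix are just examples).
renew_and_reload() {
    # Bail out early if the renewal itself failed.
    certbot renew --quiet || return 1
    for service in "$@"; do
        systemctl reload "$service"
    done
}

# Example crontab entry (assumed script path; runs every Monday at 03:17):
#   17 3 * * 1 /usr/local/sbin/renew-certs.sh
# ...where that script defines the function above and ends with:
#   renew_and_reload nginx postfix
```

certbot only renews certificates that are close to expiry, so running this weekly (or even daily) is harmless.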

Sweeping things under the carpet

For the Docker users here, we aren't quite finished yet: We need to package this mess up into a nice neat Docker container where we can forget about it :P

Some things we need to be aware of:

certbot has a number of data directories it interacts with that we need to ensure don't get wiped when Docker ends instances of our container.
Since I'm serving the shared storage of my cluster over NFS, we can't have certbot running as root as it'll get a permission denied error when it tries to access the disk.
While curl and ca-certificates are needed to download certbot-auto, they aren't needed by certbot itself - so we can avoid installing them in the resulting Docker container by using a multi-stage Dockerfile.

To save you the trouble, I've already developed just such a Dockerfile that takes all of this into account. Here it is:

ARG REPO_LOCATION
# ARG BASE_VERSION

FROM ${REPO_LOCATION}minideb AS builder

RUN install_packages curl ca-certificates \
    && curl -sS https://dl.eff.org/certbot-auto -o /srv/certbot-auto \
    && chmod +x /srv/certbot-auto

FROM ${REPO_LOCATION}minideb

COPY --from=builder /srv/certbot-auto /srv/certbot-auto

RUN /srv/certbot-auto --debug --noninteractive --install-only && \
    install_packages python3-pip

WORKDIR /opt/eff.org/certbot/venv
RUN . bin/activate \
    && pip install certbot-dns-cloudflare \
    && deactivate \
    && ln -s /opt/eff.org/certbot/venv/bin/certbot /usr/bin/certbot

VOLUME /srv/configdir /srv/workdir /srv/logsdir

USER 999:994
ENTRYPOINT [ "/usr/bin/certbot", \
    "--config-dir", "/srv/configdir", \
    "--work-dir", "/srv/workdir", \
    "--logs-dir", "/srv/logsdir" ]

A few things to note here:

We use a multi-stage dockerfile here to avoid installing curl and ca-certificates in the resulting docker image.
I'm using minideb as a base image that resides on my private Docker registry (see part 8). For the curious, the script I use to do this is located on my personal git server here: https://git.starbeamrainbowlabs.com/sbrl/docker-images/src/branch/master/images/minideb.
- If you don't have minideb pushed to a private Docker registry, replace minideb with bitnami/minideb in the above.
We set the user and group certbot runs as to 999:994 to avoid the NFS permissions issue.
We define 3 Docker volumes /srv/configdir, /srv/workdir, and /srv/logsdir to contain all of certbot's data that needs to be persisted and use an elaborate ENTRYPOINT to ensure that we tell certbot about them.

Save this in a new directory with the name Dockerfile and build it:

sudo docker build --no-cache --pull --tag "certbot" .;

...if you have a private Docker registry with a local minideb image you'd like to use as a base, do this instead:

sudo docker build --no-cache --pull --tag "myregistry.seanssatellites.io:5000/certbot" --build-arg "REPO_LOCATION=myregistry.seanssatellites.io:5000/" .;

In my case, I do this on my CI server:

laminarc queue docker-rebuild IMAGE=certbot

The hows of how I set that up will be the subject of a future post. Part of the answer is located in my docker-images Git repository, but the other part is in my private continuous integration Git repo (but rest assured I'll be talking about it and sharing it here).

Anyway, with the Docker container built we can now obtain our certificates with this monster of a one-liner:

sudo docker run -it --rm -v /mnt/shared/services/certbot/workdir:/srv/workdir -v /mnt/shared/services/certbot/configdir:/srv/configdir -v /mnt/shared/services/certbot/logsdir:/srv/logsdir certbot certonly --dns-cloudflare --dns-cloudflare-credentials path/to/credentials.ini -d 'bobsrockets.io,*.bobsrockets.io' --preferred-challenges dns-01

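The reason this is so long is that we need to mount the 3 different volumes into the container that contain certbot's data files. Note that the credentials path is resolved inside the container, so the INI file needs to live in one of those mounted directories (I keep mine in the config directory, so that it appears under /srv/configdir). If you're running a private registry, don't forget to prefix certbot there with registry.bobsrockets.com:5000/.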

Don't forget also to update the Docker volume locations on the host here to point at empty directories owned by 999:994.
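Creating those directories might look something like the sketch below. The function name make_certbot_dirs is hypothetical, and the base directory is whatever makes sense on your system (the one in the comment is simply the path from the command above):

```shell
# Create the three host directories that get mounted into the certbot container.
# $1 = base directory (e.g. /mnt/shared/services/certbot in this post).
make_certbot_dirs() {
    for name in workdir configdir logsdir; do
        mkdir -p "$1/$name" || return 1
    done
}

# Then hand them to the user and group the container runs as (needs root):
#   sudo chown -R 999:994 /mnt/shared/services/certbot
```

If you changed the USER line in the Dockerfile, adjust the 999:994 in the chown to match.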

Even if you want to run this on Nomad, I still advise that you execute this manually. This is because the first time you do so it'll ask you a bunch of questions interactively (which it doesn't do on subsequent times).

If you're not using Nomad, this is the point you'll want to skip to the end. As before with the bare-metal users, you'll want to add a cron job that runs certbot renew - just in your case inside your Docker container.
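For that cron job, a small wrapper around docker run saves repeating the volume mounts everywhere. This is only a sketch: the function name certbot_docker and the CERTBOT_DATA variable are mine, the default path is the one used in this post, and it assumes the user running it is allowed to talk to Docker:

```shell
# Base directory on the host holding workdir/, configdir/ and logsdir/.
CERTBOT_DATA="${CERTBOT_DATA:-/mnt/shared/services/certbot}"

# Run the certbot image with the three data volumes mounted.
# Any arguments are passed straight through to certbot.
# There's deliberately no -it here: cron has no terminal to attach to.
certbot_docker() {
    docker run --rm \
        -v "$CERTBOT_DATA/workdir:/srv/workdir" \
        -v "$CERTBOT_DATA/configdir:/srv/configdir" \
        -v "$CERTBOT_DATA/logsdir:/srv/logsdir" \
        certbot "$@"
}

# Usage sketch, e.g. as the last line of a script called from cron:
#   certbot_docker renew
```

As before, swap certbot for the full registry-prefixed image name if you're pulling from a private registry.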

Nomad

For the truly intrepid Nomad users, we still have one last task to complete before our work is done: Auto-renewing our certificate(s) with a Nomad periodic task.

This isn't really that complicated I found. Here's what I came up with:

job "certbot" {
    datacenters = ["dc1"]
    priority = 100
    type = "batch"

    periodic {
        cron = "@weekly"
        prohibit_overlap = true
    }

    task "certbot" {
        driver = "docker"

        config {
            image = "registry.service.mooncarrot.space:5000/certbot"
            labels { group = "maintenance" }
            entrypoint = [ "/usr/bin/certbot" ]
            command = "renew"
            args = [
                "--config-dir", "/srv/configdir/",
                "--work-dir", "/srv/workdir/",
                "--logs-dir", "/srv/logsdir/"
            ]
            # To generate a new cert:
            # /usr/bin/certbot --work-dir /srv/workdir/ --config-dir /srv/configdir/ --logs-dir /srv/logsdir/ certonly --dns-cloudflare --dns-cloudflare-credentials /srv/configdir/__cloudflare_credentials.ini -d 'mooncarrot.space,*.mooncarrot.space' --preferred-challenges dns-01

            volumes = [
                "/mnt/shared/services/certbot/workdir:/srv/workdir",
                "/mnt/shared/services/certbot/configdir:/srv/configdir",
                "/mnt/shared/services/certbot/logsdir:/srv/logsdir"
            ]
        }
    }
}

If you want to use it yourself, replace the various references to things like the private Docker registry and the Docker volumes (which require "docker.volumes.enabled" = "True" in client → options in your Nomad agent configuration) with values that make sense in your context.

From inspecting logs and watching TLS certificate expiry times, I have some confidence that this is working as intended. Save it to a file called certbot.nomad and then run it:

nomad job run certbot.nomad
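Since a weekly periodic job won't actually do anything until its next scheduled slot, it can be handy to kick off a run by hand to check everything is wired up. Here's a sketch that validates the job file, registers it, and then forces an immediate launch. The function name is hypothetical; the job name is the one declared at the top of the job file:

```shell
# $1 = job file, $2 = job name as declared inside it.
run_and_force_periodic() {
    nomad job validate "$1" \
        && nomad job run "$1" \
        && nomad job periodic force "$2"
}

# Usage sketch:
#   run_and_force_periodic certbot.nomad certbot
```

After that, the allocation logs in the Nomad web interface should show what certbot made of it.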

Conclusion

If you've made it this far, congratulations! We've installed certbot and used the Cloudflare DNS plugin to obtain a wildcard TLS certificate. For the more adventurous, we've packaged it all into a Docker container. Finally for the truly intrepid we implemented a Nomad periodic job to auto-renew our TLS certificates.

Even if you don't use Docker or Nomad, I hope this has been a helpful read. If you're interested in the rest of my cluster build I've done, why not go back and start reading from part 1? All the posts in my cluster series are tagged with "cluster" to make them easier to find.

Unfortunately, I haven't managed to determine a way to import TLS certificates into Hashicorp Vault automatically, as I've stalled a bit on the Vault front (permissions and policies are wildly complicated), so it's unlikely I'll be touching Vault in future posts any time soon (if anyone has an alternative that is simpler and easier to understand / configure, please comment below).

Despite this, in future posts I've got a number of topics lined up I'd like to talk about:

Configuring Fabio (see part 9) to serve HTTPS and force-redirect from HTTP to HTTPS (status: implemented)
Implementing HAProxy to terminate port forwarding (status: initial research)
Password protecting the private docker registry, Consul, and Nomad (status: on the todo list)
Semi-automatic docker image rebuilding with Laminar CI (status: implemented)

In the meantime, please comment below if you liked this post, are having issues, or have any suggestions. I'd love to hear if this helped you out!

Sources and Further Reading

Let's Encrypt Challenge Types

consulstatus: public status pages drawn from Consul

In my cluster series of blog posts, I've been talking about how I've been building my cluster from scratch. Now that I've got it into some sorta stable state (though I'm still working on it), one of the things I discovered might be helpful for other users of my cluster is a status page.

(Above: The logo for consulstatus. Consulstatus is written by me and not endorsed by Hashicorp or the Consul project.)

To this end, I ended up implementing a quick solution to this problem in PHP. Here's a screenshot of what it looks like:

Screenshot of what consulstatus looks like. See explanation below

The colour scheme changes depending on your browser's prefers-color-scheme setting. The circles to the right of each service are either green (indicating no issues), yellow (some problems are occurring), or red (it's down and everything's terrible).

As the name suggests, it's backed by Hashicorp Consul (which I blogged about in cluster, part 6: superglue service discovery). I recommend reading my blog post about it, but in short Consul allows you to register services that it should keep track of, and checks that define whether said services are healthy or not.

It supports a TOML config file that allows you to specify where Consul is, along with the names of the services you'd like to display:

title = "Cluster status page"

[consul]

base_url = "http://consul.service.bobsrockets.com:8500"


services = [
    "some_service",
    "another_service"
    # .....
]
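To get a feel for the data a status page like this has to work with, you can ask Consul's HTTP API for the health checks of a service yourself. To be clear, this isn't consulstatus's own code (that's PHP), just a sketch using curl against the /v1/health/checks/ endpoint. The function name is made up, and I'm assuming your Consul agent answers plain HTTP at the base_url from the config above:

```shell
# Fetch the health checks Consul holds for one service, as JSON.
# $1 = Consul base URL, $2 = service name.
consul_service_checks() {
    curl -sS "$1/v1/health/checks/$2"
}

# Usage sketch (piping through jq, if you have it, to list just the statuses):
#   consul_service_checks http://consul.service.bobsrockets.com:8500 some_service | jq -r '.[].Status'
```

Each check reports a status of passing, warning, or critical, which presumably maps fairly naturally onto green, yellow, and red circles.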

The status page is designed to be as simple to understand as possible, so that anyone (even those who aren't technically skilled) can get an idea as to what is working and what isn't at any given time.

So far, it's been moderately successful. The status page itself is stable and behaves as expected (which is always a plus), and it does reflect the status of the services in question.

I did initially toy with the idea of exposing more information about the specific checks in Consul that have failed, but then I thought that I'd be then doing what the Consul web interface already does, which seems a bit pointless.

Instead, I decided to keep it rather minimalist, such that it could be exposed publicly (in theory, though my instance is only accessible on my local LAN) in a way that the main Consul web interface really can't.

Moving forwards, I'm quite happy with consulstatus as-is, so if I make any changes they aren't likely to be too drastic. I'd like to look at adding a description to each service so that it's more obvious what it is, or maybe have display names that are shown instead of the Consul service names.

I'd also maybe like to display an icon to the left of each service as well to further help with visual identification and understanding, and perhaps allow grouping services too.

Out of scope though is logging service status history. That can be done elsewhere if desired (and I don't particularly have a need for that) - and PHP isn't particularly suited to that anyway.

Found this interesting? Got a suggestion? Comment below!

Cluster, Part 10: Dockerisification | Writing Dockerfiles

Hey there - welcome to 2021! I'm back with another cluster post. In double digits too! I think this is the longest series yet on my blog. Before we start, here's a list of all the posts in the series so far:

We've got a pretty cool setup going so far! With Nomad for task scheduling (part 7), Consul to keep track of what's running where (part 6), and wesher keeping communications secured (part 4, although defence in depth says that we'll be returning later to shore up some stuff here) we have a solid starting point from which to work. And it's only taken 9 blog posts to get to this point :P

In this post, we'll be putting all our hard work to use by looking at the basics of writing Dockerfiles. It's taken me quite a while to get my head around them, so I want to take a moment here to document some of the things I've learnt. A few other things that I want to talk about soon are Hashicorp Vault (it's still giving me major headaches trying to understand the Nomad integration though, so this may be a while), obtaining TLS certificates, and tying in with the own your code series by showing off the Docker image management script setup I have that I've worked into my Laminar CI instance, which makes it easy to rebuild images and all their dependants.

Anyway, Dockerfiles. First question: what? Dockerfiles are essentially a file containing a domain-specific language that defines how a Docker image can be built. They are usually named Dockerfile. Here I use the term image and not container:

Image: A Docker image that contains a bunch of files and directories that can be run
Container: A copy of an image that is currently running on a host system.

In short: A container is a running image, and a Docker image is the bit that a container spins up from.

Second question: why? The answer is a few different reasons. Although it adds another layer of indirection and complication, it also allows us to square applications away such that we don't care about what host they run on (too much).

A great example here would be a static file web server. In our case, this is particularly useful because Fabio - as far as I know - isn't actually capable of serving files from disk. Personally I have a fork of a rather nice dashboard I'd like to have running for my cluster too, so I found that it fits perfectly to test the waters.

Next question: how? Well, let's break the process down:

Install Node.js
Install the serve npm package

Thankfully, I've recently packaged Node.js in my apt repository (finally! It's only taken me multiple years.....). Since we might want to build lots of different Node.js based container images, it makes sense to make Node.js its own separate container. I'm also using my apt repository in other container images too which don't necessarily need Node.js, so I've opted to put my apt repository into my base image (If I haven't mentioned it already, I'm using minideb as my base image - which I build with a patch to make it support Raspbian - which is now called Raspberry Pi OS. It's confusing).

To better explain the plan, let's use a diagram:

(Above: A diagram I created. Link to editing file - don't forget this blog is licenced under CC-BY-SA.)

Docker images are always based on another Docker image. Our node-serve Docker image we intend to create will be based on a minideb-node Docker image (which we'll also be creating), which itself will be based on the minideb base image. Base images are special, as they don't have a parent image. They are usually imported via a .tar.gz image for example, but that's a story for another time (also for another time are images based on scratch, a special image that's completely empty).

We'll then push the final node-serve Docker image to a Docker registry. I'm running my own private Docker registry, but you can use the Docker Hub or setup your own private Docker registry.

With this in mind, let's start with a Docker image for Node.js:

ARG REPO_LOCATION

FROM ${REPO_LOCATION}minideb

RUN install_packages libatomic1 nodejs-sbrl

Let's talk about each of the above commands in turn:

ARG REPO_LOCATION: This brings in an argument which is specified at build time. Here we want to allow the user to specify the location of a private Docker registry to pull the base (or parent) image from to begin the build process with.
FROM ${REPO_LOCATION}minideb: This specifies the base (or parent) image to start the build with.
RUN install_packages libatomic1 nodejs-sbrl: The RUN command runs the specified command inside the Docker container, saving a new layer in the process (more on those later). In this case, we call the install_packages command, which is a helper script provided by minideb to make package installation easier.

Pretty simple! This assumes that the minideb base image you're using has my apt repository set up, which may not be the case. To this end, we'd like to automatically set that up. To do this, we'll need to use an intermediate image. This took me some time to get my head around, so if you're unsure about anything, please comment below.

Let's expand on our earlier attempt at a Dockerfile:

ARG REPO_LOCATION

FROM ${REPO_LOCATION}minideb AS builder

RUN install_packages curl ca-certificates

RUN curl -o /srv/sbrl.asc https://apt.starbeamrainbowlabs.com/aptosaurus.asc

FROM ${REPO_LOCATION}minideb

COPY --from=builder /srv/sbrl.asc /etc/apt/trusted.gpg.d/sbrl-aptosaurus.asc

RUN echo "deb https://apt.starbeamrainbowlabs.com/ /" > /etc/apt/sources.list.d/sbrl.list && \
    install_packages libatomic1 nodejs-sbrl;

This one is more complicated, so let's break it down. Here, we have an intermediate Docker image (which we name builder via the AS builder bit at the end of the 1st FROM) in which we install curl (the 1st RUN command there) and use it to download the signing key for my apt repository (the 2nd RUN command). This is followed by a second image, into which we copy the file we downloaded in the first Docker image and place it in a specific place (the COPY directive).

Docker always reads Dockerfiles from top to bottom and executes them in sequence, so it will assume that the last image created is the final one - i.e. from the last FROM directive. Every FROM directive starts afresh from a brand-new copy of the specified parent image.

We've also expanded the RUN directive at the end of the file there to echo the apt sources list file out for my apt repository. We've done it like this in a single RUN command and not 2, because every time you add another directive to a Dockerfile (except ARG and FROM), it creates a new layer in the resulting Docker image. Minimising the number of layers in a Docker image is important for performance, hence the obscurity here in chaining commands together. To build our new Dockerfile, save it to a new empty directory. Then, execute this:

cd path/to/directory/containing_the_dockerfile;
docker build --pull --tag "minideb-node" .

If you're using a private registry, add --build-arg "REPO_LOCATION=registry.example.com:5000/" just before the . there at the end of the command and prefix the tag with registry.example.com:5000/. If you're developing a new Docker image and having trouble with the cache (Docker caches the result of directives when building images), add --no-cache.

Then, push it to the Docker registry like so:

docker push "minideb-node"

Again, prefix minideb-node there with registry.example.com:5000/ should you be using a private Docker registry.
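Remembering to add the registry prefix in three different places gets tedious, so here's a sketch of a couple of helper functions that do it for you. Both function names are my own invention. REPO_LOCATION here is an ordinary shell variable that mirrors the build argument of the same name: leave it empty for local images, or set it to something like registry.example.com:5000/ (trailing slash included):

```shell
# Print the full image reference for a bare image name.
# REPO_LOCATION is empty for local images, or e.g. "registry.example.com:5000/".
image_ref() {
    printf '%s%s\n' "${REPO_LOCATION:-}" "$1"
}

# Build the Dockerfile in the current directory and push the result.
# $1 = bare image name, e.g. minideb-node.
build_and_push() {
    docker build --pull --build-arg "REPO_LOCATION=${REPO_LOCATION:-}" --tag "$(image_ref "$1")" . \
        && docker push "$(image_ref "$1")"
}

# Usage sketch:
#   REPO_LOCATION="registry.example.com:5000/"
#   cd path/to/directory/containing_the_dockerfile && build_and_push minideb-node
```

The push only happens if the build succeeded, which avoids accidentally pushing a stale image with the same tag.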

Now, you should be able to start an interactive session inside your new Docker container:

docker run -it --rm minideb-node

As before, prefix minideb-node there with registry.example.com:5000/ if you're using a private Docker registry.

Now that we've got our Docker image for Node.js, we can write another Dockerfile for serve, our static file HTTP server. Let's take a look:

ARG REPO_LOCATION

FROM ${REPO_LOCATION}minideb-node

RUN npm install --global serve && rm -rf "$(npm get cache)";

VOLUME [ "/srv" ]

USER 80:80

ENV NODE_ENV production
WORKDIR /srv
ENTRYPOINT [ "serve", "-l", "5000" ]

This looks similar to the previous Dockerfile, but with a few extra bits added on. Firstly, we use a RUN directive to install the serve npm package and delete the NPM cache in a single command (since we don't want the npm cache sticking around in the final Docker image).

We then use a VOLUME declaration to tell Docker that we expect /srv to have a volume mounted to it. A volume here is a directory from the host system that will be mounted into the Docker container before it starts running. In this case, it's the web root that we'll be serving files from.

A USER directive tells Docker what user and group IDs we want to run all subsequent commands as. This is important, as it's a bad idea to run Docker containers as root.

The ENV directive there is just to tell Node.js it should run in production mode. Some Node.js applications have some optimisations they enable when this environment variable is set.

The WORKDIR directive defines the current working directory for future commands. It functions like the cd command in your terminal or command line. In this case, the serve npm package always serves from the current working directory - hence we set the working directory here.

Finally, the ENTRYPOINT directive tells Docker what command to execute by default. The ENTRYPOINT can get quite involved and complex, but we're keeping it simple here and telling it to execute the serve command (provided by the serve npm package, which we installed globally earlier in the Dockerfile). We also specify the port number we want serve to listen on with -l 5000 there.

That completes the Dockerfile for the serve npm package. Build it as before, and then you should be able to run it like so:

docker run -it --rm -v /absolute/path/to/local_dir:/srv node-serve

As before, prefix node-serve with the address of your private Docker registry if you're using one. The -v bit above defines the Docker volume that mounts the webroot directory inside the Docker container.

Then, you should be able to find the IP address of the Docker container and enter it into your web browser to connect to the running server!

The URL should be something like this: http://IP_ADDRESS_HERE:5000/.
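If you're wondering how to find that IP address, docker inspect can tell you. Here's a small sketch, with a function name I've made up, that prints the address (or addresses) Docker assigned to a running container:

```shell
# Print the IP address(es) Docker assigned to a running container.
# $1 = container name or ID, as shown by `docker ps`.
container_ip() {
    docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}

# Usage sketch:
#   docker ps                      # find the container's name or ID
#   container_ip CONTAINER_NAME_HERE
```

Alternatively, adding -p 5000:5000 to the docker run command publishes the port on the host, in which case http://localhost:5000/ should work without hunting for the container's address at all.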

If you're not running Docker on the same machine as your web browser is running on, then you'll need to do some fancy footwork to get it to display. It's at this point that I write a Nomad job file, and wire it up to Fabio my load balancer.

In the next post, we'll talk more about Fabio. We'll also look at the networking and architecture that glues the whole system together. Finally, we'll look at setting up HTTPS with Let's Encrypt and the DNS-01 challenge (which I found relatively simple - but only once I'd managed to install a new enough version of certbot - which was a huge pain!).

Cluster, Part 9: The Border Between | Load Balancing with Fabio

Hello again! It's been a while since the last one (mainly since I've been unsure about a few architectural things), but I'm now ready to continue writing about my setup. Before we continue, here's a refresher of everything we've done so far:

In this post, we're going to look at tying off our primary pipeline. So far, we've got job scheduling with Nomad, (superglue!) service discovery with Consul, and shared storage backed with NFS (although I'm going to revisit this eventually), with everything underpinned by a WireGuard mesh VPN with wesher.

In order to allow people to interact with services that are running on the cluster, we need something that will translate from the weird and strange world of anything running somewhere anywhere, and everywhere in-between into something that makes sense from an outside perspective. We want to have a single gateway by which we can control and manage access.

It is for these purposes that we're going to add Fabio to our stack. Its configuration is backed by Consul, and it is relatively simple and easy to understand. Having the config backed by Consul nets us multiple benefits:

It can run anywhere on the cluster we like in a pinch
We can configure new routes directly from a Nomad job spec file (although we still need to update the Unbound config)
The configuration of Fabio gains additional data redundancy by being stored on multiple nodes in the cluster

Like in previous parts of this series, Fabio isn't available to install with apt directly, so I've packaged it into my apt repository. If you haven't yet set up my apt repository, up-to-date instructions on how to do so can be found at the top of its main page - just click the aforementioned link (I'm not going to include instructions here, as they may go out of date at a later time).

Once you've set up my apt repository (or downloaded the Fabio binary manually, though I don't recommend that as it's more difficult to keep up-to-date), we can install Fabio like so:

sudo apt install fabio

This should be done on your primary (controller) node in your cluster. You can also do it on a secondary node too if you like to increase redundancy. To do this, just follow these instructions on both nodes one at a time. I'll be doing this soon myself: I've just been distracted with other things :P

Next, we need a service file. For systemd users (I'm using Raspbian at the moment), I have an apt package:

sudo apt install fabio-systemd

With this installed, we need to create a (very) minimal configuration file. Here it is:

proxy.addr = :80;proto=http
proxy.auth = name=admin;type=basic;file=/etc/fabio/auth.admin.htpasswd

Pretty short, right? This does 2 things:

Tells Fabio to listen on port 80 for HTTP requests (we'll be tackling HTTPS in a separate post - we need Vault for that)
Tells Fabio about the admin auth realm and where it can find the .htpasswd file that corresponds with it

Fabio's password authentication uses HTTP Basic Auth - which is insecure over unencrypted HTTP. We'll be working towards improving the situation here, and I'll insert a reminder to change all your passwords when we get there, but there are quite a number of obstacles between here and there that we have to deal with first.

With this in mind, take a copy of the above Fabio config file and write it to /etc/fabio/fabio.properties. Next, we need to generate that htpasswd file we reference in the config file. There are many tools out there that can be used for this purpose - for example the htpasswd tool in the apache2-utils package (the -c flag creates the file - leave it off if you're adding another user to an existing file):

sudo htpasswd -c /etc/fabio/auth.admin.htpasswd username

I like this authentication setup for Fabio, as it allows one to have a single easily configurable set of realms for different purposes if desired.

If you're setting up Fabio on multiple servers, you'll want to put your config file in your shared NFS storage and create a symlink at /etc/fabio/fabio.properties instead. Do that like this:

sudo ln -s /mnt/shared/config/fabio/fabio.properties /etc/fabio/fabio.properties

....update the /mnt/... path accordingly. Don't forget to adjust the /etc/fabio/auth.admin.htpasswd path in fabio.properties as well.
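If it helps, here's a minimal sketch of the symlink step as a function. Note that ln -s takes the real file first and the name of the link second. link_shared_config is a hypothetical helper name, and both paths are only examples.

```shell
#!/usr/bin/env bash
# Sketch only: link a config file kept on shared storage into /etc/fabio.
# link_shared_config is a hypothetical helper; both paths are examples.
link_shared_config() {
    shared="$1"   # the real file, e.g. /mnt/shared/config/fabio/fabio.properties
    link="$2"     # where Fabio looks, e.g. /etc/fabio/fabio.properties
    [ -f "$shared" ] || { echo "missing: $shared" >&2; return 1; }
    # ln -s takes the existing target first, then the link to create
    ln -sfn "$shared" "$link"
}

# Usage (run as root):
#   link_shared_config /mnt/shared/config/fabio/fabio.properties /etc/fabio/fabio.properties
```

The check at the start means that it refuses to create a dangling symlink if the NFS share isn't mounted yet.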

Now that we've got the configuration file out of the way, we can start Fabio for the first time! Do that like this:

sudo systemctl start fabio.service
sudo systemctl enable fabio.service

Don't forget to punch a hole in the firewall:

sudo ufw allow 80/tcp comment fabio

Fabio is running - but it's not particularly useful, as we haven't configured any routes! Let's add some routes now. The first few routes we're going to add will be manual routes, which will allow us to tell Fabio about a static route we want it to add to its routing table.

Fabio itself actually has a web interface, which will make a good first target for testing out our new cool toy. I mentioned earlier that Fabio gets its configuration from Consul - and it's now that we're going to take advantage of that. Consul isn't just a service discovery tool you see - it's a shared configuration manager too via a fancy hierarchical distributed key-value data store.

In this datastore Fabio looks in particular at the keys in the fabio directory. Create a new key under here with the Consul CLI like so:

consul kv put "fabio/fabio" 'route add fabio fabio.bobsrockets.com/ http://NODE_NAME.node.mooncarrot.space:9998 tags "mission-control" opts "auth=admin"'

Replace NODE_NAME with the name of the node you're running Fabio on, and bobsrockets.com with a domain name you've bought (likewise for mooncarrot.space, which is the root domain of my cluster). Once done, update your DNS config to point fabio.bobsrockets.com to the node that's running Fabio (you might want to refer back to my earlier post on Unbound - don't forget to restart unbound with sudo systemctl restart unbound).
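Since the quoting in that command is a bit fiddly, here's a sketch of a helper that assembles the route definition for you. fabio_route is a hypothetical name of my own (it isn't part of Fabio or Consul), and it assumes you want the same mission-control tag and admin realm as above.

```shell
#!/usr/bin/env bash
# Sketch only: assemble a Fabio manual route definition.
# fabio_route is a hypothetical helper, not part of Fabio or Consul.
fabio_route() {
    name="$1"; src="$2"; dst="$3"
    # Make sure the source ends in a slash (e.g. fabio.bobsrockets.com/)
    case "$src" in
        */) ;;
        *) src="$src/" ;;
    esac
    printf 'route add %s %s %s tags "mission-control" opts "auth=admin"\n' \
        "$name" "$src" "$dst"
}

# Usage, assuming a node called NODE_NAME:
#   consul kv put "fabio/fabio" "$(fabio_route fabio fabio.bobsrockets.com http://NODE_NAME.node.mooncarrot.space:9998)"
```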

When you have your DNS server updated, you should be able to point your browser at fabio.bobsrockets.com. No reloading of Fabio is needed - it picks up changes dynamically and automagically! It should prompt you for your password, and then you should see the Fabio web interface. It should look something like this:

The Fabio web interface

As you can see, I've got a number of services running - including a few that I'm going to be blogging about soon-ish, such as Vault (but I haven't yet learnt how to use it :P) and Docker Registry UI (which is useful but has some issues - I'm going to see if HTTPS helps fix some of them up as I'm getting some errors in the dev tools about the SubtleCrypto API, which is only available in secure contexts).

Those services with IP addresses as the destination are defined through Nomad, and auto-update based on the host upon which they are running.

In the web interface you can click on overrides on the top bar to view and edit the configuration for the static routes you've got configured. You can't create new ones though, which is a shame.

Using the same technique as described above, you can create manual routes for Nomad and Consul - as they have web interfaces too! If you haven't already, you'll need to enable them with ui = true in the Nomad and Consul server configuration files respectively. For example, you could use these definitions:

route add nomad nomad.seanssatellites.io/ http://nomad.service.seanssatellites.io:4646 tags "mission-control" opts "auth=admin"

route add consul consul.billsboosters.space/ http://consul.service.billsboosters.space:8500 tags "mission-control" opts "auth=admin"

If you do the Consul one first, you can use the web interface to create the definition for Nomad :D

It's perhaps worth making a quick note of some parts of the above route definitions:

opts "auth=admin": This bit activates HTTP Basic Auth with the specified realm
consul.billsboosters.space/: This is the domain through which outside users will access the service. The trailing slash is very important.

From here, the last item on the list for this post is automatic routes via Nomad jobs. Since the Docker registry is the only job we've got running on Nomad so far, let's use that as an example. Adding a Fabio route in this manner requires 3 steps:

Find the service stanza in your Docker Registry Nomad job file, and edit the tags list to include a pair of tags something like urlprefix-registry.tillystelescopes.fr/ and auth=admin (again, the trailing slash is important, and the urlprefix- bit instructs Fabio that it's the domain name to route traffic from to the container).
Save the edits to the Nomad job file and re-run it with nomad job run path/to/file.nomad
Update your DNS with a new record pointing registry.tillystelescopes.fr at the IP address(es) of the node(s) running Fabio
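For step 1, a tiny sketch like this can generate the pair of tags to paste into the tags list, so that the trailing slash doesn't get forgotten. fabio_tags is a hypothetical helper name, and the domain is just the example from the steps above.

```shell
#!/usr/bin/env bash
# Sketch only: print the pair of service tags described in the steps above.
# fabio_tags is a hypothetical helper; it is not part of Fabio or Nomad.
fabio_tags() {
    host="${1%/}"          # strip a trailing slash if one was given...
    realm="${2:-admin}"
    # ...then always add exactly one back
    printf '"urlprefix-%s/", "auth=%s"\n' "$host" "$realm"
}

# Usage:
#   fabio_tags registry.tillystelescopes.fr
# prints: "urlprefix-registry.tillystelescopes.fr/", "auth=admin"
```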

Also pretty simple to get used to, right? From here, step 4 of the official quickstart guide is useful. It explains about the different service tags (like the urlprefix- and auth=admin ones we created above) that are supported. Apparently raw TCP forwarding is also supported - though personally I'm waiting eagerly on UDP forwarding myself for some services I would like to run.

The rest of the Fabio docs are a bit of a mess, but I've found them more understandable than those of Traefik - the solution I investigated before turning to Fabio upon a recommendation from someone over in the r/selfhosted subreddit in frustration (whoever says "Traefik is simple!" is lying - I can't make sense of anything - it might as well be written in hieroglyphs.....).

Looking into the future, our path is diverging into 2 clear routes:

Getting services up and running on our new cluster
Securing said cluster to avoid attack

While relatively separate goals, they do intertwine at intervals. Moving forwards, we're going to be oscillating between these 2 goals. Likely topics include Vault (though it'll take several blog posts to realise any benefit from it at this point), and getting some Docker container infrastructure set up.

Speaking of Docker container infrastructure, if anyone has any ideas as to how to auto-rebuild docker containers and/or auto-restart Nomad jobs to keep them up-to-date, I'd love to know in a comment below. I'm currently scratching my head over that one....

Found this interesting? Got an idea that would improve on my setup? Confused about something? Comment below!

Cluster, Part 8: The Shoulders of Giants | NFS, Nomad, Docker Registry

Welcome back! It's been a bit of a while, but now I'm back with the next part of my cluster series. As a refresher, here's a list of all the parts in the series so far:

In this one, we're going to look at running our first job on our Nomad cluster! If you haven't read the previous posts in this series, you'll probably want to go back and read them now, as we're going to be building on the infrastructure we've set up and the groundwork we've laid in the previous posts in this series.

Before we get to that though, we need to sort out shared storage - as we don't know which node in the cluster tasks will be running on. In my case, I'll be setting up NFS. This is hardly the only solution to the issue though - other options include:

Gluster
Ceph

If you're going to choose NFS like me though, you should be warned that it's neither encrypted nor authenticated. You should ensure that NFS is only run on a trusted network. If you don't have a trusted network, use the WireGuard Mesh VPN trick in part 4 of this series.

NFS: Server

Setting up a server is relatively easy. Simply install the relevant package:

sudo apt install nfs-kernel-server

....edit /etc/exports to look something like this:

/mnt/somedrive/subdirectory 10.1.2.0/24(rw,async,no_subtree_check)

/mnt/somedrive/subdirectory is the directory you'd like clients to be able to access, and 10.1.2.0/24 is the IP range that should be allowed to talk to your NFS server.
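If you have more than 1 directory to export, a sketch like the following can build the lines for you. exports_line is a hypothetical helper name. The exportfs -ra step in the usage comment isn't part of my original instructions: it's the standard way to ask the NFS server to re-read /etc/exports, but treat it as an assumption about your setup.

```shell
#!/usr/bin/env bash
# Sketch only: build an /etc/exports entry with the same options as above.
# exports_line is a hypothetical helper name.
exports_line() {
    dir="$1"; range="$2"
    printf '%s %s(rw,async,no_subtree_check)\n' "$dir" "$range"
}

# Usage (as root), followed by asking the NFS server to re-read its exports:
#   exports_line /mnt/somedrive/subdirectory 10.1.2.0/24 >> /etc/exports
#   exportfs -ra
```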

Next, open up the relevant ports in your firewall (I use UFW):

sudo ufw allow nfs

....and you're done! Pretty easy, right? Don't worry, it'll get harder later on :P

NFS: Client

The client, in theory, is relatively straightforward too. This must be done on all nodes in the cluster - except the node that's acting as the NFS server (although having the NFS server as a regular node in the cluster is probably a bad idea). First, install the relevant package:

sudo apt install nfs-common

Then, update /etc/fstab and add the following line:

10.1.2.10:/mnt/somedrive/subdirectory   /mnt/shared nfs auto,nofail,noatime,intr,tcp,bg,_netdev 0   0

Again, 10.1.2.10 is the IP of the NFS server, and /mnt/somedrive/subdirectory must match the directory exported by the server. Finally, /mnt/shared is the location that we're going to mount the directory from the NFS server to. Speaking of, we should create that directory:

sudo mkdir /mnt/shared

I have yet to properly tune the options there on both the client and the server. If I find that I have to change anything here, I'll come back and edit this, and mention the change in a future post.

From here, you should be able to mount the NFS share like so:

sudo mount /mnt/shared

You should see the files from the NFS server located in /mnt/shared. You should check to make sure that this auto-mounts it on boot too (that's what the auto and _netdev are supposed to do).

If you experience issues on boot (like me) and see something like this buried in /var/log/syslog:

mount[586]: mount.nfs: Network is unreachable

....then we can quickly hack around this by creating a script in the directory /etc/network/if-up.d. Something like this should fix the issue:

#!/usr/bin/env bash
mount /mnt/shared

Save this to /etc/network/if-up.d/cluster-shared-nfs for example, not forgetting to mark it as executable:

sudo chmod +x /etc/network/if-up.d/cluster-shared-nfs
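Scripts in if-up.d may well run more than once (for example once per network interface), so a slightly more defensive variant could skip the mount if the share is already there. This is only a sketch: MOUNTS_FILE and MOUNT_CMD are hypothetical knobs that exist purely so that the logic can be tried out without root, and I'm assuming a Linux system with /proc/mounts.

```shell
#!/usr/bin/env bash
# Sketch only: a variant of the if-up.d script that skips the mount if the
# target is already mounted. MOUNTS_FILE and MOUNT_CMD are hypothetical
# overrides for testing; the defaults are what a real system would use.
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"
MOUNT_CMD="${MOUNT_CMD:-mount}"

is_mounted() {
    # Field 2 of /proc/mounts is the mount point
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "$MOUNTS_FILE"
}

mount_shared() {
    target="${1:-/mnt/shared}"
    if is_mounted "$target"; then
        return 0
    fi
    $MOUNT_CMD "$target"
}

# In the real script, the last line would simply be:
#   mount_shared /mnt/shared
```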

Alternatively, there's autofs that can do this more intelligently if you prefer.

First Nomad Job: Docker Registry

Now that we've got shared storage online, it's time for the big moment. We're finally going to start our very first job on our Nomad cluster!

It's going to be a Docker registry, and in my very specific case I'm going to be marking it as insecure (gasp!) because it's only going to be accessible from the WireGuard VPN - which I figure provides the encryption and authentication for us to get started reasonably simply without jumping through too many hoops. I'll probably revisit this in a later post to tighten things up.

Tasks on a Nomad cluster take the form of a Nomad job file. These can be written in JSON or HCL (Hashicorp Configuration Language). I'll be using HCL here, because it's easier to read and we're not after machine legibility yet at this stage.

Nomad job files work a little bit like Nginx config files, in that they have nested sequences of blocks in a hierarchical structure. They loosely follow the following pattern:

job > group > task

The job is the top-level block that contains everything else. tasks are the items that actually run on the cluster - e.g. a Docker container. groups are a way to logically group tasks in a job, and are not required as far as I can tell (but we'll use one here anyway just for illustrative purposes). Let's start with the job spec:

job "registry" {
    datacenters = ["dc1"]
    # The Docker registry *is* pretty important....
    priority = 80

    # If this task was a regular task, we'd use a constraint here instead & set the weight to -100
    affinity {
        attribute   = "${node.class}"
        value       = "controller"
        weight      = 100
    }

    # .....

}

This defines a new job called registry, and it should be pretty straightforward. We don't need to worry about the datacenters definition there, because we've only got the 1 (so far?). We set a priority of 80, and get the job to prefer running on nodes with the controller class (though I observe that this hasn't actually made much of a difference to Nomad's scheduling algorithm at all).

Let's move on to the real meat of the job file: the task definition!

group "main" {
    task "registry" {
        driver = "docker"

        config {
            image = "registry:2"
            labels { group = "registry" }

            volumes = [
                "/mnt/shared/registry:/var/lib/registry"
            ]

            port_map {
                registry = 5000
            }
        }

        resources {
            network {
                port "registry" {
                    static = 5000
                }
            }
        }

        # .......
    }
}

There's quite a bit to unpack here. The task itself uses the Docker driver, which tells Nomad to run a Docker container.

In the config block, we define the Docker driver-specific settings. The docker image we're going to run is registry:2 where registry is the image name, and 2 is the tag. This will be automatically pulled from the Docker Hub. Future tasks will pull docker images from our very own private Docker registry, which we're in the process of setting up :D

We also mount a directory into the Docker container to allow it to persist the images that we push to it. This is done through a volume, which is the Docker word for bind-mounting a specific directory on the host system into a given location inside the guest container. For me I'm (currently) going to store the Docker registry data at /mnt/shared/registry - you should update this if you want to store it elsewhere. Remember that this needs to be a location on your shared storage, as we don't know which node in the cluster the Docker registry is going to run on in advance.

The port_map allows us to tell Nomad the port(s) that our service inside the Docker container listens on, and attach a logical name to them. We can then expose them in the resources block. In this specific case, I'm forcing Nomad to statically allocate port 5000 on the host system to point to port 5000 inside the container, for reasons that will become apparent later. This is done with the static keyword there. If we didn't do this, Nomad would allocate a random port number (which is normally what we'd want, because then we can run lots of copies of the same thing at the same time on the same host).

The last block we need to add to complete the job spec file is the service block. With a service block, Nomad will inform Consul that a new service is running, which will then in turn allow us to query it via DNS.

service {
    name = "${TASK}"
    tags = [ "infrastructure" ]

    address_mode = "host"
    port = "registry"
    check {
        type        = "tcp"
        port        = "registry"
        interval    = "10s"
        timeout     = "3s"
    }

}

The service name here is pulled from the name of the task. We tell Consul about the port number by specifying the logical name we assigned to it earlier.

Finally, we add a health check, to allow Consul to keep an eye on the health of our Docker registry for us. This will appear as a green tick if all is well in the web interface, which we'll be getting to in a future post. The health check in question simply ensures that the Docker registry is listening via TCP on the port it should be.
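To get a feel for what that health check is doing, here's a rough shell equivalent you can run by hand. It's a sketch, not what Consul actually runs internally: tcp_check is a hypothetical helper, and it relies on bash's /dev/tcp feature and the coreutils timeout command. The 3 second limit mirrors the timeout in the check block above.

```shell
#!/usr/bin/env bash
# Sketch only: a rough shell equivalent of the tcp health check above.
# tcp_check is a hypothetical helper; it needs bash (for /dev/tcp) and the
# coreutils timeout command.
tcp_check() {
    host="$1"; port="$2"
    # Succeeds only if a TCP connection can be opened within 3 seconds
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Usage, assuming the registry is on this host with the static port 5000:
#   tcp_check 127.0.0.1 5000 && echo "registry is listening"
```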

Here's the completed job file:

job "registry" {
    datacenters = ["dc1"]
    # The Docker registry *is* pretty important....
    priority = 80

    # If this task was a regular task, we'd use a constraint here instead & set the weight to -100
    affinity {
        attribute   = "${node.class}"
        value       = "controller"
        weight      = 100
    }

    group "main" {

        task "registry" {
            driver = "docker"

            config {
                image = "registry:2"
                labels { group = "registry" }

                volumes = [
                    "/mnt/shared/registry:/var/lib/registry"
                ]

                port_map {
                    registry = 5000
                }
            }

            resources {
                network {
                    port "registry" {
                        static = 5000
                    }
                }
            }

            service {
                name = "${TASK}"
                tags = [ "infrastructure" ]

                address_mode = "host"
                port = "registry"
                check {
                    type        = "tcp"
                    port        = "registry"
                    interval    = "10s"
                    timeout     = "3s"
                }

            }
        }

        // task "registry-web" {
        //  driver = "docker"
        // 
        //  config {
        //      // We're going to have to build our own - the Docker image on the Docker Hub is amd64 only :-/
        //      // See https://github.com/Joxit/docker-registry-ui
        //      image = ""
        //  }
        // }
    }
}

Save this to a file, and then run it on the cluster like so:

nomad job run path/to/job/file.nomad
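If you'd like a safety net, a sketch like this checks the job file with nomad job validate before submitting it. NOMAD_CMD is a hypothetical override of my own so that the function can be dry-run with echo, and submit_job is likewise just an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch only: validate a job file before submitting it to the cluster.
# NOMAD_CMD is a hypothetical override so the function can be dry-run.
NOMAD_CMD="${NOMAD_CMD:-nomad}"

submit_job() {
    file="$1"
    # Stop here if the job file doesn't pass validation
    $NOMAD_CMD job validate "$file" || return 1
    $NOMAD_CMD job run "$file"
}

# Usage:
#   submit_job path/to/job/file.nomad
# Dry run (prints the commands instead of running them):
#   NOMAD_CMD="echo nomad"; submit_job path/to/job/file.nomad
```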

I'm as of yet unsure as to whether Nomad needs the file to persist on disk to avoid it getting confused - so it's probably best to keep your job files in a permanent place on disk to avoid issues.

Give Nomad a moment to start the job, and then you can check on its status like so:

nomad job status

This will print a summary of the status of all jobs on the cluster. To get detailed information about our new job, do this:

nomad job status registry

It should show that 1 task is running, like this:

ID            = registry
Name          = registry
Submit Date   = 2020-04-26T01:23:37+01:00
Type          = service
Priority      = 80
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
main        0       0         1        5       6         1

Latest Deployment
ID          = ZZZZZZZZ
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
main        1        1       1        0          2020-06-17T22:03:58+01:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
XXXXXXXX  YYYYYYYY  main        4        run      running  6d2h ago  2d23h ago

Ignore the Failed, Complete, and Lost there in my output - I ran into some snags while learning the system and setting mine up :P
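If you want to check this from a script rather than by eye, here's a sketch that pulls the Running count out of that output. running_count is a hypothetical helper, and it assumes the column layout shown above (which may differ between Nomad versions).

```shell
#!/usr/bin/env bash
# Sketch only: read `nomad job status <job>` output on stdin and print the
# Running count for a task group. running_count is a hypothetical helper and
# assumes the Summary table layout shown above.
running_count() {
    group="${1:-main}"
    awk -v g="$group" '
        $1 == "Summary"       { in_summary = 1; next }
        in_summary && NF == 0 { in_summary = 0 }
        in_summary && $1 == g { print $4 }
    '
}

# Usage:
#   nomad job status registry | running_count main
```

Only the Summary table is looked at, because the task group name appears again in the Deployed table further down.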

You should also be able to resolve the IP of your Docker registry via DNS:

dig +short registry.service.mooncarrot.space

mooncarrot.space is the root domain I've bought for my cluster. I highly recommend you do the same if you haven't already. Consul exposes all services under the service subdomain, so in the future you should be able to resolve the IP of all your services in the same way: service_name.service.DOMAIN_ROOT.
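As a small sketch of that naming pattern, this function builds the DNS name for a given service. service_fqdn is a hypothetical helper, and the default root domain is just the one I use.

```shell
#!/usr/bin/env bash
# Sketch only: build the DNS name Consul exposes for a service.
# service_fqdn is a hypothetical helper; swap in your own root domain.
service_fqdn() {
    printf '%s.service.%s\n' "$1" "${2:-mooncarrot.space}"
}

# Usage:
#   dig +short "$(service_fqdn registry)"
#   dig +short "$(service_fqdn registry yourdomain.example)"
```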

Take care to ensure that it's showing the right IP address here. In my case, it should be the IP address of the wgoverlay network interface. If it's showing the wrong IP address, you may need to carefully check the configuration of both Nomad and Consul. Specifically, start by checking the network_interface setting in the client block of your Nomad worker nodes from part 7 of this series.

Conclusion

We're getting there, slowly but surely. Today we've set up shared storage with NFS, and started our first Nomad job. In doing so, we've started to kick the tyres of everything we've installed so far:

wesher, our WireGuard Mesh VPN
Unbound, our DNS server
Consul, our service discovery superglue
Nomad, our task scheduler

Truly, we are standing on the shoulders of giants: a whole host of open-source software that thousands of people from across the globe have collaborated together to produce which makes this all possible.

Moving forwards, we're going to be putting that Docker registry to good use. More immediately, we're going to be setting up Fabio (whose documentation is only marginally better than Traefik's, but just good enough that I could figure out how to use it....) in order to take a peek at those cool web interfaces for Nomad and Consul that I keep talking about.

We're also going to be looking at setting up Vault for secret (and certificate, if all goes well) management.

Until then, happy cluster configuration! If you're confused about anything so far, please leave a comment below. If you've got a suggestion to make it even better, please comment also! I'd love to know.

Sources and further reading

Alternatives to NFS
- Gluster
- Ceph
/etc/network/if-up.d/ NFS automount trick
autofs
Docker Registry
- Deploy a registry server
Nomad
- Docker driver

Cluster, Part 7: Wrangling... boxes? | Expanding the Hashicorp stack with Docker and Nomad

Welcome back to part 7 of my cluster configuration series. Sorry this one's a bit late - the last one was a big undertaking, and I needed a bit of a rest :P

Anyway, I'm back at it with another part to my cluster series. For reference, here are all the posts in this series so far:

Don't forget that you can see all the latest posts in the cluster tag right here on my blog.

Last time, we lit the spark for the bonfire so to speak by setting up Consul, which keeps track of what is running where. We also tied it into the internal DNS system that we set up in part 4, which will act as the binding fabric of our network.

In this post, we're going to be doing 2 very important things:

Installing Docker
Installing and configuring Hashicorp Nomad

This is set to be another complex blog post that builds on the previous ones in this series (remember that benign rabbit hole from a few blog posts ago?).

Above: Nomad is a bit like a railway network manager. It decides what is going to run where and at what time. Picture taken by me.

Installing Docker

Let's install Docker first. This should be relatively easy. According to the official Docker documentation, you can install Docker like so:

curl https://get.docker.com/ | sudo sh

I don't like piping to sh though (and neither should you), so we're going to be doing something more akin to the "install using the repository" method. As a reminder, I'm using Raspberry Pi 4s running Raspbian (well, DietPi - but that's a minor detail). If you're using a different distribution or CPU architecture, you'll need to read the documentation to figure out the specifics of installing Docker for your architecture.

For Raspberry Pi 4s at least, it looks a bit like this:

echo 'deb [arch=armhf] https://download.docker.com/linux/raspbian buster stable' | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt update
sudo apt install docker-ce

Don't forget that if you're running an apt caching server, you'll need to tweak that https to be plain-old http. For the curious, my automated script for my automated Ansible replacement (see the "A note about cluster management" in part 6) looks like this:

#!/usr/bin/env bash
RUN "echo 'deb [arch=armhf] http://download.docker.com/linux/raspbian buster stable' | sudo tee /etc/apt/sources.list.d/docker.list";
RUN "sudo apt-get update";
RUN "sudo apt-get install --yes docker-ce";

Docker should install without issue - note that you need to install it on all nodes in the cluster. We can't really do anything meaningful with it yet though, as we don't yet have Nomad installed. Let's move on and install that then!

Installing Hashicorp Nomad

Nomad is what's known as a workload orchestrator. This means that it, given a bunch of jobs, decides what is going to run where. If a host goes down, it is also responsible for shuffling things around to compensate.

Nomad works on the concept of 'jobs', which can be handled by any 1 of a number of drivers. In our case, we're going to be using the built-in Docker driver, as we want to manage the running of lots of Docker containers across multiple hosts in our cluster.

After installing Consul last time, we can build on that with Nomad. The 2 actually integrate really nicely with each other. Nomad will, by default, seek out a local Consul daemon, use it to discover other hosts in the cluster, and hang its own cluster from Consul. Neat!

Also like Consul, Nomad functions with servers and clients. The servers all talk to each other via the Raft consensus algorithm, and the clients are lightweight daemons that do what the servers tell them to. I'm going to have 3 servers and 2 clients, in the following layout:

Host #	Consul	Nomad
1	Server	Server
2	Server + Client	Client
3	Server + Client	Client
4	Client	Server + Client
5	Client	Server + Client

Just for the record, according to the Nomad documentation it's not recommended that servers also act as clients, but I don't have enough hosts to avoid this yet.

With this in mind, let's install Nomad. Again, as last time, I've packaged Nomad in my apt repository. If you haven't already, go and set it up now. Then, install nomad like so:

sudo apt install hashicorp-nomad

Also as last time, I've deliberately chosen a different name than the existing nomad package that you'll probably find in your distribution's repositories to avoid confusion during updates. If you're a systemd user, then I've also got a trio of packages that provide a systemd service file:

Package Name	Config file location
`hashicorp-nomad-systemd-server`	`/etc/nomad/server.hcl`
`hashicorp-nomad-systemd-client`	`/etc/nomad/client.hcl`
`hashicorp-nomad-systemd-both`	`/etc/nomad/both.hcl`

They all conflict with each other (such that you can only have 1 installed at a time), and the only difference between them is where the configuration file is located.

Install 1 of these (if required) now too with your package manager. If you're not a systemd user, consult your service manager's documentation and write a service definition. If you're willing, comment below and I'll include a note about it here!

Speaking of configuration files, we should write one for Nomad. Let's start off with the bits that will be common across all the config file variants:

bind_addr = "{{ GetInterfaceIP \"wgoverlay\" }}"

# Increase log verbosity
log_level = "INFO"

# Setup data dir
# The data directory used to store state and other persistent data. On client
# machines this is used to house allocation data such as downloaded artifacts
# used by drivers. On server nodes, the data dir is also used to store the
# replicated log.
data_dir = "/srv/nomad"

A few things to note here. log_level is mostly personal preference, so do whatever you like there. I'll probably tune it myself as I get more familiar with how everything works.

data_dir needs to be a path to a private root-owned directory on disk for the Nomad agent to store stuff locally to that node. It should not be shared with other nodes. If you installed one of the systemd packages above, /srv/nomad is created and given the proper permissions for you.

bind_addr tells Nomad which network interface to send management clustering traffic over. For me, I'm using the WireGuard mesh VPN I set up in part 4, so I specify wgoverlay here.

Next, let's look at the server config:

# Enable the server
server {
    enabled = true

    # We've got 3 servers in the cluster at the moment
    bootstrap_expect = 3

    # Note that Nomad finds other servers automagically through the consul cluster

    # TODO: Enable this. Before we do we need to figure out how to move this sekret into vault though or something
    # encrypt = "SOME_VALUE_HERE"
}

Not much to see here. Don't forget to change the bootstrap_expect to a different value if you are going to have a different number of servers in your cluster (nodes that are just clients don't count).

Note that this isn't the complete server configuration file - you need to take both this and the above common bit to make the complete server configuration file.

Now, let's look at the client configuration:

client {
    enabled = true
    # Note that Nomad finds other servers automagically through the consul cluster

    # Just a worker, nothing special going on here
    node_class = "worker"

    # use wgoverlay for network fingerprinting and forwarding
    network_interface = "wgoverlay"

    # Nobody is allowed to run as root - even if you *are* inside a container.....
    # For 1 thing it'll trigger a permission denied when writing to the NFS share
    options = {
        "user.blacklist" = "root"
    }
}

This is more interesting.

network_interface is really important if you're using a WireGuard mesh VPN like wesher that I set up and configured in part 4. By default, Nomad port forwards over all interfaces that make sense, and in this case gets it wrong.

This fixes that by telling it to explicitly port forward containers over the wgoverlay interface. If your network interface has a different name, this is the place to change it. It's a fairly common practice from what I can tell to have both a 'public' and a 'private' network in a cluster environment. The private network is usually trusted, and as such has lots of management traffic running over it. The public network is the one that's locked down that requests come in to from outside.

The "user.blacklist" = "root" here is a precaution that I may end up having to remove in future. It blocks any container on this client from running as root inside the Docker container. This is actually worth remembering, because running as root is a bit of a security risk. This is a fail-safe to remind myself that it's a Bad Idea.

Apparently there are tactics that can be deployed to avoid running containers as root - even when you might think you need to. In addition, if there's no other way to avoid it, apparently there's a clever user namespace remapping trick one can deploy to prevent a container from having root privileges if it breaks out of its container.

Another thing to note is that NFS shares often don't like you reading or writing files owned by root either, so if you're going to be saving data to a shared NFS partition like me, this is yet another reason to avoid root in your containers.

At this point it's also probably a good idea to talk a little bit about usernames - although we'll talk in more depth about this later. From my current understanding, the usernames inside a container aren't necessarily the same as those outside the container.

Every process runs under a specified username, but each username is backed by a given user id. It's this user id that is translated back into a username on the client machine when reading files from an NFS mount - hence why usernames in NFS shares can be somewhat odd.
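You can see this for yourself with a quick sketch like the one below, which prints the numeric user id that owns a file next to the name the local machine maps it to. owner_of is a hypothetical helper, and it assumes GNU stat (which is what Raspbian ships).

```shell
#!/usr/bin/env bash
# Sketch only: show the numeric user id that owns a file, followed by the
# username this machine maps that id to. owner_of is a hypothetical helper
# and assumes GNU stat.
owner_of() {
    stat -c '%u %U' "$1"
}

# Usage: compare the output on 2 different hosts for a file on the NFS share.
# The number should match, but the name may differ (or show as UNKNOWN if
# that host has no user with that id).
#   owner_of /mnt/shared/registry
```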

Docker containers often have custom usernames created inside the containers for running processes inside the container with specific user ids. More on this later, but I plan to dive into this in the process of making my own Docker container images.

Anyway, we now have our configuration files for Nomad. For a client node, take the client config and the common config from the top of this section. For a server, take the server and common sections. For a node that's to act as both a client and a server, take all 3 sections.
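If you keep each section in its own file, a sketch like this can stitch them together for a given role. build_nomad_config and the fragment file names (common.hcl, server.hcl, client.hcl) are hypothetical names of my own - only the 3 roles come from the text above.

```shell
#!/usr/bin/env bash
# Sketch only: stitch the config sections together for a given role.
# build_nomad_config and the fragment file names are hypothetical.
build_nomad_config() {
    role="$1"; dir="$2"
    case "$role" in
        server) cat "$dir/common.hcl" "$dir/server.hcl" ;;
        client) cat "$dir/common.hcl" "$dir/client.hcl" ;;
        both)   cat "$dir/common.hcl" "$dir/server.hcl" "$dir/client.hcl" ;;
        *)      echo "unknown role: $role" >&2; return 1 ;;
    esac
}

# Usage (as root), matching the systemd package you installed:
#   build_nomad_config both ./fragments > /etc/nomad/both.hcl
```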

Now that we've got that sorted, we should be able to start the Nomad agent:

sudo systemctl enable --now nomad.service

This is the same for all nodes in the cluster - regardless as to whether it's a client, a server, or both (this is also the reason that you can't have more than 1 of the systemd apt packages installed at once that I mentioned above).

If you're using the UFW firewall, then that will need configuring. For me, I'm allowing all traffic on the wgoverlay network interface that's acting as my trusted network:

sudo ufw allow in on wgoverlay

If you'd prefer not to do that, then you can allow only the specific ports through like so:

sudo ufw allow 4646/tcp comment nomad-http
sudo ufw allow 4647/tcp comment nomad-rpc
sudo ufw allow 4648/tcp comment nomad-serf
sudo ufw allow 4648/udp comment nomad-serf

Note that this allows the traffic on all interfaces - these will need tweaking if you only want to allow the traffic in on a specific interface (which, depending on your setup, is probably a wise decision).
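As a rough sketch of what that tweaking might look like, the snippet below prints (but does not run) a set of ufw rules restricted to a single interface. It assumes Nomad's default ports, and `nomad_ufw_rules` is a hypothetical helper name of my own. The UDP rule for port 4648 is there because serf gossips over both TCP and UDP:

```shell
# Print (not run) ufw rules that only admit Nomad traffic on one interface.
# Assumption: Nomad's default ports (4646 http, 4647 rpc, 4648 serf).
nomad_ufw_rules() {
	iface="$1"
	for spec in 4646:tcp:nomad-http 4647:tcp:nomad-rpc 4648:tcp:nomad-serf 4648:udp:nomad-serf; do
		port="${spec%%:*}"     # text before the first colon
		rest="${spec#*:}"      # text after the first colon
		proto="${rest%%:*}"
		label="${rest#*:}"
		echo "ufw allow in on $iface to any port $port proto $proto comment $label"
	done
}

# Review the output first, then apply it with: nomad_ufw_rules wgoverlay | sudo sh
nomad_ufw_rules wgoverlay
```

Printing the rules rather than applying them directly means you get to check them over before they touch your firewall.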

Anyway, you should now be able to ask the Nomad cluster for its status like so:

nomad node status

...execute this from any server node in the cluster. It should give output like this:

ID        DC   Name         Class   Drain  Eligibility  Status
75188064  dc1  piano        worker  false  eligible     ready
9eb7a7a5  dc1  harpsicord   worker  false  eligible     ready
c0d23e71  dc1  saxophone    worker  false  eligible     ready
a837aaf4  dc1  violin       worker  false  eligible     ready

If you see this, you've successfully configured Nomad. Next, I recommend reading the Nomad tutorial and experimenting with some of the examples. In particular the Getting Started and Deploy and Manage Jobs topics are worth investigating.
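If you'd like to check the node status output from a script rather than by eye, here's a minimal sketch. It assumes the column layout shown above (with Status as the last column), and `count_unready_nodes` is a hypothetical helper name, not part of Nomad:

```shell
# Count nodes whose Status column (the last field) is anything other than "ready".
# Reads the output of `nomad node status` on stdin and skips the header row.
count_unready_nodes() {
	awk 'NR > 1 && NF > 0 && $NF != "ready" { n++ } END { print n + 0 }'
}

# Usage (on a server node): nomad node status | count_unready_nodes
```

A result of 0 means every listed node reported itself as ready.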

Conclusion

In this post, we've installed Docker, and installed and configured Nomad. We've also touched briefly on some of the security considerations we need to be aware of when running things in Docker containers - much more on this in the future.

In future posts, we're going to look at setting up shared storage, so that jobs running on Nomad can safely store state and execute on any client / worker node in the cluster while retaining access to said state information.

On the topic of Nomad, we're also going to look at running our first real job: a Docker registry, so that we can push our own custom Docker images to it when we've built them.

You may have noticed that both Nomad and Consul also come with a web interface. We're going to look at these too, but in order to do so we need a special container-aware reverse-proxy to act as a broker between 'cluster-space' (in which everything happens 'somewhere', and we don't really know nor do we particularly care where), and 'normal-network-space' (in which everything happens in clearly defined places).

I've actually been experiencing some issues with this, as I initially wanted to use Traefik for this purpose - but I ran into a number of serious difficulties with reading their (lack of) documentation. After getting thoroughly confused I'm now experimenting with Fabio (git repository) instead, which I'm getting on much better with. It's a shame really, I even got as far as writing the automated packaging script for Traefik - as evidenced by the traefik packages in my apt repository.

Until then though, happy cluster configuration! Feel free to post a comment below.

Found this interesting? Found a mistake? Confused about something? Comment below!
