Starbeamrainbowlabs


Automatically downloading emails and extracting their attachments

I have an all-in-one printer that's also a scanner - specifically the Epson Ecotank 4750 (though annoyingly the automated document feeder doesn't support duplex). While it's a great printer (very eco-friendly, and the inks last for ages!), my biggest frustration with it is that it doesn't scan directly to an SMB file share (i.e. a Windows file share). It does support SANE though, which allows you to use it through a computer.

This is ok, but the ability to scan directly from the device itself without needing to use a computer was very convenient, so I set out to remedy this. The printer does have a cloud feature they call "Epson Connect", which allows one to upload to various cloud services such as Google Drive and Box, but I don't want to upload potentially sensitive data to such services.

Fortunately, there's a solution at hand - email! The printer in question also supports scanning to an email address. Once the scanning process is complete, it sends an email to the preconfigured email address with the scanned page(s) attached. It's been far too long since my last post about email too, so let's do something about that.

Logging in to my email account just to pick up a scan is clunky and annoying though, so I decided to automate the process to resolve the issue. The plan is as follows:

  1. Obtain a fresh email address
  2. Use IMAP IDLE to instantly download emails
  3. Extract attachments and save them to the output directory
  4. Discard the email - both locally and remotely

As some readers may be aware, I run my own email server (hence why I wrote this post about email previously), so I reconfigured it to add a new email address. Many free providers exist out there too - just make sure you don't use an account you might want to use for anything else, since our script will eat any emails sent to it.

Steps 2, 3, and 4 there took some research and fiddling about, but in the end I cooked up a shell script solution that uses fetchmail, procmail (which is apparently unmaintained, so I should consider looking for alternatives), inotifywait, and munpack. I've also packaged it into a Docker container, which I'll talk about later in this post.

To illustrate how all of these fit together, let's use a diagram:

A diagram showing how the whole process fits together - explanation below.

fetchmail uses IMAP IDLE to hold a connection open to the email server. When it receives notification of a new email, it instantly downloads it and spawns a new instance of procmail to handle it.

procmail writes the email to a temporary directory structure, which a separate script is watching with inotifywait. As soon as procmail finishes writing the new email to disk, inotifywait triggers and the email is unpacked with munpack. Any attachments found are moved to the output directory, and the original email discarded.

With this in mind, let's start drafting up a script. The first order of the day is configuring fetchmail. This is done using a .fetchmailrc file - I came up with this:

poll bobsrockets.com protocol IMAP port 993
    user "[email protected]" with pass "PASSWORD_HERE"
    idle
    ssl

...where [email protected] is the email address you want to watch, bobsrockets.com is the domain part of said email address (everything after the @), and PASSWORD_HERE is the password required to login.

Save this somewhere safe with tight file permissions for later.
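
As a rough sketch of what "tight file permissions" could look like in practice, the snippet below writes the config with a restrictive umask and then sets the mode explicitly. The write_fetchmailrc function name, the temporary demo path, and the USER_HERE placeholder are assumptions for illustration only - substitute your own values.

```bash
#!/usr/bin/env bash

# Hypothetical helper: writes a fetchmail config that only its owner can access.
# $1    The path to write the config file to.
write_fetchmailrc() {
    local target="${1}";
    (
        # Inside this subshell, new files are created without group/other access
        umask 077;
        printf '%s\n' \
            'poll bobsrockets.com protocol IMAP port 993' \
            '    user "USER_HERE" with pass "PASSWORD_HERE"' \
            '    idle' \
            '    ssl' > "${target}";
    );
    # Be explicit too, in case the file already existed with looser permissions
    chmod 600 "${target}";
}

# Demo: write to a throwaway location and show the resulting mode
demo_rc="$(mktemp -d)/.fetchmailrc";
write_fetchmailrc "${demo_rc}";
stat -c '%a %n' "${demo_rc}";
```

The stat call here uses the GNU coreutils syntax, which matches the Debian system used in this post.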

The other configuration file we'll need is one for procmail. Let's do that one now:

CORRECTHOME=/tmp/maildir
MAILDIR=$CORRECTHOME/

:0
Mail/

Replace /tmp/maildir with the temporary directory you want to use to hold emails in. Save this as procmail.conf for later too.

Now we have the mail config files written, we need to install some software. I'm using apt on Debian (a minideb Docker container actually), so you'll need to adapt this for your own system if required.

sudo apt install ca-certificates fetchmail procmail inotify-tools mpack
# or, if you're using minideb:
install_packages ca-certificates fetchmail procmail inotify-tools mpack

fetchmail is for some strange reason extremely picky about the user account it runs under, so let's update the pre-created fetchmail user account to make it happy:

groupadd --gid 10000 fetchmail
usermod --uid 10000 --gid 10000 --home=/srv/fetchmail fetchmail
chown fetchmail:fetchmail /srv/fetchmail

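fetchmail now needs that config file we created earlier. Let's update the ownership and permissions on that, since fetchmail refuses to use a config file that other users can access:

chown 10000:10000 path/to/.fetchmailrc
chmod 600 path/to/.fetchmailrc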

If you're running on bare metal, move it to the /srv/fetchmail directory now. If you're using Docker, keep reading, as I recommend that this file is mounted using a Docker volume to make the resulting container image more reusable.

Now let's start drafting a shell script to pull everything together. Let's start with some initial setup:

#!/usr/bin/env bash

if [[ -z "${TARGET_UID}" ]]; then
    echo "Error: The TARGET_UID environment variable was not specified.";
    exit 1;
fi
if [[ -z "${TARGET_GID}" ]]; then
    echo "Error: The TARGET_GID environment variable was not specified.";
    exit 1;
fi
if [[ "${EUID}" -ne 0 ]]; then
    echo "Error: This Docker container must run as root because fetchmail is a pain, and to allow customisation of the target UID/GID (although all possible actions are run as non-root users)";
    exit 1;
fi

dir_mail_root="/tmp/maildir";
dir_newmail="${dir_mail_root}/Mail/new";
target_dir="/mnt/output";

fetchmail_uid="$(id -u "fetchmail")";
fetchmail_gid="$(id -g "fetchmail")";

temp_dir="$(mktemp --tmpdir -d "imap-download-XXXXXXX")";
on_exit() {
    rm -rf "${temp_dir}";
}
trap on_exit EXIT;

log_msg() {
    echo "$(date -u +"%Y-%m-%d %H:%M:%S") imap-download: $*";
}

This script will run as root, and fetchmail runs as UID 10000 and GID 10000. The reasons for this are complicated (and mostly have to do with my weird network setup). We look for the TARGET_UID and TARGET_GID environment variables, as these define the uid:gid we'll be setting files to before writing them to the output directory.

We also determine the fetchmail UID/GID dynamically here, and create a second temporary directory to work with too (the reasons for which will become apparent).

Before we continue, we need to create the directory procmail writes new emails to. Not because procmail won't create it on its own (because it will), but because we need it to exist up-front so we can watch it with inotifywait:

mkdir -p "${dir_newmail}";
chown -R "${fetchmail_uid}:${fetchmail_gid}" "${dir_mail_root}";

We're running as root, but we'll want to spawn fetchmail (and other things) as non-root users. Technically, I don't think you're supposed to use sudo in non-interactive scripts, and it's also not present in my Docker container image. The alternative is the setpriv command, but using it is rather complicated and annoying.

It's more powerful than sudo, as it allows you to specify not only the UID/GID a process runs as, but also the capabilities the process will have too (e.g. binding to low port numbers). There's a nasty bug one has to work around if one is using Docker too, so given all this I've written a wrapper function that abstracts all of this complexity away:

# Runs a process as another user.
# Ref https://github.com/SinusBot/docker/pull/40
# $1    The UID to run the process as.
# $2    The GID to run the process as.
# $3-*  The command (including arguments) to run
run_as_user() {
    run_as_uid="${1}"; shift;
    run_as_gid="${1}"; shift;
    if [[ -z "${run_as_uid}" ]]; then
        echo "run_as_user: No target UID specified.";
        return 1;
    fi
    if [[ -z "${run_as_gid}" ]]; then
        echo "run_as_user: No target GID specified.";
        return 2;
    fi

    # Ref https://github.com/SinusBot/docker/pull/40
    # WORKAROUND for `setpriv: libcap-ng is too old for "all" caps`, previously "-all" was used here
    # create a list to drop all capabilities supported by current kernel
    cap_prefix="-cap_";
    caps="$cap_prefix$(seq -s ",$cap_prefix" 0 "$(cat /proc/sys/kernel/cap_last_cap)")";

    setpriv --inh-caps="${caps}" --reuid "${run_as_uid}" --clear-groups --regid "${run_as_gid}" "$@";
    return "$?";
}
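
The capability list in that workaround is a little cryptic, so here is a minimal standalone sketch of just that part. The build_cap_drop_list function name is an assumption for illustration, and it takes the highest capability index as an argument instead of reading /proc/sys/kernel/cap_last_cap so that the result is predictable. It assumes GNU seq, as found on Debian.

```bash
#!/usr/bin/env bash

# Hypothetical helper: builds the "drop every capability" list for setpriv.
# $1    The highest capability index (normally read from /proc/sys/kernel/cap_last_cap)
build_cap_drop_list() {
    local last_cap="${1}";
    local cap_prefix="-cap_";
    # seq only puts the separator *between* numbers, so the first entry
    # needs the prefix adding manually.
    echo "${cap_prefix}$(seq -s ",${cap_prefix}" 0 "${last_cap}")";
}

build_cap_drop_list 3;
# → -cap_0,-cap_1,-cap_2,-cap_3
```

Each entry asks setpriv to remove the capability with that numeric index, so the full list covers everything the running kernel knows about.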

With this in hand, we can now wrap fetchmail and procmail in a function too:

do_fetchmail() {
    log_msg "Starting fetchmail";

    while :; do
        run_as_user "${fetchmail_uid}" "${fetchmail_gid}" fetchmail --mda "/usr/bin/procmail -m /srv/procmail.conf";

        exit_code="$?";
        if [[ "$exit_code" -eq 127 ]]; then
            log_msg "setpriv failed, exiting with code 127";
            exit 127;
        fi 

        log_msg "Fetchmail exited with code ${exit_code}, sleeping 60 seconds";
        sleep 60
    done
}

In short this spawns fetchmail as the fetchmail user we configured above, and also restarts it if it dies. If setpriv fails, it returns an exit code of 127 - so we catch that and don't bother trying again, as the issue likely needs manual intervention.
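
The restart-unless-fatal pattern can be tried out in isolation without fetchmail or setpriv installed. The following is only a sketch: the supervise function, its configurable delay, and the flaky test command are all hypothetical names introduced here, not part of the script in this post.

```bash
#!/usr/bin/env bash

# Hypothetical helper: restarts a command until it exits with a fatal code.
# $1    The exit code to treat as fatal (127 in the script above)
# $2    The number of seconds to sleep between restarts
# $3-*  The command (including arguments) to supervise
supervise() {
    local fatal_code="${1}"; shift;
    local delay="${1}"; shift;
    local exit_code;
    while :; do
        "$@";
        exit_code="$?";
        if [[ "${exit_code}" -eq "${fatal_code}" ]]; then
            echo "supervise: fatal exit code ${exit_code}, giving up" >&2;
            return "${exit_code}";
        fi
        echo "supervise: exited with code ${exit_code}, retrying in ${delay}s" >&2;
        sleep "${delay}";
    done
}

# Demo command: fails with code 1 twice, then with the fatal code 127
attempts=0;
flaky() {
    attempts=$((attempts + 1));
    if [[ "${attempts}" -ge 3 ]]; then return 127; fi
    return 1;
}

final_code=0;
supervise 127 0 flaky || final_code="$?";
echo "Stopped after ${attempts} attempts with code ${final_code}";
```

Making the delay a parameter is purely for convenience when experimenting - the script in this post hard-codes 60 seconds.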

To finish the script, we now need to set up that inotifywait loop I mentioned earlier. Let's set up a shell function for that:


do_attachments() {
    while :; do # : = infinite loop
        # Wait for an update
        # inotifywait's non-0 exit code forces an exit for some reason :-/
        inotifywait -qr --event create --format '%:e %f' "${dir_newmail}";

        # Process new email here
    done
}

Processing new emails is not particularly difficult, but requires a sub loop because:

  • More than 1 email could be written at a time
  • Additional emails could slip through when we're processing the last one

while read -r filename; do

    # Process each email

done < <(find "${dir_newmail}" -type f);
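
Feeding the loop with process substitution (the < <(...) part) rather than a pipe means the loop body runs in the current shell, so any variables it sets are still available afterwards. If you're worried about unusual filenames, a more defensive variant is sketched below. It is an optional alternative and not what the script in this post uses: the list_emails function and the demo filenames are made up for illustration, and filenames written by procmail are unlikely to contain spaces or newlines in the first place.

```bash
#!/usr/bin/env bash

# Hypothetical variant of the sub loop: NUL-delimited so that filenames
# containing spaces or newlines are handled safely.
# $1    The directory to scan for email files.
list_emails() {
    local dir="${1}" filename;
    while IFS= read -r -d '' filename; do
        printf 'Would process: %s\n' "${filename}";
    done < <(find "${dir}" -type f -print0);
}

# Demo with a throwaway directory standing in for the Mail/new directory
demo_newmail="$(mktemp -d)";
touch "${demo_newmail}/1700000000.M1.host" "${demo_newmail}/name with spaces";
list_emails "${demo_newmail}";
```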

Finally, we need to process each email we find in turn. Let's outline the steps we need to take:

  1. Move the email to that second temporary directory we created above (since the procmail directory might not be empty)
  2. Unpack the attachments
  3. chown the attachments to the target UID/GID and move them to the output directory

Let's do this in chunks. First, let's move it to the temporary directory:

log_msg "Processing email ${filename}";

# Move the email to a temporary directory for processing
mv "${filename}" "${temp_dir}";

The filename shell variable there is the absolute path to the email in question, since we used find and passed it an absolute directory to list the contents of (as opposed to a relative path).

To find the filepath we moved it to, we need to do this:

filepath_temp="${temp_dir}/$(basename "${filename}")"

This is important for the next step, where we unpack it:

# Unpack the attachments
munpack -C "${temp_dir}" "${filepath_temp}";

Now that we've unpacked it, let's do a bit of cleaning up, by deleting the original email file and the .desc description files that munpack also generates:

# Delete the original email file and any description files
rm "${filepath_temp}";
find "${temp_dir}" -iname '*.desc' -delete;

Great! Now that we have the attachments sorted, all we need to do is chown them to the target UID/GID and move them to the right place.

chown -R "${TARGET_UID}:${TARGET_GID}" "${temp_dir}";
chmod -R a=rX,ug+w "${temp_dir}";

I also chmod the temporary directory to make sure that the permissions are correct, because otherwise the mv command is unable to read the directory's contents.
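
If the symbolic mode a=rX,ug+w looks unfamiliar, here's a small experiment that shows what it does. The capital X only grants execute permission to directories (and to files that were already executable), so directories stay traversable without every attachment becoming executable. The demo paths are throwaway names, and stat -c is the GNU coreutils syntax.

```bash
#!/usr/bin/env bash

# Build a small throwaway tree with deliberately restrictive permissions
demo_dir="$(mktemp -d)";
mkdir "${demo_dir}/sub";
touch "${demo_dir}/sub/scan.pdf";
chmod 700 "${demo_dir}/sub";
chmod 600 "${demo_dir}/sub/scan.pdf";

# The same mode string as used in the script
chmod -R a=rX,ug+w "${demo_dir}";

dir_mode="$(stat -c '%a' "${demo_dir}/sub")";
file_mode="$(stat -c '%a' "${demo_dir}/sub/scan.pdf")";
echo "directory: ${dir_mode}, file: ${file_mode}";
# → directory: 775, file: 664

rm -rf "${demo_dir}";
```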

Now to actually move all the attachments:

# Move the attachment files to the output directory
while read -r attachment; do
    log_msg "Extracted attachment ${attachment}";
    chmod 0775 "${attachment}";
    run_as_user "${TARGET_UID}" "${TARGET_GID}" mv "${attachment}" "${target_dir}";
done < <(find "${temp_dir}" -type f);

This is rather overcomplicated because of an older design, but it does the job just fine.

With that done, we've finished the script. I'll include the whole script at the bottom of this post.

Dockerification

If you're running on bare metal, then you can skip to the end of this post. Because I have a cluster, I want to be able to run this thereon. Since said cluster works with Docker containers, it's natural to Dockerise this process.

The Dockerfile for all this is surprisingly concise:

(Can't see the above? View it on my personal Git server instead)

To use this, you'll need the following files alongside it:

It exposes the following Docker volumes:

  • /mnt/fetchmailrc: The fetchmailrc file
  • /mnt/output: The target output directory

All these files can be found in this directory on my personal Git server.

Conclusion

We've strung together a bunch of different programs to automatically download emails and extract their attachments. This is very useful for ingesting all sorts of different files. Things I haven't covered:

  • Restricting it to certain source email addresses to handle spam
  • Restricting the file types accepted (the file command is probably your friend)
  • Disallowing large files (most 3rd party email servers do this automatically, but in my case I don't have a limit that I know of other than my hard disk space)

As always, this blog post is both a reference for my own use and a starting point for you if you'd like to do this for yourself.

If you've found this useful, please comment below! I find it really inspiring / motivating to learn how people have found my posts useful and what for.

Sources and further reading

run.sh script

(Can't see the above? Try this link, or alternatively this one (bash))

Deep dive: Email, Trust, DKIM, SPF, and more

(Above: Lots of parcels. Hopefully you won't get this many through the door at once... Source)

Now that I'm on holiday, I've got some time to write a few blog posts! As I've promised a few people a post on the email system, that's what I'll look at in this post. I'm going to take you on a deep dive through the email system and trust. We'll be journeying through the fields of DKIM signatures, and climbing the SPF mountain. We'll also investigate why the internet needs to take this journey in the first place, and look at some of the challenges one faces when setting up their own mail server.

Hang on to your hats, ladies and gentlemen! If you get to the end, give yourself a virtual cookie :D

Before we start though, I'd like to mention that I'll be coming at this from the perspective of my own email server that I set up myself. Let me introduce to you the cast: Postfix (the SMTP MTA), Dovecot (the IMAP MDA), rspamd (the spam filter), and OpenDKIM (the thing that deals with DKIM signatures).

With that out of the way, let's begin! We'll start off our journey by mapping out the path a typical email takes.

The path a typical email takes. See the explanation below.

Let's say Bob Kerman wants to send Bill an email. Here's what happens:

  1. Bob writes the email and hits send. His email client connects to his email server, logs in, and asks the server to deliver a message for him.
  2. The server takes the email and reads the To header (in this case it's [email protected]), figures out where the mail server is located, connects to it, and asks it to deliver Bob's message to Bill. mail.billsboosters.com takes the email and files it in Bill's inbox.
  3. Bill connects to his mail server and retrieves Bob's message.

Of course, this is simplified in several places. mail.bobsrockets.com will obviously need to do a few DNS lookups to find billsboosters.com's mail server and fiddle with the headers of Bob's message a bit (such as adding a Received header etc.), and mail.billsboosters.com won't just accept the message for delivery without checking out the server it came from first. How does it check though? What's preventing seanssatellites.net pretending to be bobsrockets.com and sending an imposter message?

Until relatively recently, the answer was, well, nothing really. Anyone could send an email to anyone else without having to prove that they could indeed send email in the name of a domain. Try it out for yourself by telnetting to a mail server on port 25 (unencrypted SMTP) and typing in something like this:

HELO mail.bobsrockets.com
MAIL FROM:<[email protected]>
RCPT TO:<[email protected]>
DATA
From: [email protected]
To: [email protected]

Hello! This is an email to remind you.....
.
QUIT

Oh, my! Frank at franksfuel.io can connect to any mail server and pretend that [email protected] is sending a message to [email protected]! Mail servers that allow this are called open relays, and today they usually find themselves on several blacklists within minutes. Ploys like these are easy to foil, thankfully (by only accepting mail for your own domains), but it still leaves the problem of what to do about random people connecting to your mail server delivering spam to your inbox that claims to be from someone they aren't supposed to be sending mail for.

In response, some mail servers demanded things like the IP that connects to send an email must reverse to the domain name that they want to send email from. Clever, but when you remember that anyone can change their own PTR records, you realise that it's just a minor annoyance to the determined spammer, and another hurdle to the legitimate person in setting up their own mail server!

Clearly, a better solution is needed. Time to introduce our first destination: SPF. SPF stands for Sender Policy Framework, and defines a mechanism by which a mail server can determine which IP addresses a domain allows mail to be sent from in its name. It's a TXT record that sits at the root of a domain. It looks something like this:

v=spf1 a mx ptr ip4:5.196.73.75 ip6:2001:41d0:e:74b::1 a:starbeamrainbowlabs.com a:mail.starbeamrainbowlabs.com -all

The above is my SPF TXT record for starbeamrainbowlabs.com. It's quite simple, really - let's break it down.

v=spf1

This just defines the version of the SPF standard. There's only one version so far, so we include this to state that this record is an SPF version 1 record.

a mx ptr

This says that an email passes if the IP address sending it matches one of the domain's a records, or the address of one of the hosts listed in its mx records. The ptr mechanism additionally allows senders whose IP address reverse-resolves to a name within the domain, as described above (it does help with dealing with infected machines and such), although the SPF specification now discourages publishing it. The mechanisms are checked from left to right, and the first one that matches decides the result.

ip4:5.196.73.75 ip6:2001:41d0:e:74b::1

This bit says that the IP addresses 5.196.73.75 and 2001:41d0:e:74b::1 are explicitly allowed to send mail in the name of starbeamrainbowlabs.com.

a:starbeamrainbowlabs.com a:mail.starbeamrainbowlabs.com

After all of the above, this bit isn't strictly necessary, but it says that all the IP addresses found in the a records for starbeamrainbowlabs.com and mail.starbeamrainbowlabs.com are allowed to send mail in the name of starbeamrainbowlabs.com.

-all

Lastly, this says that if you're not on the list, then your message should be rejected! Other variants on this include ~all (which says "put it in the spam box instead"), and +all (which says "accept it anyway", though I can't see how that's useful :P).
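
To make the role of that final all mechanism a bit more concrete, here's a minimal sketch that reports which policy a record ends with. It's purely illustrative: the spf_all_policy function name is made up, it only looks at the all mechanism, and it is nowhere near a full SPF evaluator. The result names (fail, softfail, neutral, pass) are the ones used in the SPF specification.

```bash
#!/usr/bin/env bash

# Hypothetical helper: prints the result implied by a record's "all" mechanism.
# $1    The SPF record text.
spf_all_policy() {
    local record="${1}" term;
    local -a terms;
    # Split on whitespace into an array (avoids accidental glob expansion)
    read -ra terms <<< "${record}";
    for term in "${terms[@]}"; do
        case "${term}" in
            "-all")         echo "fail"; return 0;;
            "~all")         echo "softfail"; return 0;;
            "?all")         echo "neutral"; return 0;;
            "+all"|"all")   echo "pass"; return 0;;
        esac
    done
    echo "none";
    return 1;
}

spf_all_policy "v=spf1 a mx ptr ip4:5.196.73.75 ip6:2001:41d0:e:74b::1 a:starbeamrainbowlabs.com a:mail.starbeamrainbowlabs.com -all";
# → fail
```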

As you can see, SPF allows a mail server to verify if a given client is indeed allowed to send an email in the name of any particular domain name. For a while, this worked a treat - until a new problem arose.

Many of the mail servers on the internet didn't (and some probably still don't!) support encryption when connecting to and delivering mail, as certificates were expensive and difficult to get hold of (nowadays we've got LetsEncrypt who give out certificates for free!). The encryption used when mail servers connect to one another is practically identical to that used in HTTPS - so if done correctly, the identity of the remote server can be verified and the emails exchanged encrypted, if the world's certification authorities aren't corrupted, of course.

Since most emails weren't encrypted when in transit, a new problem arose: man-in-the-middle attacks, whereby an email is altered by one or more servers in the delivery chain. Thinking about it - this could still happen today even with encryption, if any one server along an email's route is compromised. To this end, another mechanism was desperately needed - one that would allow the receiving mail server to verify that an email's content / headers hadn't been surreptitiously altered since it left the origin mail server - potentially preventing awkward misunderstandings.

Enter stage left: DKIM! DKIM stands for DomainKeys Identified Mail - which, in short, means that it provides a method by which a receiving mail server can cryptographically verify that a message hasn't been altered during transit.

It works by having a public-private keypair, in which the private key creates signatures and the public key can only verify them. A hash of the email's headers / content is computed and signed (loosely speaking, encrypted) with the private key. The resulting signature is then attached to the email in the DKIM-Signature header.

The receiving mail server does a DNS lookup to find the public key, and decrypts the hash. It then computes its own hash of the email headers / content, and compares it against the decrypted hash. If it matches, then the email hasn't been fiddled with along the way!
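
The sign-then-verify idea can be tried out by hand with the openssl command-line tool. To be clear, this is not real DKIM (there's no header selection, canonicalisation, or DNS lookup here) - it's just a sketch of the underlying signature check, using a throwaway keypair and a made-up message. It assumes openssl is installed.

```bash
#!/usr/bin/env bash

work_dir="$(mktemp -d)";

# Throwaway keypair: the private key signs, the public key verifies
openssl genrsa -out "${work_dir}/private.pem" 2048 2>/dev/null;
openssl rsa -in "${work_dir}/private.pem" -pubout -out "${work_dir}/public.pem" 2>/dev/null;

# A made-up message, signed by the "sending server"
printf 'Subject: Launch window\n\nT-minus 10 minutes.\n' > "${work_dir}/message.txt";
openssl dgst -sha256 -sign "${work_dir}/private.pem" \
    -out "${work_dir}/message.sig" "${work_dir}/message.txt";

# Hypothetical helper: checks the message against the signature.
# Prints "pass" or "fail".
check_signature() {
    if openssl dgst -sha256 -verify "${work_dir}/public.pem" \
        -signature "${work_dir}/message.sig" "${work_dir}/message.txt" >/dev/null 2>&1; then
        echo "pass";
    else
        echo "fail";
    fi
}

# The "receiving server" verifies the untouched message...
verify_original="$(check_signature)";

# ...and then a version that was altered in transit
echo "P.S. Please wire the funds to Frank." >> "${work_dir}/message.txt";
verify_tampered="$(check_signature)";

echo "original: ${verify_original}, tampered: ${verify_tampered}";
rm -rf "${work_dir}";
```

The untouched message should verify and the altered one should not, which is exactly the property the receiving mail server relies on.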

Of course, not all the headers in the email are hashed - only a specific subset are included in the hash, since some headers (like Received and X-Spam-Result) are added and altered during transit. If you're interested in implementing DKIM yourself - DigitalOcean have a smashing tutorial on the subject, which should adapt easily to whatever system you're running yourself.

With both of those in place, billsboosters.com's mail server can now verify that mail.bobsrockets.com is allowed to send the email on behalf of bobsrockets.com, and that the message content hasn't been tampered with since it left mail.bobsrockets.com. mail.billsboosters.com can also catch franksfuel.io in the act of trying to deliver spam from seanssatellites.net!

There is, however, one last piece of the puzzle left to reveal. With all this in place, how do you know if your mail was actually delivered? Is it possible to roll SPF and DKIM out gradually so that you can be sure you've done it correctly? This can be a particular issue for businesses and larger email server setups.

This is where DMARC comes in. It's a standard that lets you specify an email address you'd like to receive DMARC reports at, which contain statistics as to how many messages receiving mail servers got that claimed to be from you, and what they did with them. It also lets you specify what percentage of messages should be subject to DMARC filtering, so you can roll everything out slowly. Finally, it lets you specify what should happen to messages that fail the DMARC check (broadly, those that pass neither SPF nor DKIM for your domain) - whether they should be allowed anyway (for testing purposes), quarantined, or rejected.

DMARC policies get specified (yep, you guessed it!) in a DNS record. Unlike SPF though, they go in _dmarc.megsmicroprocessors.org as a TXT record, substituting megsmicroprocessors.org for your domain name. Here's an example:

v=DMARC1; p=none; rua=mailto:[email protected]

This is just a simple example - you can get much more complex ones than this! Let's go through it step by step.

v=DMARC1;

Nothing to see here - just a version number as in SPF.

p=none;

This is the policy of what should happen to messages that fail. In this example we've used none, so messages that fail will still pass right on through. You can set it to quarantine or even reject as you gain confidence in your setup.

rua=mailto:[email protected]

This specifies where you want DMARC reports to be sent. Each mail server that receives mail from your mail server will bundle up statistics and send them once a day to this address. The format is XML (which won't be particularly easy to read), but there are free DMARC report parsers out there on the internet that you can use to decode the reports, like dmarcian.
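
Since a DMARC record is just a list of tag=value pairs separated by semicolons, pulling a single value out of one is straightforward. Here's a minimal sketch of doing so. The dmarc_tag function name and the reports@example.org address are placeholders made up for this example.

```bash
#!/usr/bin/env bash

# Hypothetical helper: prints the value of a single tag from a DMARC record.
# $1    The DMARC record text.
# $2    The name of the tag to look up (e.g. p, rua).
dmarc_tag() {
    local record="${1}" wanted="${2}" part key value;
    local -a parts;
    IFS=';' read -ra parts <<< "${record}";
    for part in "${parts[@]}"; do
        # Trim leading and trailing whitespace
        part="${part#"${part%%[![:space:]]*}"}";
        part="${part%"${part##*[![:space:]]}"}";
        key="${part%%=*}";
        value="${part#*=}";
        if [[ "${key}" == "${wanted}" ]]; then
            echo "${value}";
            return 0;
        fi
    done
    return 1;
}

example_record="v=DMARC1; p=none; rua=mailto:reports@example.org";
dmarc_tag "${example_record}" p;
# → none
dmarc_tag "${example_record}" rua;
# → mailto:reports@example.org
```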

That completes the puzzle. If you're still reading, then congratulations! Post in the comments and say hi :D We've climbed the SPF mountain and discovered how email servers validate who is allowed to send mail in the name of another domain. We've visited the DKIM signature fields and seen how the content of email can be checked to see if it's been altered during transit. Lastly, we took a stroll down DMARC lane to see how it's possible to be sure what other servers are doing with your mail, and how a large email server setup can implement DMARC, DKIM, and SPF more easily.

Of course, I'm not perfect - if there's something I've missed or got wrong, please let me know! I'll try to correct it as soon as possible.

Lastly, this is, as always, a starting point - not an ending point. An introduction if you will - it's up to you to research each technology more thoroughly - especially if you're thinking of implementing them yourself. I'll leave my sources at the bottom of this post if you'd like somewhere to start looking :-)

Sources and Further Reading

Art by Mythdael