
Compiling the wacom driver from source to fix tilt & rotation support

I was all sat down and set up to do some digital drawing the other day, and then I finally snapped. My graphics tablet (a secondhand Wacom Intuos Pro S from Vinted) - which supports pen tilt - was not functioning correctly. Due to a bug that has yet to be patched, the tilt X/Y coordinates were being wrongly interpreted as unsigned integers (i.e. uint32) instead of signed integers (i.e. int32). This had the effect of causing the rotational calculation to jump around randomly, making drawing difficult.

So, given that someone had kindly posted a source patch, I set about compiling the driver from source. For some reason that is currently unclear to me, it is not being merged into the main wacom tablet driver repository. This leaves compiling from source with the patch the only option here that is currently available.

It worked! I was so ecstatic. I had tilt functionality for the first time!

Fast-forward to yesterday....... and it broke again. I first noticed because I am left-handed, and I have a script that flips the mapping of the pad around so I can use it the other way up.

I have since fixed it, but the entire process took me long enough to figure out that I realised I was already halfway to writing a blog post in my comment on the aforementioned GitHub issue. So, I decided to go the rest of the way, write this up into a full blog post / tutorially kinda thing, and do the drawing I wanted to do in the first place tomorrow.

In short, there are 2 parts to this:

  • input-wacom, the kernel driver
  • xf86-input-wacom, the X11 driver that talks to the kernel driver

....and they both have to be compiled separately, as I discovered yesterday.

Who is this for?

If you've got a Wacom Intuos tablet that supports pen tilt / rotation, then this blog post is for you.

Mine is a Wacom Intuos Pro S PTH-460.

This tutorial has been written on Ubuntu 24.04, but it should work for other systems too.

If there's demand, I might put together a package and put it in my apt repo, though naturally this will be limited to the versions of Ubuntu I personally use on my laptop - I do tend to upgrade through the 6-monthly updates, though.

I could also put together an AUR package, but the devices I currently run Artix (an Arch derivative) on don't usually have a tilt-supporting graphics tablet physically nearby when I'm using them, and they run Wayland for unavoidable reasons.

For reference, here's the output of uname -a on the machine I wrote this tutorial on:

Linux MY_DEVICE_NAME 6.8.0-48-generic #48-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 27 14:04:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kompiling the kernel module

Navigate to a clean directory somewhere persistent, as you may need to get back to it later.

If you have the kernel driver installed, then uninstall it now.

On Ubuntu / apt-based systems, the kernel module and the X11 driver bit are bundled together in a single package..... hence why we hafta do all the legwork of compiling and installing both the kernel module and the X11 driver from source :-/

e.g. on Ubuntu:

sudo apt remove xserver-xorg-input-wacom

Then, clone the git repo and checkout the right branch:

git clone https://github.com/jigpu/input-wacom.git -b fix-445
cd input-wacom;

....then, per the official instructions, install the build-time dependencies if required:

sudo apt-get install build-essential autoconf linux-headers-$(uname -r)

...check if you have these installed already by replacing apt-get install with apt-cache policy.

Then, build and install all-in-one:

if test -x ./autogen.sh; then ./autogen.sh; else ./configure; fi && make && sudo make install || echo "Build Failed"

....this will prompt for a password to install directly into your system. I think they recommend doing it this way to simplify the build process for people.

This should complete our khecklist for the kernel module, but to activate it you'll need to reboot.

Don't bother doing that right now on Ubuntu though, since we still have the X11 driver to go. If you're on a system lucky enough to split the 2 drivers up, you can just reboot here.

You can check (after rebooting!) if you've got the right input-wacom kernel module with this command:

grep "" /sys/module/wacom*/version

....my research suggests you need to have a wacom tablet plugged in for this to work.

If you get something like this:

$ grep "" /sys/module/wacom*/version
v2.00

....then you're still using your distribution-provided wacom kernel module. Go uninstall it!

The output you're looking for should look a bit like this:

$ grep "" /sys/module/wacom*/version
v2.00-1.2.0.37.g2c27caa

Compiling the X11 driver

Next up is xf86-input-wacom, the X11 side of things.

Instructions for this are partially sourced from https://github.com/linuxwacom/xf86-input-wacom/wiki/Building-The-Driver#building-with-autotools.

First, install dependencies:

sudo apt-get install autoconf pkg-config make xutils-dev libtool xserver-xorg-dev$(dpkg -S $(which Xorg) | grep -Eo -- "-hwe-[^:]*") libx11-dev libxi-dev libxrandr-dev libxinerama-dev libudev-dev

Then, clone the git repository and checkout the latest release:

git clone https://github.com/linuxwacom/xf86-input-wacom.git
cd "xf86-input-wacom";
git tag; # Pick the latest one from this list
git switch "$(git tag | tail -n1)"; # Basically git switch TAG_NAME

The latest tag should be at the bottom of the list, or at least that's what I found. For me, that was xf86-input-wacom-1.2.3.

Then, to build and install the software from source, run these 2 commands one at a time:

set -- --prefix="/usr" --libdir="$(readlink -e $(ls -d /usr/lib*/xorg/modules/input/../../../ | head -n1))"
if test -x ./autogen.sh; then ./autogen.sh "$@"; else ./configure "$@"; fi && make && sudo make install || echo "Build Failed"

Now you should have the X11 side of things installed. In my case that includes xsetwacom, the (questionably designed) CLI for managing the properties of connected graphics tablets.

If that is not the case for you, you can extract it from the Ubuntu apt package:

apt download xserver-xorg-input-wacom
dpkg -x DEB_FILEPATH_HERE .
ar xv DEB_FILEPATH_HERE # or, if you don't have dpkg for some reason

....then, go locate the tool and put it somewhere in your PATH. I recommend somewhere towards the end in case you forget and fiddle with your setup some more later, so it gets overridden automatically. When I was fiddling around, that was /usr/local/games for me.

Making X11 like the kernel driver

Also known as enabling hotplug support, or getting the kernel module and X11 to play nicely with each other.

This is required to make udev (the daemon that listens for devices to be plugged into the machine and then performs custom actions on them) tell the X server that you've plugged in your graphics tablet, or X11 to recognise that tablet devices are indeed tablet devices, or something else vaguely similar to that effect.

Thankfully, this just requires the installation of a single configuration file in a directory that may not exist for you yet - especially if you uninstalled your distro's wacom driver package.

Do it like this:

sudo mkdir -p /etc/X11/xorg.conf.d/;
sudo curl -sSv https://raw.githubusercontent.com/linuxwacom/xf86-input-wacom/refs/heads/master/conf/70-wacom.conf -o /etc/X11/xorg.conf.d/70-wacom.conf

Just in case they move things around - as I've seen happen in far too many tutorials with broken links before - the direct link to the exact commit of the file I used is:

https://github.com/linuxwacom/xf86-input-wacom/blob/47552e13e714ab6b8c2dcbce0d7e0bca6d8a8bf0/conf/70-wacom.conf

Final steps

With all that done and out of the way, reboot. This serves 2 purposes:

  1. Reloading the correct kernel module
  2. Restarting the X11 server so it has the new driver.

Make sure to use the above instructions to check you are indeed running the right version of the input-wacom kernel module.

If all goes well, tilt/rotation support should now work in the painting program of your choice.

For me, that's Krita, the AppImage of which I bundle into my apt repository because I like the latest version:

https://apt.starbeamrainbowlabs.com/

The red text "Look! Negative TX/TY (TiltX / TiltY) numbers!" crudely overlaid using the Shutter screenshotting tool on top of a screenshot of the Krita tablet tester with a red arrow pointing at the TX/TY values highlighted in yellow.

Conclusion

Phew, I have no idea where this blog post has come from. Hopefully it is useful to someone else out there who also owns a tilt-supporting Wacom tablet and is encountering a similar kinda issue.

Regarding teaching and the previous post, preparing teaching content is thankfully starting to slow down now. Ahead lie the uncharted waters of assessment - it is unclear to me how much energy that will take to deal with.

Hopefully though there will be more PhD time (post on PhD corrections..... eventually) and free energy to spend on writing more blog posts for here! This one was enjoyable to write, if rather unexpected.

Has this helped you? Are you still stuck? Do report any issues to the authors of the above two packages I've shown in this post!

Comments below are also appreciated, both large and small.

Inter-process communication between Javascript and Python

Often, different programming languages are good at different things. To this end, it is sometimes desirable to write different parts of a program in different languages. In no situation is this more apparent than with an application implementing some kind of (ethical, I hope) AI-based feature.

I'm currently kinda-sorta-maybe thinking of implementing a lightweight web interface backed by an AI model (more details if the idea comes to fruition), and while I like writing web servers in Javascript (it really shines with asynchronous input/output), AI models generally don't like being run in Javascript very much - as I have mentioned before, Tensorflow.js has a number of bugs that mean it isn't practically useful for doing anything serious with AI.

Naturally, the solution then is to run the AI stuff in Python (yeah, Python sucks - believe me I know) since it has the libraries for it, and get Javascript/Node.js to talk to the Python subprocess via inter-process communication, or IPC.

While Node.js has a fanceh message-passing system it calls IPC, this doesn't really work when communicating with processes that don't also run Javascript/Node.js. To this end, the solution is to use the standard input (stdin) and standard output (stdout) of the child process to communicate:

A colourful diagram of the IPC setup implemented in this post. Node.js, Python, and Terminal are 3 different coloured boxes. Python talks to Node.js via stdin and stdout as input and output respectively. Python's stderr interacts direct with the terminal, as does Node.js' stdin, stdout, stderr.

(Above: A diagram of how the IPC setup we're going for works.)

This of course turned out to be more nuanced and complicated than I expected, so I thought I'd document it here - especially since the Internet was very unhelpful on the matter.

Let's start by writing the parent Node.js script. First, we need to spawn that Python subprocess, so let's do that:

import { spawn } from 'child_process';
const python = spawn("path/to/child.py", {
    stdio: [ "pipe", "pipe", "inherit" ]
});

...where we set stdin and stdout to pipe mode - which lets us interact with the streams - and the standard error (stderr) to inherit mode, which allows it to share the parent process' stderr. That way errors in the child process propagate upwards and end up in the same log file that the parent process sends its output to.

If you need to send the Python subprocess some data to start with, you have to wait until it is initialised to send it something:

python.on(`spawn`, () => {
    console.log(`[node:data:out] Sending initial data packet`);
    python.stdin.write(`start\n`);
});

...an easier alternative to message passing for small amounts of data would be to set an environment variable when you call child_process.spawn - i.e. env: { key: "value" } in the options object above.

Next, we need to read the responses from the Python script:

import nexline from 'nexline'; // Put this import at the top of the file

const reader = nexline({
    input: python.stdout,
})

for await(const line of reader) {
    console.log(`[node:data:in] ${line}`)
}

The simplest way to do this would be to listen for the data event on python.stdout, but this does not guarantee that each chunk that arrives is actually a line of data, since data between processes is not line-buffered like it is when displaying content in the terminal.

To fix this, I suggest using one of my favourite npm packages: nexline. Believe it or not, handling this issue efficiently with minimal buffering is a lot more difficult than it sounds, so it's just easier to pull in a package to do it for you.

With a nice little for await..of loop, we can efficiently read the responses from the Python child process.

If you were doing this for real, I would suggest wrapping this in an EventEmitter (Node.js) / EventTarget (WHATWG browser spec, also available in Node.js).

Python child process

That's basically it for the parent process, but what does the Python child script look like? It's really quite easy actually:

import sys

sys.stderr.write(f"[python] hai\n")
sys.stderr.flush()

count = 0
for line in sys.stdin:
    sys.stdout.write(f"boop" + str(count) + "\n")
    sys.stdout.flush()
    count += 1

Easy! We can simply iterate sys.stdin to read from the parent Node.js process.

We can write to sys.stdout to send data back to the parent process, but it's important to call sys.stdout.flush()! Node.js doesn't have an equivalent 'cause it's smart, but in Python it may not actually send the response until who-knows-when (if at all) unless you call .flush() to force it to. Think of it as batching graphics draw calls to increase efficiency, but in this case it doesn't work in our favour.

Conclusion

This is just a quick little tutorial on how to implement Javascript/Node.js <--> Python IPC. We deal in plain-text messages here, but I would recommend using JSON - JSON.stringify()/JSON.parse() (Javascript) | json.dumps() / json.loads() (Python) - to serialise / deserialise messages to ensure robustness. JSON by default contains no newline characters and escapes any present into \n, so it should be safe in this instance.
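
For example, here's a minimal hedged sketch of what the Python side of a JSON-per-line exchange might look like - the message fields here are made up purely for illustration:

import sys
import json

for line in sys.stdin:
    message = json.loads(line)                       # one JSON document per line
    reply = { "type": "ack", "received": message }   # hypothetical message structure
    sys.stdout.write(json.dumps(reply) + "\n")       # json.dumps() emits no raw newlines
    sys.stdout.flush()                               # as above: don't forget to flush!

On the Node.js side, the equivalent would be python.stdin.write(JSON.stringify(message) + "\n") when sending, and JSON.parse(line) inside the reader loop when receiving.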

See also JSON Lines, a related specification.

Until next time!

Code

index.mjs:

#!/usr/bin/env node
"use strict";

import { spawn } from 'child_process';
import nexline from 'nexline';

///
// Spawn subprocess
///
const python = spawn("/tmp/x/child.py", {
    env: {  // Erases the parent process' environment variables
        "TEST": "value"
    },
    stdio: [ "pipe", "pipe", "inherit" ]
});

python.on(`spawn`, () => {
    console.log(`[node:data:out] start`);
    python.stdin.write(`start\n`);
});

///
// Send stuff on loop - example
///
let count = 0;
setInterval(() => {
    python.stdin.write(`interval ${count}\n`);
    console.log(`[node:data:out] interval ${count}`);
    count++;
}, 1000);


///
// Read responses
///
const reader = nexline({
    input: python.stdout,
})

for await(const line of reader) {
    console.log(`[node:data:in] ${line}`)
}

child.py:

#!/usr/bin/env python3
import sys

sys.stderr.write(f"[python] hai\n")
sys.stderr.flush()

count = 0
for line in sys.stdin:
    # sys.stderr.write(f"[python:data:in] {line}\n")
    # sys.stderr.flush()

    sys.stdout.write(f"boop" + str(count) + "\n")
    sys.stdout.flush()
    count += 1

Defining AI: Sequenced models || LSTMs and Transformers

Hi again! I'm back for more, and I hope you are too! First we looked at word embeddings, and then image embeddings. Another common way of framing tasks for AI models is handling sequenced data, which I'll talk about briefly today.

Banner showing the text 'Defining AI' on top on translucent white vertical stripes against a voronoi diagram in white against a pink/purple background. 3 progressively larger circles are present on the right-hand side.

Sequenced data can mean a lot of things. Most commonly that's natural language, which we can split up into words (tokenisation) and then bung through some word embeddings before dropping it through some model.

That model is usually specially designed for processing sequenced data, such as an LSTM (or its little cousin the GRU, see that same link). An LSTM is part of a class of models called Recurrent Neural Networks, or RNNs. These models work by feeding their output back into themselves along with the next element of input, thereby processing a sequence of items. The output that feeds back in acts as the model's memory, remembering information across multiple items in the sequence. The model can decide to add or remove things from this memory at each iteration, which is how it learns.

The problem with this class of model is twofold: firstly, it processes everything serially, which means we can't compute it in parallel; and secondly, if the sequence of data it processes is long enough, it will forget things from earlier in the sequence.

In 2017 a model was invented that solves at least 1 of these issues: the Transformer. Instead of working like a recurrent network, a Transformer processes an entire sequence at once, and encodes a time signal (the 'positional embedding') into the input so it can keep track of what was where in the sequence - a downside of processing everything in parallel is that you otherwise lose track of which element of the sequence was where.
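
To make the 'positional embedding' a little more concrete, here's a minimal numpy sketch of the sinusoidal scheme from the original Transformer paper - the function and variable names here are my own, not from any particular library:

import numpy as np

def positional_embedding(seq_length, d_model):
    # Pairs of sine/cosine waves at different frequencies, one value per embedding dimension
    positions = np.arange(seq_length)[:, np.newaxis]              # shape (seq_length, 1)
    dims = np.arange(d_model)[np.newaxis, :]                      # shape (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions → sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions → cosine
    return angles                                                 # added to the input embeddings

print(positional_embedding(128, 64).shape)                        # (128, 64)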

The transformer model also brought with it the concept of 'attention', which is where the model can decide what parts of the input data are important at each step. This helps the transformer to focus on the bits of the data that are relevant to the task at hand.

Since 2017, the number of variants of the original transformer has exploded, which stands testament to the popularity of this model architecture.

Regarding inputs and outputs, most transformers will take an input in the same shape as the word embeddings in the first post in this series, and will spit out an output in the same shape - just potentially with a different size for the embedding dimension:

A diagram explaining how a transformer works. A series of sine waves are added as a positional embedding to the data before it goes in.

In this fashion, transformers are only limited by memory requirements and computational expense with respect to sequence lengths, which has been exploited in some model designs to convince a transformer-style model to learn to handle some significantly long sequences.


This is just a little look at sequenced data modelling with LSTMs and Transformers. I seem to have my ordering a bit backwards here, so in the next few posts we'll get down to basics and explain some concepts like Tensors and shapes, loss functions and backpropagation, attention, and how AI models are put together.

Are there any AI-related concepts or questions you would like answering? Leave a comment below and I'll write another post in this series to answer your question.

Sources and further reading

.desktop files: Launcher icons on Linux

Heya! Just thought I'd write a quick reminder post for myself on the topic of .desktop files. In most Linux distributions, launcher icons for things are dictated by files with the file extension .desktop.

Of course, most programs these days come with a .desktop file automatically, but if you for example download an AppImage, then you might not get an auto-generated one. You might also be packaging something for your distro's package manager (go you!) - something I do semi-regularly when apt repos for software I need aren't updated (see my apt repository!).

They can live either locally to your user account (~/.local/share/applications) or globally (/usr/share/applications), and they follow the XDG desktop entry specification (see also the Arch Linux docs page, which is fabulous as usual ✨). It's basically a fancy .ini file:

[Desktop Entry]
Encoding=UTF-8
Type=Application
Name=Krita
Comment=Krita: Professional painting and digital art creation program
Version=1.0
Terminal=false
Exec=/usr/local/bin/krita
Icon=/usr/share/icons/krita.png

Basically, leave the first line, the Type, the Version, the Terminal, and the Encoding directives alone, but the others you'll want to customise:

  • Name: The name of the application to be displayed in the launcher
  • Comment: The short 1-line description. Some distros display this as the tooltip on hover, others display it in other ways.
  • Exec: Path to the binary to execute. Prepend with env Foo=bar etc if you need to set some environment variables (e.g. running on a discrete card - 27K views? wow O.o)
  • Icon: Path to the icon to display as the launcher icon. For global .desktop files, this should be located somewhere in /usr/share/icons.

This is just the basics. There are many other directives you can include - like the Categories directive, which describes - if your launcher supports categories - which categories a given launcher icon should appear under.

Troubleshooting: It took me waaay too long to realise this, but if you have put your .desktop file in the right place and it isn't appearing - even after a relog - then the desktop-file-validate command could come in handy:

desktop-file-validate path/to/file.desktop

It validates the content of a given .desktop file. If it contains any errors, then it will complain about them for you - unlike your desktop environment which just ignores .desktop files that are invalid.

If you found this useful, please do leave a comment below about what you're creating launcher icons for!

Sources and further reading

Defining AI: Image segmentation

Banner showing the text 'Defining AI' on top on translucent white vertical stripes against a voronoi diagram in white against a pink/purple background. 3 progressively larger circles are present on the right-hand side.

Welcome back to Defining AI! This series is all about defining various AI-related terms in short-and-sweet blog posts.

In the last post, we took a quick look at word embeddings, which is the key behind how AI models understand text. In this one, we're going to investigate image segmentation.

Image segmentation is a particular way of framing a learning task for an AI model that takes an image as an input, and then instead of classifying the image into a given set of categories, it classifies every pixel by the category each one belongs to.

The simplest form of this is what's called semantic segmentation, where we classify each pixel into a given set of categories - e.g. building, car, sky, road, etc if we were implementing a segmentation model for some automated vehicle.

If you've been following my PhD journey for a while, you'll know that it's not just images that can be framed as an 'image' segmentation task: any data that is 2D (or can be convinced to pretend to be 2D) can be framed as an image segmentation task.

The output of an image segmentation model is basically a 3D map. This map will obviously have width and height dimensions, but also an extra dimension for the channel. This is best explained with an image:

(Above: a diagram explaining how the output of an image segmentation model is formatted. See below explanation. Extracted from my work-in-progress thesis!)

Essentially, each value in the 'channel' dimension will be the probability that the pixel belongs to that class. So, for example, a single pixel in a model for predicting the background and foreground of an image might look like this:

[ 0.3, 0.7 ]

....if we consider these classes:

[ background, foreground ]

....then this pixel has a 30% chance of being a background pixel, and a 70% chance of being a foreground pixel - so we'd likely assume it's a foreground pixel.

Built up over the course of an entire image, you end up with a classification of every pixel. This could lead to models that separate the foreground and background in live video feeds, autonomous navigation systems, defect detection in industrial processes, and more.
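
To turn those per-pixel probabilities into an actual class map, you typically take the argmax over the channel dimension. A minimal numpy sketch, with made-up shapes for illustration:

import numpy as np

# Hypothetical model output: height x width x channels, one channel per class
prediction = np.random.rand(256, 256, 2)                          # [ background, foreground ]
prediction /= prediction.sum(axis=-1, keepdims=True)              # pretend it has been softmaxed

class_map = np.argmax(prediction, axis=-1)                        # shape (256, 256), values 0 or 1
print(class_map.shape, np.unique(class_map))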

Some models, such as the Segment Anything Model (website) have even been trained to generically segment any input image, as in the above image where we have a snail sat on top of a frog sat on top of a turtle, which is swimming in the water.

As alluded to earlier, you can also feed in other forms of 2D data. For example, this paper predicts rainfall radar data a given number of hours into the future from a sample at the present moment. Or, my own research approximates the function of a physics-based model!


That's it for this post. If you've got something you'd like defining, please do leave a comment below!

I'm not sure what it'll be next, but it might be either staying at a high-level and looking at different ways that we can frame tasks in AI models, or I could jump to a lower level to look at fundamentals like loss (error) functions, backpropagation, layers (AI models are made up of multiple smaller layers), etc. What would you like to see next?

Defining AI: Word embeddings

Hey there! It's been a while. After writing my thesis for the better part of a year, I've been getting a bit burnt out on writing - so unfortunately I had to take a break from writing for this blog. My thesis is almost complete though - more on this in the next post in the PhD update blog post series. Other higher-effort posts are coming (including the belated 2nd post on NLDL-2024), but in the meantime I thought I'd start a series on defining various AI-related concepts. Each post is intended to be relatively short in length to make them easier to write.

Normal scheduling will resume soon :-)

Banner showing the text 'Defining AI' on top on translucent white vertical stripes against a voronoi diagram in white against a pink/purple background. 3 progressively larger circles are present on the right-hand side.

As you can tell by the title of this blog post, the topic for today is word embeddings.

AI models operate fundamentally on numerical values and mathematics - so naturally to process text one has to encode said text into a numerical format before it can be shoved through any kind of model.

This process of converting text to a numerically-encoded value is called word embedding!

As you might expect, there are many ways of doing this. This usually involves looking at what other words a given word often appears next to. It could, for example, be framed as a task in which a model has to predict a word given the words immediately before and after it in a sentence, with the output of the last layer before the output layer then taken as the embedding (word2vec).

Other models use matrix math to calculate this instead, producing a dictionary file as an output (GloVe | paper). Still others use large models that are trained to predict randomly masked words, and process entire sentences at once (BERT and friends) - though these are computationally expensive since every bit of text you want to embed has to get pushed through the model.

Then there are contrastive learning approaches. More on contrastive learning later in the series if anyone's interested, but essentially it learns by comparing pairs of things. This can lead to a higher-level representation of the input text, which can increase performance in some circumstances, along with other fascinating side effects that I won't go into in this post. Chief among these approaches is CLIP (blog post).

The idea here is that semantically similar words wind up having similar sorts of numbers in their numerical representation (we call this a vector). This is best illustrated with a diagram:

Words embedded with GloVe and displayed in a heatmap

I used GloVe to embed some words (it's really easy to use since it's literally just a dictionary), and then used cosine similarity to measure how alike the different words are. Once done, I plotted this in a heatmap.

As you can see, rain and water are quite similar (1 = identical; 0 = completely different), but rain and unrelated are not really alike at all.
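
If you fancy playing along at home, the cosine similarity part is only a few lines of Python. Here's a sketch of roughly what I did - the GloVe filename and the choice of words are just examples:

import numpy as np

def load_glove(filepath):
    # Each line of a GloVe .txt file is a word followed by its vector components
    embeddings = {}
    with open(filepath, "r", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return embeddings

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

glove = load_glove("glove.6B.50d.txt")  # substitute whichever GloVe file you downloaded
print(cosine_similarity(glove["rain"], glove["water"]))
print(cosine_similarity(glove["rain"], glove["unrelated"]))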


That's about the long and short of word embeddings. As always with these things, you can go into an enormous amount of detail, but I have to cut it off somewhere.

Are there any AI-related concepts or questions you would like answering? Leave a comment below and I'll write another post in this series to answer your question.

LaTeX templates for writing with the University of Hull's referencing style

Hello, 2024! I'm writing this while it is still some time before the new year, but I realised just now (a few weeks ago for you) that I never blogged about the LaTeX templates I have been maintaining for a few years now.

It's no secret that I do all of my formal academic writing in LaTeX - a typesetting language that is the industry standard in the field of Computer Science (and others too, I gather). While it's a very flexible (and at times obtuse, but this is a tale for another time) language, actually getting started is a pain. To make this process less painful, I have developed over the years a pair of templates that make starting off much easier.

A key issue (and skill) in academic writing is properly referencing things, and most places have their own specific referencing style you have to follow. The University of Hull is no different, so I knew from the very beginning that I needed a solution.

I can't remember who I received it from, but someone (comment below if you remember who it was, and I'll properly credit!) gave me a .bst BibTeX referencing style file that matches the University of Hull's referencing style.

I've been using it ever since, and I have also applied a few patches to it for some edge cases I have encountered that it doesn't handle. I do plan on keeping it up to date for the foreseeable future with any changes they make to the aforementioned referencing style.

My templates also include this .bst file to serve as a complete starting point. There's one with a full title page (e.g. for theses, dissertations, etc), and another with just a heading that sits at the top of the document, just like a paper you might find on Semantic Scholar.

Note that I do not guarantee that the referencing style matches the University of Hull's style. All I can say is that it works for me and implements this specific referencing style.

With that in mind, I'll leave the README of the git repository to explain the specifics of how to get started with them:

https://git.starbeamrainbowlabs.com/Demos/latex-templates

They are stored on my personal git server, but you should be able to clone them just fine. Patches are most welcome via email (check the homepage of my website!)!

Building the science festival demo: How to monkeypatch an npm package

A pink background dotted with bananas, with the patch-package logo front and centre, and the npm logo small in the top-left. Small brown package boxes are present in the bottom 2 corners.

In a previous post, I talked about the nuts and bolts of the demo on a technical level, and how it's all put together. I alluded to the fact that I had to monkeypatch Babylon.js to disable the gamepad support because it was horribly broken, and I wanted to dedicate an entire post to the subject.

Partly because it's a clever hack I used, and partly because if I ever need to do something similar again I want a dedicated tutorially-style post on how I did it so I can repeat the process.

Monkeypatching an npm package after installation in a reliable way is an inherently fragile task: it is not something you want to do if you can avoid it. In some cases though, it's unavoidable:

  1. If you're short on time, and need something to work
  2. If you are going to submit a pull request to fix something now, but need an interim workaround until your pull request is accepted upstream
  3. If upstream doesn't want to fix the problem, and you're forced to either maintain a patch or fork upstream into a new project (which is a lot more work).

We'll assume that one of these 3 cases is true.

In the game Factorio, there's a saying 'there's a mod for that' that is often repeated in response to questions in discourse about the game. The same is true of Javascript: If you need to do a non-trivial thing, there's usually an npm package that does it that you can lean on instead of reinventing the wheel.

In this case, that package is called patch-package. patch-package is a lovely little tool that enables you to do 2 related things:

  a) Generate patch files simply by editing a given npm package in-situ
  b) Automatically and transparently apply generated patch files on npm install, requiring no additional setup steps should you clone your project down from its repository and run npm install.

Assuming you have a working setup with the target npm package you want to patch already installed, first install patch-package:

npm install --save patch-package

Note: We don't --save-dev here, because patch-package needs to run any time the target package is installed... not just in your development environment - unless the target package to patch is also a development dependency.

Next, delve into node_modules/ and directly edit the files associated with the target package you want to edit.

Sometimes, projects will ship multiple npm packages, with one containing the pre-minified build distribution, and the other distributing the raw source - e.g. if you have your own build system like esbuild and want to tree-shake it.

This is certainly the case for Babylon.js, so I had to switch from the main babylonjs package to @babylonjs/core, which contains the source. Unfortunately the official documentation for Babylon.js is rather inconsistent, which can lead to confusion when using the latter, but once I figured out how the imports worked it all came out in the wash.

Once done, generate the patch file for the target package like so:

npx patch-package your-package-name-here

This should create a patch file in the directory patches/ alongside your package.json file.

The final step is to enable automatic and transparent application of the new patch file on package installation. To do this, open up your package.json for editing, and add the following to the scripts object:

"scripts": {
    "postinstall": "patch-package"
}

...so a complete example might look a bit like this:

{
    "name": "research-smflooding-vis",
    "version": "1.0.0",
    "description": "Visualisations of the main smflooding research for outreach purposes",
    "main": "src/index.mjs",

    // ....

    "scripts": {
        "postinstall": "patch-package",
        "test": "echo \"No tests have been written yet.\"",
        "build": "node src/esbuild.mjs",
        "watch": "ESBUILD_WATCH=yes node src/esbuild.mjs"
    },

    // ......

    "dependencies": {
        // .....
    }
}

That's really all you need to do!

After you've applied the patch like this, don't forget to commit your changes to your git/mercurial/whatever repository.

I would also advise being a bit careful installing updates to any packages you've patched in future, in case of changes - though of course installing dependency package updates is vitally important to keep your code up to date and secure.

As a rule of thumb, I recommend actively working to minimise the number of patches you apply to packages, and only use this method as a last resort.

That's all for this post. In future posts, I want to look more at the AI theory behind the demo, its implications, and what it could mean for research in the field in the future (is there even a kind of paper one writes about things one learns from outreach activities that accidentally have a bearing on my actual research? and would it even be worth writing something formal? a question for my supervisor and commenters on that blog post when it comes out I think).

See you in the next post!

(Background to post banner: Unsplash)

AI encoders demystified

When beginning the process of implementing a new deep learning / AI model, alongside deciding on the inputs and outputs of the model and converting your data to tfrecord files (file extension: .tfrecord.gz; you really should, the performance boost is incredible), you also need to design the actual architecture of the model itself. It might be tempting to shove a bunch of CNN layers at the problem to make it go away, but in this post I hope to convince you to think again.

As with any field, extensive research has already been conducted, and when it comes to AI that means that the effective encoding of various different types of data with deep learning models has already been carefully studied.

To my understanding, there are 2 broad categories of data that you're likely to encounter:

  1. Images: photos, 2D maps of sensor readings, basically anything that could be plotted as an image
  2. Sequenced data: text, sensor readings with only a temporal and no spatial dimension, audio, etc

Sometimes you'll encounter a combination of these two. That's beyond the scope of this blog post, but my general advice is:

  • If it's 3D, consider it a 2D image with multiple channels to save VRAM
  • Go look up specialised model architectures for your problem and how well they work/didn't work
  • Use a Convolutional variant of a sequenced model or something
  • Pick one of these encoders and modify it slightly to fit your use case

In this blog post, I'm going to give a quick overview of the various state of the art encoders for the 2 categories of data I've found on my travels so far. Hopefully, this should give you a good starting point for building your own model. By picking an encoder from this list as a starting point for your model and reframing your problem just a bit to fit, you should find that you have not only a much simpler problem/solution, but also a more effective one too.

I'll provide links to the original papers for each model, but also links to any materials I've found that I found helpful in understanding how they work. This can be useful if you find yourself in the awkward situation of needing to implement them yourself, but it's also generally good to have an understanding of the encoder you ultimately pick.

I've read a number of papers, but there's always the possibility that I've missed something. If you think this is the case, please comment below. This is especially likely if you are dealing with audio data, as I haven't looked into handling that too much yet.

Finally, also out of scope of this blog post are the various ways of framing a problem for an AI to understand it better. If this interests you, please comment below and I'll write a separate post on that.

Images / spatial data

ConvNeXt: Image encoders are a shining example of how a simple stack of CNNs can be iteratively improved through extensive testing, and in no encoder is this more apparent than in ConvNeXt.

In 1 sentence, a ConvNeXt model can be summarised as "a ResNet but with the features of a transformer". Researchers took a bog-standard ResNet, and then iteratively tested and improved it by adding various features that you'd normally find in a transformer, which results in the most performant (read: highest accuracy) encoder when compared to the other 2 on this list.

ConvNeXt is proof that CNN-based models can be significantly improved by incorporating more than just a simple stack of CNN layers.

Vision Transformer: Born from when someone asked what would happen if they tried to put an image into a normal transformer (see below), Vision Transformers are a variant of a normal transformer that handles images and other spatial data instead of sequenced data (e.g. sentences).

The current state of the art is the Swin Transformer to the best of my knowledge. Note that ConvNeXt outperforms Vision Transformers, but the specifics of course depend on your task.

ResNet: Extremely popular, you may have heard of the humble ResNet before. It was originally made a thing when someone wanted to know what would happen if they took the number of layers in a CNN-based model to the extreme. It has a 'residual connection' between blocks of layers, which avoids the 'vanishing gradient problem': in short, when your model is too deep, the gradient used to adjust the weights when backpropagating the error becomes so small that it has hardly any effect.

Since this model was invented, skip connections (or 'residual connections', as they are sometimes known) have become a regular feature in all sorts of models - especially those deeper than just a couple of layers.

Despite this significant advancement though, I recommend using a ConvNeXt encoder instead for images - it's better, and the architecture is more well-tuned than a bog-standard ResNet.
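
If you're using Tensorflow/Keras, many of these image encoders come pre-built (and optionally pre-trained) in tf.keras.applications, which makes them easy to drop in as a starting point. A hedged sketch - the exact classes available depend on your Tensorflow version (the ConvNeXt variants only appeared in more recent releases):

import tensorflow as tf

# ResNet50 as a feature encoder: drop the classification head with include_top=False
encoder = tf.keras.applications.ResNet50(
    include_top=False,          # we only want the encoder part
    weights="imagenet",         # or None to train from scratch
    input_shape=(224, 224, 3),
    pooling="avg",              # global average pool → one feature vector per image
)

features = encoder(tf.random.normal([ 8, 224, 224, 3 ]))
print(features.shape)           # (8, 2048)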

Sequenced Data

Transformer: The big one that is quite possibly the most famous encoder-decoder ever invented. The Transformer replaces the LSTM (see below) for handling sequenced data. It has 2 key advantages over an LSTM:

  • It's more easily paralleliseable, which is great for e.g. training on GPUs
  • The attention mechanism it implements revolutionises the performance of like every network ever

The impact of the Transformer on AI cannot be overstated. In short: If you think you want an LSTM, use a transformer instead.

  • Paper: Attention is all you need
  • Explanation: The Illustrated Transformer
  • Implementation: Many around, but I have yet to find a simple enough one, so I implemented it myself from scratch. While I've tested the encoder and I know it works, I have yet to fully test the decoder. Once I have done this, I will write a separate blog post about it and add the link here.

LSTM: Short for Long Short-Term Memory, LSTMs were invented in 1997 to solve the vanishing gradient problem, in which gradients shrink to vanishingly small values when backpropagating through recurrent models that handle sequenced data, rendering them ineffective at learning long-term relationships.

Superseded by the (non-recurrent) Transformer, LSTMs are a recurrent model architecture, which means that their output feeds back into themselves. While this enables some unique model architectures for learning sequenced data (exhibit A), it also makes them horrible at being parallelised on a GPU (the dilated LSTM architecture attempts to remedy this, but Transformers are just better). The only thing that sets them apart from the Transformer is that the Transformer is somehow not built into Tensorflow as standard yet, whereas LSTMs are.

Just like Vision Transformers adapt the Transformer architecture for multidimensional data, so too do Grid LSTMs adapt normal LSTMs.

Conclusion

In summary, for images the encoders in priority order are: ConvNeXt, Vision Transformer (Swin Transformer), ResNet; and for text/sequenced data: Transformer, LSTM.

We've looked at a small selection of model architectures for handling a variety of different data types. This is not an exhaustive list, however. You might have another awkward type of data to handle that doesn't fit into either of these categories - e.g. specialised models exist for handling audio, but the general rule of thumb is an encoder architecture probably already exists for your use-case - even if I haven't listed it here.

Also of note are alternative use cases for the data types I've covered here. For example, if I'm working with images I would use a ConvNeXt, but if model prediction latency and/or resource consumption mattered I would consider using a MobileNet (https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v3), which, while a smaller model, is designed for producing rapid predictions in lower-resource environments - e.g. on mobile phones.

Finally, while these are encoders, decoders also exist for various tasks. Often, they are tightly integrated into the encoder. For example, the U-Net is designed for image segmentation. Listing these is out of scope of this article, but if you are getting into AI to solve a specific problem (as is often the case), I strongly recommend looking to see if an existing model/decoder architecture has been designed to solve your particular problem. It is often much easier to adjust your problem to fit an existing model architecture than it is to design a completely new architecture to fit your particular problem (trust me, I've tried this already and it was a Bad Idea).

The first one you see might not even be the best / state of the art out there - e.g. Transformers are better than the more widely used LSTMs. Surveying the landscape for your particular task (and figuring out how to frame it in the first place) is critical to the success of your model.

Easily write custom Tensorflow/Keras layers

At some point when working on deep learning models with Tensorflow/Keras for Python, you will inevitably encounter a need to use a layer type in your models that doesn't exist in the core Tensorflow/Keras for Python (from here on just simply Tensorflow) library.

I have encountered this need several times, and rather than e.g. subclassing tf.keras.Model, there's a much easier way - and if you just have a simple sequential model, you can even keep using tf.keras.Sequential with custom layers!

Background

First, some brief background on how Tensorflow is put together. The most important thing to remember is Tensorflow likes very much to compile things into native code using what we can think of as an execution graph.

In this case, by execution graph I mean a directed graph that defines the flow of information through a model or some other data processing pipeline. This is best explained with a diagram:

A simple stack of Keras layers illustrated as a directed graph

Here, we have a simple Keras AI model for classifying images, which you might define with the functional API. I haven't tested this model - it's just to illustrate an example (use something e.g. like MobileNet if you want a relatively small model for image classification).

The layer stack starts at the top and works its way downwards.

When you call model.compile(), Tensorflow compiles this graph into native code for faster execution. This is important, because when you define a custom layer, you may only use Tensorflow functions to operate on the data, not Python/Numpy/etc ones.

You may have already encountered this limitation if you have defined a Tensorflow function with tf.function(some_function).

The reason for this is the specifics of how Tensorflow compiles your model. Now consider this graph:

A graph of Tensorflow functions

Basic arithmetic operations on tensors as well as more complex operators such as tf.stack, tf.linalg.matmul, etc operate on tensors as you'd expect in a REPL, but in the context of a custom layer or tf.function they operate not on real tensors, but on symbolic ones instead.

It is for this reason that when you implement a tf.function to use with tf.data.Dataset.map() for example, the Python function itself only gets executed once - when it's traced into the execution graph.
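
Here's a small sketch that makes this visible: the Python-side print() only happens while the function is being traced into a graph, whereas tf.print() is a graph operation and runs for every element:

import tensorflow as tf

def double(x):
    print("Tracing!")               # Python code: runs once, at trace time
    tf.print("Executing with", x)   # graph operation: runs for every element
    return x * 2

dataset = tf.data.Dataset.range(4).map(double)  # map() traces double() into a graph
for item in dataset:
    print(item.numpy())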

Custom layers for the win!

With this in mind, we can relatively easily put together a custom layer. It's perhaps easiest to show a trivial example and then explain it bit by bit.

I recommend declaring your custom layers each in their own file.

import tensorflow as tf

class LayerMultiplier(tf.keras.layers.Layer):
    def __init__(self, multiplier=2, **kwargs):
        super(LayerMultiplier, self).__init__(**kwargs)

        self.param_multiplier = multiplier
        self.tensor_multiplier = tf.constant(multiplier, dtype=tf.float32)

    def get_config(self):
        config = super(LayerMultiplier, self).get_config()

        config.update({
            "multiplier": self.param_multiplier
        })

        return config

    def call(self, input_thing, training, **kwargs):
        return input_thing * self.tensor_multiplier

Custom layers are subclassed from tf.keras.layers.Layer. There are a few parts to a custom layer:

The constructor (__init__) works as you'd expect. You can take in custom (hyper)parameters (which should not be tensors) here and use them to control the operation of your custom layer.

get_config() must ultimately return a dictionary of arguments to pass to instantiate a new instance of your layer. This information is saved along with the model when you save a model e.g. with tf.keras.callbacks.ModelCheckpoint in .hdf5 mode, and then used when you load a model with tf.keras.models.load_model (more on loading a model with custom layers later).

A paradigm I usually adopt here is setting self.param_ARG_NAME_HERE fields in the constructor to the value of the parameters I've taken in, and then spitting them back out again in get_config().

call() is where the magic happens. This is called when you call model.compile(), with a symbolic tensor that stands in for the real tensor, in order to build an execution graph as explained above.

The first argument is always the output of the previous layer. If your layer expects multiple inputs, then this will be an array of (potentially symbolic) tensors rather than a (potentially symbolic) tensor directly.

The second argument is whether you are in training mode or not. You might not be in training mode if:

  1. You are spinning over the validation dataset
  2. You are making a prediction / doing inference
  3. Your layer is frozen for some reason

Sometimes you may want to do something differently if you are in training mode vs not training mode (e.g. dataset augmentation), and Tensorflow is smart enough to ensure this is handled as you'd expect.

Note also here that I use a native multiplication with the asterisk * operator. This works because Tensorflow tensors (whether symbolic or otherwise) overload this and other operators so you don't need to call tf.math.multiply, tf.math.divide, etc explicitly yourself, which makes your code neater.
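
A quick illustration of what I mean, as you might type it into a REPL:

import tensorflow as tf

a = tf.constant([ 1.0, 2.0, 3.0 ])
b = tf.constant([ 4.0, 5.0, 6.0 ])

print(a * b)                     # the overloaded operator...
print(tf.math.multiply(a, b))    # ....does the same thing as the explicit call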

That's it, that's all you need to do to define a custom layer!

Using and saving

You can use a custom layer just like a normal one. For example, using tf.keras.Sequential:

import tensorflow as tf

from .components.LayerMultiplier import LayerMultiplier

def make_model(batch_size, multiplier):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(96),
        tf.keras.layers.Dense(32),
        LayerMultiplier(multiplier=5),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.build(input_shape=tf.TensorShape([ batch_size, 32 ]))
    model.compile(
        optimizer="Adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[
            tf.keras.losses.SparseCategoricalAccuracy()
        ]
    )
    return model

The same goes here for the functional API. I like to put my custom layers in a components directory, but you can put them wherever you like. Again here, I haven't tested the model at all, it's just for illustrative purposes.

Saving works as normal, but for loading a saved model that uses a custom layer, you need to provide a dictionary of custom objects:

loaded_model = tf.keras.models.load_model(filepath_checkpoint, custom_objects={
    "LayerMultiply": LayerMultiply,
})

If you have multiple custom layers, define all the ones you use here. It doesn't matter if you define extras, it seems; it'll just ignore the ones that aren't used.

Going further

This is far from all you can do. In custom layers, you can also:

  • Instantiate sublayers or models (tf.keras.Model inherits from tf.keras.layers.Layer)
  • Define custom trainable weights (tf.Variable)

Instantiating sublayers is very easy. Here's another example layer:

import tensorflow as tf

class LayerSimpleBlock(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super(LayerSimpleBlock, self).__init__(**kwargs)

        self.param_units = units

        self.block = tf.keras.Sequential([
            tf.keras.layers.Dense(self.param_units),
            tf.keras.layers.Activation("gelu")
            tf.keras.layers.Dense(self.param_units),
            tf.keras.layers.LayerNormalization()
        ])

    def get_config(self):
        config = super(LayerSimpleBlock, self).get_config()

        config.update({
            "units": self.param_units
        })

        return config

    def call(self, input_thing, training, **kwargs):
        return self.block(input_thing, training=training)

This would work with a single sublayer too.

Custom trainable weights are also easy, but require a bit of extra background. If you're reading this post, you have probably heard of gradient descent. The specifics of how it works are out of scope of this blog post, but in short it's the underlying core algorithm deep learning models use to learn, stepping bit by bit towards lower error.

Tensorflow goes looking for all the weights in a model during the compilation process (see the explanation on execution graphs above) for you, and this includes custom weights.

You do, however, need to mark a tensor as a weight - otherwise Tensorflow will assume it's a static value. This is done through the use of tf.Variable:

tf.Variable(name="some_unique_name", initial_value=tf.random.uniform([64, 32]))

As far as I've seen so far, tf.Variable()s need to be defined in the constructor of a tf.keras.layers.Layer, for example:

import tensorflow as tf

class LayerSimpleBlock(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(LayerSimpleBlock, self).__init__(**kwargs)

        self.weight = tf.Variable(name="some_unique_name", initial_value=tf.random.uniform([64, 32]))

    def get_config(self):
        config = super(LayerSimpleBlock, self).get_config()
        return config

    def call(self, input_thing, training, **kwargs):
        return input_thing * self.weight

After you define a variable in the constructor, you can use it like a normal tensor - after all, in Tensorflow (and probably other deep learning frameworks too), tensors don't always have to hold an actual value at the time of execution as I explained above (I call tensors that don't contain an actual value like this symbolic tensors, since they are like stand-ins for the actual value that gets passed after the execution graph is compiled).

Conclusion

We've looked at defining custom Tensorflow/Keras layers that you can use without giving up tf.keras.Sequential() or the functional API. I've also shown how, by compiling Python function calls into native code using an execution graph, many orders of magnitude of performance gains can be obtained, fully saturating GPU usage.

We've also touched on defining custom weights in custom layers, which can be useful depending on what you're implementing. As a side note, should you need a weight in a custom loss function, you'll need to define it in the constructor of a tf.keras.layers.Layer and then pull it out and pass it to your subclass of tf.keras.losses.Loss.

By defining custom Tensorflow/Keras layers, we can implement new cutting-edge deep learning logic that is easy to use. For example, I have implemented a Transformer with a trio of custom layers, and CBAM: Convolutional Block Attention Module also looks very cool - I might implement it soon too.

I haven't posted a huge amount about AI / deep learning on here yet, but if there's any topic (machine learning or otherwise) that you'd like me to cover, I'm happy to consider it - just leave a comment below.
