
Defining AI: Sequenced models || LSTMs and Transformers

Hi again! I'm back for more, and I hope you are too! First we looked at word embeddings, and then image segmentation. Another common way of framing tasks for AI models is handling sequenced data, which I'll talk about briefly today.

Banner showing the text 'Defining AI' on top on translucent white vertical stripes against a voronoi diagram in white against a pink/purple background. 3 progressively larger circles are present on the right-hand side.

Sequenced data can mean a lot of things. Most commonly that's natural language, which we can split up into words (tokenisation) and then bung through some word embeddings before dropping it through some model.

That model is usually specially designed for processing sequenced data, such as an LSTM (or its little cousin the GRU, see that same link). An LSTM is part of a class of models called Recurrent Neural Networks, or RNNs. These models work by feeding their output back into themselves along with the next element of input, thereby processing a sequence of items. The output bit that feeds back in acts as the model's memory, remembering stuff across multiple items in the sequence. The model can decide to add or remove things from this memory at each iteration, which is how it learns.

The problem with this class of model is twofold: firstly, it processes everything serially, which means we can't compute it in parallel; and secondly, if the sequence of data it processes is long enough, it will forget things from earlier in the sequence.

In 2017 a model was invented that solves at least 1 of these issues: the Transformer. Instead of working like a recurrent network, a Transformer processes an entire sequence at once, and encodes a time signal (the 'positional embedding') into the input so it can keep track of what was where in the sequence - a downside of processing everything in parallel is that you otherwise lose track of which element of the sequence was where.
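
To make the 'time signal' idea concrete, here's a minimal sketch of the sinusoidal positional embedding from the original Transformer paper - just plain numpy, with variable names of my own choosing:

import numpy as np

def positional_embedding(seq_length, embed_dim):
    # Each pair of embedding dimensions gets a sine/cosine wave of a different frequency
    positions = np.arange(seq_length, dtype=np.float32)[:, np.newaxis]    # (seq_length, 1)
    dims = np.arange(embed_dim, dtype=np.float32)[np.newaxis, :]          # (1, embed_dim)
    angles = positions / np.power(10000, (2 * (dims // 2)) / embed_dim)   # (seq_length, embed_dim)
    angles[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions: cosine
    return angles

# The time signal is simply added to the word embeddings before they enter the model
embedded_words = np.random.rand(10, 64)                       # e.g. 10 words, 64 embedding dimensions
model_input = embedded_words + positional_embedding(10, 64)

Because each position gets a unique pattern of waves, the model can recover where each element was, even though it sees the whole sequence at once.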

The transformer model also brought with it the concept of 'attention', which is where the model can decide what parts of the input data are important at each step. This helps the transformer to focus on the bits of the data that are relevant to the task at hand.
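
The core of attention is surprisingly little maths. Here's a rough sketch of scaled dot-product attention (the flavour used in the Transformer) in plain numpy - not anyone's production implementation, just an illustration:

import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    # How relevant is each element of the sequence to each other element?
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # Softmax turns the scores into weights that sum to 1 for each query
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # The output for each element is a weighted mix of the values it attended to
    return weights @ values

# In self-attention, the queries, keys, and values are all (projections of) the same sequence
sequence = np.random.rand(6, 64)    # 6 items, 64-dimensional embeddings
output = scaled_dot_product_attention(sequence, sequence, sequence)
print(output.shape)                 # (6, 64)

In a real Transformer the queries, keys, and values come from learned linear projections, and several of these 'attention heads' run side by side.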

Since 2017, the number of variants of the original transformer has exploded, which stands testament to the popularity of this model architecture.

Regarding inputs and outputs, most transformers will take an input in the same shape as the word embeddings in the first post in this series, and will spit it out in the same shape - just potentially with a different size for the embedding dimension:

A diagram explaining how a transformer works. A series of sine waves are added as a positional embedding to the data before it goes in.

In this fashion, transformers are only limited by memory requirements and computational expense with respect to sequence lengths, which has been exploited in some model designs to convince a transformer-style model to learn to handle some significantly long sequences.


This is just a little look at sequenced data modelling with LSTMs and Transformers. I seem to have my ordering a bit backwards here, so in the next few posts we'll get down to basics and explain some concepts like Tensors and shapes, loss functions and backpropagation, attention, and how AI models are put together.

Are there any AI-related concepts or questions you would like answering? Leave a comment below and I'll write another post in this series to answer your question.


Defining AI: Image segmentation

Banner showing the text 'Defining AI' on top on translucent white vertical stripes against a voronoi diagram in white against a pink/purple background. 3 progressively larger circles are present on the right-hand side.

Welcome back to Defining AI! This series is all about defining various AI-related terms in short-and-sweet blog posts.

In the last post, we took a quick look at word embeddings, which is the key behind how AI models understand text. In this one, we're going to investigate image segmentation.

Image segmentation is a particular way of framing a learning task for an AI model that takes an image as an input, and then instead of classifying the image into a given set of categories, it classifies every pixel by the category each one belongs to.

The simplest form of this is what's called semantic segmentation, where we classify each pixel into a given set of categories - e.g. building, car, sky, road, etc. if we were implementing a segmentation model for some automated vehicle.

If you've been following my PhD journey for a while, you'll know that it's not just images that can be framed as an 'image' segmentation task: any data that is 2D (or can be convinced to pretend to be 2D) can be framed as an image segmentation task.

The output of an image segmentation model is basically a 3D map. This map will obviously have the width and height, but also an extra dimension for the channel. This is best explained with an image:

(Above: a diagram explaining how the output of an image segmentation model is formatted. See below explanation. Extracted from my work-in-progress thesis!)

Essentially, each value in the 'channel' dimension will be the probability that pixel is that class. So, for example, a single pixel in a model for predicting the background and foreground of an image might look like this:

[ 0.3, 0.7 ]

....if we consider these classes:

[ background, foreground ]

....then this pixel has a 30% chance of being a background pixel, and a 70% chance of being a foreground pixel - so we'd likely assume it's a foreground pixel.

Built up over the course of an entire image, you end up with a classification of every pixel. This could lead to models that separate the foreground and background in live video feeds, autonomous navigation systems, defect detection in industrial processes, and more.
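
To make the output format concrete, here's a tiny sketch (made-up numbers, plain numpy) of turning a probability map into a per-pixel class map:

import numpy as np

class_names = ["background", "foreground"]

# A made-up model output for a 2x2 image: height x width x channels,
# with one probability per class for every pixel
prediction = np.array([
    [[0.3, 0.7], [0.9, 0.1]],
    [[0.2, 0.8], [0.6, 0.4]],
])

# Pick the most probable class for every pixel
class_map = np.argmax(prediction, axis=-1)    # shape (2, 2), values are class indices
print(class_map)                              # the left column of pixels comes out as class 1
print(class_names[class_map[0][0]])           # foreground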

Some models, such as the Segment Anything Model (website) have even been trained to generically segment any input image, as in the above image where we have a snail sat on top of a frog sat on top of a turtle, which is swimming in the water.

As alluded to earlier, you can also feed in other forms of 2D data. For example, this paper predicts rainfall radar data a given number of hours into the future from a sample at the present moment. Or, my own research approximates the function of a physics-based model!


That's it for this post. If you've got something you'd like defining, please do leave a comment below!

I'm not sure what it'll be next, but it might be either staying at a high-level and looking at different ways that we can frame tasks in AI models, or I could jump to a lower level to look at fundamentals like loss (error) functions, backpropagation, layers (AI models are made up of multiple smaller layers), etc. What would you like to see next?

Defining AI: Word embeddings

Hey there! It's been a while. After writing my thesis for the better part of a year, I've been getting a bit burnt out on writing - so unfortunately I had to take a break from writing for this blog. My thesis is almost complete though - more on this in the next post in the PhD update blog post series. Other higher-effort posts are coming (including the belated 2nd post on NLDL-2024), but in the meantime I thought I'd start a series on defining various AI-related concepts. Each post is intended to be relatively short in length to make them easier to write.

Normal scheduling will resume soon :-)

Banner showing the text 'Defining AI' on top on translucent white vertical stripes against a voronoi diagram in white against a pink/purple background. 3 progressively larger circles are present on the right-hand side.

As you can tell by the title of this blog post, the topic for today is word embeddings.

AI models operate fundamentally on numerical values and mathematics - so naturally to process text one has to encode said text into a numerical format before it can be shoved through any kind of model.

This process of converting text to a numerically-encoded value is called word embedding!

As you might expect, there are many ways of doing this. Often, this involves looking at which other words a given word frequently appears next to. This could, for example, be framed as a task in which a model has to predict a word given the words immediately before and after it in a sentence, with the output of the last layer before the output layer then taken as the embedding (word2vec).

Other models use matrix math to calculate this instead, producing a dictionary file as an output (GloVe | paper). Still others use large models that are trained to predict randomly masked words, and process entire sentences at once (BERT and friends) - though these are computationally expensive since every bit of text you want to embed has to get pushed through the model.

Then there are contrastive learning approaches. More on contrastive learning later in the series if anyone's interested, but essentially it learns by comparing pairs of things. This can lead to a higher-level representation of the input text, which can increase performance in some circumstances, and other fascinating side effects that I won't go into in this post. Chief among these is CLIP (blog post).

The idea here is that semantically similar words wind up having similar sorts of numbers in their numerical representation (we call this a vector). This is best illustrated with a diagram:

Words embedded with GloVe and displayed in a heatmap

I used GloVe to embed some words (it's really easy to use, since it's literally just a dictionary), and then used cosine distance to compute the similarity between the different words. Once done, I plotted this in a heatmap.

As you can see, rain and water are quite similar (1 = identical; 0 = completely different), but rain and unrelated are not really alike at all.
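
If you fancy playing with this yourself, the gist is really simple. A rough sketch (the GloVe filename here is just an example - grab whichever pretrained set you like):

import numpy as np

def load_glove(filepath):
    # A GloVe .txt file is literally just a dictionary: one word per line, followed by its vector
    embeddings = {}
    with open(filepath, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def cosine_similarity(a, b):
    # 1 = pointing the same way, 0 = unrelated, -1 = opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

glove = load_glove("glove.6B.50d.txt")   # example filename - any pretrained GloVe file works
print(cosine_similarity(glove["rain"], glove["water"]))      # relatively high
print(cosine_similarity(glove["rain"], glove["unrelated"]))  # much lower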


That's about the long and short of word embeddings. As always with these things, you can go into an enormous amount of detail, but I have to cut it off somewhere.

Are there any AI-related concepts or questions you would like answering? Leave a comment below and I'll write another post in this series to answer your question.

Building the science festival demo: technical overview

Banner showing gently coloured point clouds of words against a dark background on the left, with the Humber Science Festival logo, fading into a screenshot of the attract screen of the actual demo on the right.

Hello and welcome to the technical overview of the Hull Science Festival demo I did on the 9th September 2023. If you haven't already, I recommend reading the main release post for context, and checking out the live online demo here: https://starbeamrainbowlabs.com/labs/research-smflooding-vis/

I suspect that a significant percentage of the readers of my blog here love the technical nuts and bolts of things (though you're usually very quiet in the comments :P), so I'm writing a series of posts about various aspects of the demo, because it was a very interesting project.

In this post, we'll cover the technical nuts and bolts of how I put it together, the software and libraries I used, and the approach I took. I also have another post written that I'll be posting after this one, on monkeypatching npm packages after you install them, because I wanted that to be its own post. In a post after that we'll look at the research and the theory behind the project and how it fits into my PhD and wider research.

To understand the demo, we'll work backwards and deconstruct it piece by piece - starting with what you see on screen.

Browsing for a solution

As longtime readers of my blog here will know, I'm very partial to cramming things into the browser that probably shouldn't run in one. This is also the case for this project, which uses WebGL and the HTML5 Canvas.

Of course, I didn't implement it using the WebGL API directly. That's far too much effort. Instead, I used a browser-based game engine called Babylon.js. Babylon abstracts the complicated bits away, so I can just focus on implementing the demo itself and not reinventing the wheel.

Writing code in Javascript is often an exercise in putting lego bricks together (which makes it very enjoyable, since you rarely have to deviate from your actual task due to the existence of npm). To this end, in the process of implementing the demo I collected a bunch of other npm packages to which I could then delegate various tasks.

Graphics are easy

After picking a game engine, it is perhaps unsurprising that the graphics were easy to implement - even with 70K points to display. I achieved this with Babylon's PointsCloudSystem class, which made the display of the point cloud a trivial exercise.

After adapting and applying a clever plugin (thanks, @sebavan!), I had points further away displaying smaller and closer ones displaying larger. Dropping in a perceptually uniform colour map (I wonder if anyone's designed a perceptually uniform mesh map for a 3D volume?) and some fog made the whole thing look pretty cool and intuitive to navigate.

Octopain

Now that I had the points displaying, the next step was to get the text above each point displaying properly. Clearly with 70K points (140K in the online demo!) I can't display text for all of them at once (and it would look very messy if I did), so I needed to index them somehow and efficiently determine which points were near to the player in real time. This is actually quite a well studied problem, and from prior knowledge I remember that Octrees were reasonably efficient. If I had some time to sit down and read papers (a great pastime), this one (some kind of location recognition from point clouds; potentially indoor/outdoor tracking) and this one (AI semantic segmentation of point clouds) look very interesting.

Unfortunately, extracting a list of points within a given radius is not something commonly implemented in the octree packages on npm. Combined with a bit of a headache figuring out the logic of this and how to hook it up to the existing Babylon renderer, this step took some effort before I found octree-es and got it working the way I wanted it to.

In the end, I had the octree as a completely separate point indexing data structure, and I used the word as a key to link it with the PointsCloudSystem in Babylon.

Gasp, is that a memory leak I see?!

Given I was in a bit of a hurry to get the whole demo thing working, it should come as no surprise that I ended up with a memory leak. I didn't actually have time to fix it before the big day either, so I had the demo on the big monitor while I kept an eye on the memory usage of my laptop on my laptop screen!

A photo of my demo up and running on a PC with a PS4 controller on a wooden desk. An Entroware laptop sits partially obscured by a desktop PC monitor, the latter of which has the demo full screen.

(Above: A photo of my demo in action.... I kept an eye on the memory graph in the taskbar on my laptop the whole time. It only crashed once!)

Anyone who has done anything with graphics and game engines probably suspects where the memory leak was already. When rendering the text above each point with a DynamicTexture, I didn't reuse the instance when the player moved, leading to a build-up of unused textures in memory that would eventually crash the machine. After the day was over, I had time to sit down and implement a pool to re-use these textures over and over again, which didn't take nearly as long as I thought it would.

Gamepad support

You would think that, being a well known game engine, Babylon would have working gamepad support. The documentation even suggests as such, but sadly this is not the case. When I discovered that gamepad support was broken in Babylon (at least for my PS4 controller), I ended up monkeypatching Babylon to disable the inbuilt support (it caused a crash even when disabled O.o) and then hacking together a custom implementation.

This custom implementation is actually quite flexible, so if I ever have some time I'd like to refactor it into its own npm package. Believe it or not I tried multiple other npm packages for wrapping the Gamepad API, and none worked reliably (it's a polling API, which can make designing an efficient and stable wrapper an interesting challenge).

To do that though I would need to have some other controllers to test with, as currently it's designed only for the PS4 dualshock controller I have on hand. Some time ago I initially purchased an Xbox 360 controller wanting something that worked out of the box with Linux, but it didn't work out so well so I ended up selling it on and buying a white PS4 dualshock controller instead (pictured below).

I'm really impressed with how well the PS4 dualshock works with Linux - it functions perfectly out of the box in the browser (useful test website), and even appears to have native Linux mainline kernel support, which is a big plus. The little touchpad on it is cute and helpful in some situations too, but most of the time you'd use a real pointing device.

A white PS4 dualshock controller.

(Above: A white PS4 dualshock controller.)

How does it fit in a browser anyway?!

Good question. The primary answer to this is the magic of esbuild: a magical build tool that packages your Javascript and CSS into a single file. It can also handle other associated files like images too, and on top of that it's suuuper easy to use. It tree-shakes by default, and is just all-around a joy to use.

Putting it to use resulted in my ~1.5K lines of code (wow, I thought it was more than that) along with ~300K lines in libraries being condensed into a single 4MiB .js and a 0.7KiB .css file, which I could serve to the browser along with the main index.html file. It's even really easy to implement subresource integrity, so I did that just for giggles.

Datasets, an origin story

Using the Fetch API, I could fetch a pre-prepared dataset from the server, unpack it, and do cool things with it as described above. The dataset itself was prepared using a little Python script I wrote (source).

The script uses GloVe to vectorise words (I think I used 50 dimensions since that's what fit inside my laptop at the time), and then UMAP (paper, good blog post on why UMAP is better than tSNE) to do dimensionality reduction down to 3 dimensions, whilst still preserving global structure. Judging by the experiences we had on the day, I'd say it was pretty accurate, if not always obvious why given words were related (more on why this is the case in a separate post).
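
The heart of that pipeline boils down to surprisingly little code. Here's a rough sketch (with stand-in random vectors instead of the real GloVe dictionary, so the numbers are meaningless - it's the shape of the thing that matters):

import numpy as np
import umap   # pip install umap-learn

# Stand-in for the real GloVe dictionary: word → 50-dimensional vector
rng = np.random.default_rng(42)
words = [f"word{i}" for i in range(1000)]
glove = {word: rng.random(50, dtype=np.float32) for word in words}

vectors = np.stack([glove[word] for word in words])

# Squash 50 dimensions down to 3, preserving global structure as far as possible
reducer = umap.UMAP(n_components=3, metric="cosine")
points_3d = reducer.fit_transform(vectors)    # shape: (number of words, 3)

for word, (x, y, z) in zip(words, points_3d):
    print(f"{word}\t{x}\t{y}\t{z}")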

My social media data, plotted in 2D with PCA (left), tSNE (centre), and UMAP (right). Points are blue against a white background, plotted with the Python datashader package.

(Above: My social media data, plotted in 2D with PCA (left), tSNE (centre), and UMAP (right). Points are blue against a white background, plotted with the Python datashader package.)

I like Javascript, but I had the code written in Python due to prior research, so I just used Python (looking now, there does seem to be a package that implements UMAP in JS, so I might look at that another time). The script is generic enough that I should be able to adapt it for other projects in the future to do similar kinds of analyses.

For example, if I were to look at a comparative analysis of e.g. language used by social media posts from different hashtags or something, I could use the same pipeline and just label each group with a different colour to see the difference between the 2 visually.

The data itself comes from 2 different places, depending on where you see the demo. If you were lucky enough to see it in person, then it's directly extracted from my social media data. The online one comes from page abstracts from various Wikipedia language dumps to preserve the privacy of the social media dataset, just in case.

With the data converted, the last piece of the puzzle is that of how it ends up in the browser. My answer is a gzip-compressed headerless tab-separated-values file that looks something like this (uncompressed, of course):

cat    -10.147051      2.3838716       2.9629934
apple   -4.798643       3.1498482       -2.8428414
tree -2.1351748      1.7223179       5.5107193

With the data stored in this format, it was relatively trivial to load it into the browser, decompressed as mentioned previously, and then display it with Babylon.js. There's also room here to expand and add additional columns later if needed, to e.g. control the colour of each point, label each word with a group, or something else.
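
For completeness, writing that file out from the Python side is about as simple as it sounds - something along these lines (the filename and values are just made up for illustration):

import gzip

points = {
    "cat": (-10.147051, 2.3838716, 2.9629934),
    "apple": (-4.798643, 3.1498482, -2.8428414),
    "tree": (-2.1351748, 1.7223179, 5.5107193),
}

# Headerless, tab-separated, gzip-compressed - exactly what the browser expects
with gzip.open("points.tsv.gz", "wt", encoding="utf-8") as handle:
    for word, (x, y, z) in points.items():
        handle.write(f"{word}\t{x}\t{y}\t{z}\n")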

Conclusion

We've pulled the demo apart piece by piece, and seen at a high level how it's put together and the decisions I made while implementing it. We've seen how I implemented the graphics - aided by Babylon.js and a clever hack. I've explained how I optimised the location polling to achieve real-time performance with an octree, and why reusing textures is so important. Finally, we took a brief look at the dataset and where it came from.

In the next post, we'll take a look at how to monkeypatch an npm package and when you'd want to do so. In a later post, we'll look at the research behind the demo, what makes it tick, what I learnt while building and showing it off, and how that fits in with the wider field from my perspective.

Until then, I'll see you in the next post!

Edit 2023-11-30: Oops! I forgot to link to the source code....! If you'd like to take a gander at the source code behind the demo, you can find it here: https://github.com/sbrl/research-smflooding-vis

My Hull Science Festival Demo: How do AIs understand text?

Banner showing gently coloured point clouds of words against a dark background on the left, with the Humber Science Festival logo, fading into a screenshot of the attract screen of the actual demo on the right.

Hello there! On Saturday 9th September 2023, I was on the supercomputing stand for the Hull Science Festival with a cool demo illustrating how artificial intelligences understand and process text. Since then, I've been hard at work tidying that demo up, and today I can announce that it's available to view online here on my website!

This post is a general high-level announcement post. A series of technical posts will follow on the nuts and bolts of both the theory behind the demo and the actual code itself and how it's put together, because it's quite interesting and I want to talk about it.

I've written this post to serve as a foreword / quick explanation of what you're looking at (similar to the explanation I gave in person), but if you're impatient you can just find it here.

All AIs currently developed are essentially complex parametrised mathematical models. We train these models by updating their parameters little by little until the output of the model is similar to the output of some ground truth label.

In other words, an AI is just a bunch of maths. So how does it understand text? The answer to this question lies in converting text to numbers - a process often called 'word embedding'.

This is done by splitting an input sentence into words, and then individually converting each word into a series of numbers, which is what you will see in the demo at the link below - just converted with some magic to 3 dimensions to make it look fancy.

Similar sorts of words will have similar sorts of numbers (or positions in 3D space in the demo). As an example here, at the science festival we found a group of footballers, a group of countries, and so on.

In the demo below, you will see clouds of words processed from Wikipedia. I downloaded a bunch of page abstracts for Wikipedia in a number of different languages (source), extracted a list of words, converted them to numbers (GloVe, then UMAP), and plotted them in 3D space. Can you identify every language displayed here?


Find the demo here: https://starbeamrainbowlabs.com/labs/research-smflooding-vis/

A screenshot of the initial attract screen of the demo. A central box allows one to choose a file to load, with a large load button directly beneath it. The background is a blurred + bloomed screenshot of a point cloud from the demo itself.

Find the demo here: https://starbeamrainbowlabs.com/labs/research-smflooding-vis/


If you were one of the lucky people to see my demo in person, you may notice that this online demo looks very different to the one I originally presented at the science festival. That's because the in-person demo uses data from social media, but this one uses data from Wikipedia to preserve privacy, just in case.

I hope you enjoy the demo! Time permitting, I will be back with some more posts soon to explain how I did this and the AI/NLP theory behind it at a more technical level. Some topics I want to talk about, in no particular order:

  • General technical outline of the nuts and bolts of how the demo works and what technologies I used to throw it together
  • How I monkeypatched Babylon.js's gamepad support
  • A detailed and technical explanation of the AI + NLP theory behind the demo, the things I've learnt about word embeddings while doing it, and what future research could look like to improve word embeddings based on what I've learnt
  • Word embeddings, the options available, how they differ, and which one to choose.

Until next time, I'll leave you with 2 pictures I took on the day. See you in the next post!

Edit 2023-11-30: Oops! I forgot to link to the source code....! If you'd like to take a gander at the source code behind the demo, you can find it here: https://github.com/sbrl/research-smflooding-vis

A photo of my demo up and running on a PC with a PS4 controller on a wooden desk. An Entroware laptop sits partially obscured by a desktop PC monitor, the latter of which has the demo full screen.

(Above: A photo of my demo in action!)

A photo of some piles of postcards arranged on a light wooden desk. My research is not shown, but visuals from other researchers' projects are printed, ranging from microbiology to disease research to jellyfish galaxies.

(Above: A photo of the postcards on the desk next to my demo. My research is not shown, but visuals from other researchers' projects are printed, with everything from microbiology to disease research to jellyfish galaxies.)

Visualising Tensorflow model summaries

It's no secret that my PhD is based in machine learning / AI (my specific title is "Using Big Data and AI to dynamically map flood risk"). Recently a problem I have been plagued with is quickly understanding the architecture of new (and old) models I've come across at a high level. I could read the paper a model comes from in detail (and I do this regularly), but it's much less complicated and much easier to understand if I can visualise it in a flowchart.

To remedy this, I've written a neat little tool that does just this. When you're training a new AI model, one of the things that it's common to print out is the summary of the model (I make it a habit to always print this out for later inspection, comparison, and debugging), like this:

model = setup_model()
model.summary()

This might print something like this:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 500, 16)           160000    

lstm (LSTM)                  (None, 32)                6272      

dense (Dense)                (None, 1)                 33        
=================================================================
Total params: 166,305
Trainable params: 166,305
Non-trainable params: 0
_________________________________________________________________

(Source: ChatGPT)

This is just a simple model, but it is common for larger ones to have hundreds of layers, like this one I'm currently playing with as part of my research:

(Can't see the above? Try a direct link.)

Woah, that's some model! It must be really complicated. How are we supposed to make sense of it?

If you look closely, you'll notice that it has a Connected to column, as it's not a linear tf.keras.Sequential() model. We can use that to plot a flowchart!

This is what the tool I've written generates, using a graphing library called nomnoml. It parses the Tensorflow summary, and then compiles it into a flowchart. Here's a sample:

It's a purely web-based tool - no data leaves your client. Try it out for yourself here:

https://starbeamrainbowlabs.com/labs/tfsummaryvis/

For the curious, the source code for this tool is available here:

https://git.starbeamrainbowlabs.com/sbrl/tfsummaryvis

AI encoders demystified

When beginning the process of implementing a new deep learning / AI model, one of the tasks - alongside deciding on the inputs and outputs of the model and converting your data to tfrecord files (file extension: .tfrecord.gz; you really should, the performance boost is incredible) - is designing the actual architecture of the model itself. It might be tempting to shove a bunch of CNN layers at the problem to make it go away, but in this post I hope to convince you to think again.

As with any field, extensive research has already been conducted, and when it comes to AI that means that the effective encoding of various different types of data with deep learning models has already been carefully studied.

To my understanding, there are 2 broad categories of data that you're likely to encounter:

  1. Images: photos, 2D maps of sensor readings, basically anything that could be plotted as an image
  2. Sequenced data: text, sensor readings with only a temporal and no spatial dimension, audio, etc

Sometimes you'll encounter a combination of these two. That's beyond the scope of this blog post, but my general advice is:

  • If it's 3D, consider it a 2D image with multiple channels to save VRAM
  • Go look up specialised model architectures for your problem and how well they work/didn't work
  • Use a Convolutional variant of a sequenced model or something
  • Pick one of these encoders and modify it slightly to fit your use case

In this blog post, I'm going to give a quick overview of the various state of the art encoders for the 2 categories of data I've found on my travels so far. Hopefully, this should give you a good starting point for building your own model. By picking an encoder from this list as a starting point for your model and reframing your problem just a bit to fit, you should find that you have not only a much simpler problem/solution, but also a more effective one too.

I'll provide links to the original papers for each model, but also links to any materials I've found that I found helpful in understanding how they work. This can be useful if you find yourself in the awkward situation of needing to implement them yourself, but it's also generally good to have an understanding of the encoder you ultimately pick.

I've read a number of papers, but there's always the possibility that I've missed something. If you think this is the case, please comment below. This is especially likely if you are dealing with audio data, as I haven't looked into handling that too much yet.

Finally, also out of scope of this blog post are the various ways of framing a problem for an AI to understand it better. If this interests you, please comment below and I'll write a separate post on that.

Images / spatial data

ConvNeXt: Image encoders are a shining example of how a simple stack of CNNs can be iteratively improved through extensive testing, and in no encoder is this more apparent than in ConvNeXt.

In 1 sentence, a ConvNeXt model can be summarised as "a ResNet but with the features of a transformer". Researchers took a bog-standard ResNet, and then iteratively tested and improved it by adding various features that you'd normally find in a transformer, which results in the most performant (read: highest accuracy) encoder when compared to the other 2 on this list.

ConvNeXt is proof that CNN-based models can be significantly improved by incorporating more than just a simple stack of CNN layers.

Vision Transformer: Born from when someone asked what would happen if they tried to put an image into a normal transformer (see below), Vision Transformers are a variant of a normal transformer that handles images and other spatial data instead of sequenced data (e.g. sentences).

The current state of the art is the Swin Transformer to the best of my knowledge. Note that ConvNeXt outperforms Vision Transformers, but the specifics of course depend on your task.

ResNet: Extremely popular, you may have heard of the humble ResNet before. It was originally made a thing when someone wanted to know what would happen if they took the number of layers in a CNN-based model to the extreme. It has a 'residual connection' between blocks of layers, which avoids the 'vanishing gradient problem' - in short, when your model is too deep, the gradient that the error is backpropagated with becomes so small that adjusting the weights has hardly any effect.

Since this model was invented, skip connections (or 'residual connections', as they are sometimes known) have become a regular feature in all sorts of models - especially those deeper than just a couple of layers.

Despite this significant advancement though, I recommend using a ConvNeXt encoder for images instead - it performs better, and the architecture is more well tuned than a bog-standard ResNet.
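
If you just want to get going, Tensorflow ships pretrained versions of several of these encoders. A minimal sketch (ResNet50 here because it's available everywhere; recent Tensorflow versions also include ConvNeXt variants such as tf.keras.applications.ConvNeXtTiny, version permitting):

import tensorflow as tf

# A pretrained image encoder with the classification head chopped off,
# so it outputs an embedded feature map instead of class predictions
encoder = tf.keras.applications.ResNet50(
    include_top=False,
    weights="imagenet",
    input_shape=(224, 224, 3),
)

# Stick your own task-specific head on top
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),   # e.g. a 10-class classifier
])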

Sequenced Data

Transformer: The big one that is quite possibly the most famous encoder-decoder ever invented. The Transformer replaces the LSTM (see below) for handling sequenced data. It has 2 key advantages over an LSTM:

  • It's more easily parallelisable, which is great for e.g. training on GPUs
  • The attention mechanism it implements revolutionises the performance of like every network ever

The impact of the Transformer on AI cannot be overstated. In short: If you think you want an LSTM, use a transformer instead.

  • Paper: Attention is all you need
  • Explanation: The Illustrated Transformer
  • Implementation: Many around, but I have yet to find a simple enough one, so I implemented it myself from scratch. While I've tested the encoder and I know it works, I have yet to fully test the decoder. Once I have done this, I will write a separate blog post about it and add the link here.

LSTM: Short for Long Short-Term Memory, LSTMs were invented in 1997 to solve the vanishing gradient problem, in which gradients when backpropagating back through recurrent models that handle sequenced data shrink to vanishingly small values, rendering them ineffective at learning long-term relationships in sequenced data.

Superseded by the (non-recurrent) Transformer, LSTMs are a recurrent model architecture, which means that their output feeds back into themselves. While this enables some unique model architectures for learning sequenced data (exhibit A), it also makes them horrible at being parallelised on a GPU (the dilated LSTM architecture attempts to remedy this, but Transformers are just better). The only thing in their favour compared to the Transformer is that the Transformer is somehow not built into Tensorflow as standard yet, whereas LSTMs are.

Just like Vision Transformers adapted the Transformer architecture for multidimensional data, so too do Grid LSTMs adapt normal LSTMs.
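
Since LSTMs are built into Tensorflow, a minimal sequence encoder is only a few lines - here's a quick sketch of a binary classifier, much like the example summary in my post on visualising Tensorflow model summaries:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16),   # token ids → word embeddings
    tf.keras.layers.LSTM(32),                                    # encode the whole sequence into one vector
    tf.keras.layers.Dense(1, activation="sigmoid"),              # e.g. binary classification
])
model.build(input_shape=(None, 500))   # batch size unknown, sequences up to 500 tokens long
model.summary()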

Conclusion

In summary, for images encoders in priority order are: ConvNeXt, Vision Transformer (Swin Transformer), ResNet, and for text/sequenced data: Transformer, LSTM.

We've looked at a small selection of model architectures for handling a variety of different data types. This is not an exhaustive list, however. You might have another awkward type of data to handle that doesn't fit into either of these categories - e.g. specialised models exist for handling audio, but the general rule of thumb is an encoder architecture probably already exists for your use-case - even if I haven't listed it here.

Also of note are alternative use cases for data types I've covered here. For example, if I'm working with images I would use a ConvNeXt, but if model prediction latency and/or resource consumption mattered I would consider using a MobileNet (https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v3), which, while a smaller model, is designed for producing rapid predictions in lower resource environments - e.g. on mobile phones.

Finally, while these are encoders, decoders also exist for various tasks. Often, they are tightly integrated into the encoder. For example, the U-Net is designed for image segmentation. Listing these is out of scope of this article, but if you are getting into AI to solve a specific problem (as is often the case), I strongly recommend looking to see if an existing model/decoder architecture has been designed to solve your particular problem. It is often much easier to adjust your problem to fit an existing model architecture than it is to design a completely new architecture to fit your particular problem (trust me, I've tried this already and it was a Bad Idea).

The first one you see might even not be the best / state of the art out there - e.g. Transformers are better than the more widely used LSTMs. Surveying the landscape for your particular task (and figuring out how to frame it in the first place) is critical to the success of your model.

Easily write custom Tensorflow/Keras layers

At some point when working on deep learning models with Tensorflow/Keras for Python, you will inevitably encounter a need to use a layer type in your models that doesn't exist in the core Tensorflow/Keras for Python (from here on just simply Tensorflow) library.

I have encountered this need several times, and rather than e.g. subclassing tf.keras.Model, there's a much easier way - and if you just have a simple sequential model, you can even keep using tf.keras.Sequential with custom layers!

Background

First, some brief background on how Tensorflow is put together. The most important thing to remember is Tensorflow likes very much to compile things into native code using what we can think of as an execution graph.

In this case, by execution graph I mean a directed graph that defines the flow of information through a model or some other data processing pipeline. This is best explained with a diagram:

A simple stack of Keras layers illustrated as a directed graph

Here, we define a simple Keras AI model for classifying images which you might define with the functional API. I haven't tested this model - it's just to illustrate an example (use something e.g. like MobileNet if you want a relatively small model for image classification).

The layer stack starts at the top and works its way downwards.

When you call model.compile(), Tensorflow compiles this graph into native code for faster execution. This is important, because when you define a custom layer, you may only use Tensorflow functions to operate on the data, not Python/Numpy/etc ones.

You may have already encountered this limitation if you have defined a Tensorflow function with tf.function(some_function).

The reason for this is the specifics of how Tensorflow compiles your model. Now consider this graph:

A graph of Tensorflow functions

Basic arithmetic operations on tensors as well as more complex operators such as tf.stack, tf.linalg.matmul, etc operate on tensors as you'd expect in a REPL, but in the context of a custom layer or tf.function they operate not on real tensors, but on symbolic ones instead.

It is for this reason that when you implement a tf.function to use with tf.data.Dataset.map() for example, it only gets executed once.
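
You can see this trace-once behaviour for yourself with a trivial tf.function - the plain Python print() only fires while Tensorflow is building the graph:

import tensorflow as tf

@tf.function
def double(x):
    # This print() is ordinary Python, so it only runs while Tensorflow traces
    # the function to build its execution graph - not on every call
    print("tracing!")
    return x * 2

print(double(tf.constant(3)))   # prints "tracing!", then tf.Tensor(6, ...)
print(double(tf.constant(4)))   # no "tracing!" this time - the compiled graph is reused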

Custom layers for the win!

With this in mind, we can relatively easily put together a custom layer. It's perhaps easiest to show a trivial example and then explain it bit by bit.

I recommend declaring your custom layers each in their own file.

import tensorflow as tf

class LayerMultiply(tf.keras.layers.Layer):
    def __init__(self, multiplier=2, **kwargs):
        super(LayerMultiply, self).__init__(**kwargs)

        self.param_multiplier = multiplier
        self.tensor_multiplier = tf.constant(multiplier, dtype=tf.float32)

    def get_config(self):
        config = super(LayerMultiply, self).get_config()

        config.update({
            "multiplier": self.param_multiplier
        })

        return config

    def call(self, input_thing, training, **kwargs):
        return input_thing * self.tensor_multiplier

Custom layers are subclassed from tf.keras.layers.Layer. There are a few parts to a custom layer:

The constructor (__init__) works as you'd expect. You can take in custom (hyper)parameters (which should not be tensors) here and use them to control the operation of your custom layer.

get_config() must ultimately return a dictionary of arguments to pass to instantiate a new instance of your layer. This information is saved along with the model when you save a model e.g. with tf.keras.callbacks.ModelCheckpoint in .hdf5 mode, and then used when you load a model with tf.keras.models.load_model (more on loading a model with custom layers later).

A paradigm I usually adopt here is setting self.param_ARG_NAME_HERE fields in the constructor to the value of the parameters I've taken in, and then spitting them back out again in get_config().

call() is where the magic happens. This is called when you call model.compile() with a symbolic tensor which stands in for the shape of the real tensor to build an execution graph as explained above.

The first argument is always the output of the previous layer. If your layer expects multiple inputs, then this will be an array of (potentially symbolic) tensors rather than a (potentially symbolic) tensor directly.

The second argument is whether you are in training mode or not. You might not be in training mode if:

  1. You are spinning over the validation dataset
  2. You are making a prediction / doing inference
  3. Your layer is frozen for some reason

Sometimes you may want to do something differently if you are in training mode vs not training mode (e.g. dataset augmentation), and Tensorflow is smart enough to ensure this is handled as you'd expect.

Note also here that I use a native multiplication with the asterisk * operator. This works because Tensorflow tensors (whether symbolic or otherwise) overload this and other operators so you don't need to call tf.math.multiply, tf.math.divide, etc explicitly yourself, which makes your code neater.

That's it, that's all you need to do to define a custom layer!

Using and saving

You can use a custom layer just like a normal one. For example, using tf.keras.Sequential:

import tensorflow as tf

from .components.LayerMultiply import LayerMultiply

def make_model(batch_size, multiplier):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(96),
        tf.keras.layers.Dense(32),
        LayerMultiply(multiplier=5),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.build(input_shape=tf.TensorShape([ batch_size, 32 ]))
    model.compile(
        optimizer="Adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[
            tf.keras.metrics.SparseCategoricalAccuracy()
        ]
    )
    return model

The same goes here for the functional API. I like to put my custom layers in a components directory, but you can put them wherever you like. Again here, I haven't tested the model at all, it's just for illustrative purposes.

Saving works as normal, but for loading a saved model that uses a custom layer, you need to provide a dictionary of custom objects:

loaded_model = tf.keras.models.load_model(filepath_checkpoint, custom_objects={
    "LayerMultiply": LayerMultiply,
})

If you have multiple custom layers, define all the ones you use here. It doesn't matter if you define extra ones, it seems - it'll just ignore any that aren't used.

Going further

This is far from all you can do. In custom layers, you can also:

  • Instantiate sublayers or models (tf.keras.Model inherits from tf.keras.layers.Layer)
  • Define custom trainable weights (tf.Variable)

Instantiating sublayers is very easy. Here's another example layer:

import tensorflow as tf

class LayerSimpleBlock(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super(LayerSimpleBlock, self).__init__(**kwargs)

        self.param_units = units

        self.block = tf.keras.Sequential([
            tf.keras.layers.Dense(self.param_units),
            tf.keras.layers.Activation("gelu"),
            tf.keras.layers.Dense(self.param_units),
            tf.keras.layers.LayerNormalization()
        ])

    def get_config(self):
        config = super(LayerSimpleBlock, self).get_config()

        config.update({
            "units": self.param_units
        })

        return config

    def call(self, input_thing, training, **kwargs):
        return self.block(input_thing, training=training)

This would work with a single sublayer too.

Custom trainable weights are also easy, but require a bit of extra background. If you're reading this post, you have probably heard of gradient descent. The specifics of how it works are out of scope of this blog post, but in short it's the underlying core algorithm deep learning models use to reduce error by stepping bit by bit towards lower error.

Tensorflow goes looking for all the weights in a model during the compilation process (see the explanation on execution graphs above) for you, and this includes custom weights.

You do, however, need to mark a tensor as a weight - otherwise Tensorflow will assume it's a static value. This is done through the use of tf.Variable:

tf.Variable(name="some_unique_name", initial_value=tf.random.uniform([64, 32]))

As far as I've seen so far, tf.Variable()s need to be defined in the constructor of a tf.keras.layers.Layer, for example:

import tensorflow as tf

class LayerSimpleBlock(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(LayerSimpleBlock, self).__init__(**kwargs)

        self.weight = tf.Variable(name="some_unique_name", initial_value=tf.random.uniform([64, 32]))

    def get_config(self):
        config = super(LayerSimpleBlock, self).get_config()
        return config

    def call(self, input_thing, training, **kwargs):
        return input_thing * self.weight

After you define a variable in the constructor, you can use it like a normal tensor - after all, in Tensorflow (and probably other deep learning frameworks too), tensors don't always have to hold an actual value at the time of execution as I explained above (I call tensors that don't contain an actual value like this symbolic tensors, since they are like stand-ins for the actual value that gets passed after the execution graph is compiled).

Conclusion

We've looked at defining custom Tensorflow/Keras layers that you can use without giving up tf.keras.Sequential() or the functional API. I've shown how, by compiling Python function calls into native code using an execution graph, many orders of magnitude of performance gains can be obtained, fully saturating GPU usage.

We've also touched on defining custom weights in custom layers, which can be useful depending on what you're implementing. As a side note, should you need a weight in a custom loss function, you'll need to define it in the constructor of a tf.keras.layers.Layer and then pull it out and pass it to your subclass of tf.keras.losses.Loss.

By defining custom Tensorflow/Keras layers, we can implement new cutting-edge deep learning logic that is easy to use. For example, I have implemented a Transformer with a trio of custom layers, and CBAM: Convolutional Block Attention Module also looks very cool - I might implement it soon too.

I haven't posted a huge amount about AI / deep learning on here yet, but if there's any topic (machine learning or otherwise) that you'd like me to cover, I'm happy to consider it - just leave a comment below.

The plan to caption and index images

Something that has been on my mind for a while are the photos that I take. At last count on my NAS I have 8564 pictures I have taken so far since I first got a phone to take them with, and many more belonging to other family members.

I have blogged before about a script I've written that automatically processes photographs and files them by year and month. It fixes the date taken, sets the thumbnail for rapid preview loading, automatically rotates them to be the right way up, losslessly optimises them, and more.

The one thing it can't do though is to help me locate a specific photo I'm after, so given my work with AI recently I have come up with a plan to do something about this, and I want to blog about it here.

By captioning the images with an AI, I plan to index the captions (and other image metadata) and have a web interface in the form of a search engine. In this blog post, I'm going to outline the AI I intend to use, and the architecture of the image search engine I have already made a start on implementing.

AI for image captioning

The core AI to do image captioning will be somewhat based on work I've done for my PhD. The first order of business was finding a dataset to train on, and I stumbled across Microsoft's Common Objects in Context dataset. The next and more interesting part was to devise a model architecture that can translate an image into text.

When translating 1 thing (or state space) into another in AI, it is generally done with an encoder-decoder architecture. In my case here, that's an encoder for the image - to translate it into an embedded feature space - and a decoder to turn that embedded feature space into text.

There are many options for these - especially for encoding images - which I'll look at first. While doing my PhD, I've come across many different encoders for images, which I'd roughly categorise into 2 main categories:

Since the transformer model was invented, they have been widely considered to be the best option. Swin Transformers adapt this groundbreaking design for images - transformers originally handled text - from what I can tell better than the earlier Vision Transformer architecture.

On the other side, a number of encoders were invented before transformers were a thing - the most famous of which was ResNet (I think I have the right paper), which was basically just a bunch of CNN layers stacked on top of one another with a few extra bits like normalisation and skip connections.

Recently though, a new CNN-based architecture has appeared that draws inspiration from the strong points of transformers - it's called ConvNeXt. Based on the numbers in the paper, it even beats the Swin Transformer model mentioned earlier. Best of all, it's much simpler in design, which makes it relatively easy to implement. It is this model architecture I will be using.

For the text, things are both straightforward and not: the model architecture I'll be using is a transformer (of course - I even implemented it myself from scratch!), but the trouble is representation - particularly the representation of the image caption we want the model to predict.

There are many approaches to this problem, but the one I'm going to try first is a word-based solution using one-hot encoding. There are about 27K different unique words in the dataset, so I've assigned each one a unique number in a dictionary file. Then, I can turn this:

[ "a", "cat", "sat", "on", "a", "mat" ]

....into this:

[ 0, 1, 2, 3, 0, 4 ]

...then, the model would predict something like this:

[
    [ 1, 0, 0, 0, 0, 0, ... ],
    [ 0, 1, 0, 0, 0, 0, ... ],
    [ 0, 0, 1, 0, 0, 0, ... ],
    [ 0, 0, 0, 1, 0, 0, ... ],
    [ 1, 0, 0, 0, 0, 0, ... ],
    [ 0, 0, 0, 0, 1, 0, ... ]
]

...where each sub-array is a word.
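
In Tensorflow this conversion is a one-liner with tf.one_hot - a quick sketch, using my rough dictionary size:

import tensorflow as tf

dictionary_size = 27000                         # ~27K unique words in the dataset
caption = tf.constant([0, 1, 2, 3, 0, 4])       # "a cat sat on a mat" as word ids

one_hot = tf.one_hot(caption, depth=dictionary_size)
print(one_hot.shape)    # (6, 27000) - one row per word, one column per dictionary entry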

This will, as you might suspect, use a lot of memory - especially with 27K words in the dictionary. By my calculations, with a batch size of 64 and a maximum caption length of 25, each output prediction tensor will use a whopping 172.8 MB of memory as float32, or 86.4 MB as float16 (more on memory usage later).

I'm considering a variety of techniques to combat this if it becomes an issue. For example, reducing the dictionary size by discarding infrequently used words.

Another option would be to have the model predict GloVe vectors as an output and then compare the output to the GloVe dictionary to tell which one to pick. This would come with its own set of problems however, like lots of calculations to compare each word to every word in the dictionary.

My final thought was that I could maybe predict individual characters instead of full words. There would be more items in the sequence predicted, but each character would only have up to 255 choices (probably more like 36-ish), potentially saving memory.

I have already implemented this AI - I just need to debug and train it now. To summarise, here's a diagram:

The last problem with the AI though is memory usage. I plan on eventually running the AI on a Raspberry Pi, so much tuning will be required to reduce memory usage and latency as much as I can. In particular, I'll be trying out quantising my model and writing the persistent daemon to use Tensorflow Lite to reduce memory usage. Models train using the float32 data type - which uses 32 bits per value - but quantising it after training to use float16 (16 bits / value) or even uint8 (8 bits / value) would significantly reduce memory usage.
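
For the quantisation step, my current plan is the standard Tensorflow Lite converter route - roughly like this sketch (the tiny stand-in model and the filename are just placeholders):

import tensorflow as tf

# Stand-in for the trained captioning model
model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(4,))])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]   # store weights as float16 rather than float32
tflite_model = converter.convert()

with open("captioner.tflite", "wb") as handle:
    handle.write(tflite_model)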

Search engine and indexing

The second part of this is the search engine. The idea here is to index all my photos ahead of time, and then have a web interface I can use to search and filter them. The architecture I plan on using to achieve this is rather complicated, and best explained with a diagram:

The backend I have chosen for the index is called meilisearch. It's written in Rust, and provides advanced as-you-type search functionality. I chose it for 2 reasons:

  1. While I'd love to implement my own, meilisearch is an open source project where they have put in more hours into making it cool than I ever would be able to
  2. Being a separate daemon means I can schedule it on my cluster as a separate task, which potentially might end up on a different machine

With this in mind, the search engine has 2 key parts to it: the crawler / indexer, and the HTTP server that serves the web interface. The web interface will talk to meilisearch to perform searches (not directly; requests will be proxied and transformed).

The crawler will periodically scan the disk for new, updated, and deleted files, and pass them on to the indexer queue. The indexer will do 4 things:

  1. Caption the image, by talking to a persistent Python child process via Inter Process Communication (IPC) - captions will be written as EXIF data to images
  2. Thumbnail images and store them in a cache (perhaps some kinda key-value store, as lots of little files on disk would be a disaster for disk space)
  3. Extract EXIF (and other) metadata
  4. Finally, push the metadata to meilisearch for indexing

Tasks 2 and 3 can be done in parallel, but the others will need to be done serially - though multiple images can of course be processed concurrently. I anticipate much asynchronous code here, which I'm rather looking forward to finishing writing :D

I already have a good start on the foundation of the search engine here. Once I've implemented enough that it's functional, I'll open source everything.

To finish this post, I have a mockup screenshot of what the main search page might look like:

Obviously the images are all placeholders (append ?help to this URL to see the help page) for now and I don't yet have a name for it (suggestions in the comments are most welcome!), but the rough idea is there.

PhD Aside 2: Jupyter Lab / Notebook First Impressions

Hello there! I'm back with another PhD Aside blog post. In the last one, I devised an extremely complicated and ultimately pointless mechanism by which multiple Node.js processes can read from the same file handle at the same time. This post hopefully won't be quite as useless, as it's a cross with the other reviews / first impressions posts I've made previously.

I've had Jupyter on my radar for ages, but it's only very recently that I've actually given it a try. Despite being almost impossible to spell (though it does appear to be getting easier with time), it's both easy to install and extremely useful when plotting visualisations, so I wanted to talk about it here.

I tried Jupyter Lab, which is apparently more complicated than Jupyter Notebook. Personally though I'm not sure I see much of a difference, aside from a file manager sidebar in Jupyter Lab that is rather useful.

A Jupyter Lab session of mine, in which I was visualising embeddings from a pretrained CLIP model.

(Above: A Jupyter Lab session of mine, in which I was visualising embeddings from a pretrained CLIP model.)

Jupyter Lab is installed via pip (pip3 for apt-based systems): https://jupyter.org/install. Once installed, you can start a server with jupyter-lab in a terminal (or command line), and then it will automatically open a new tab in your browser that points to the server instance (http://localhost:8888/ by default).

Then, you can open 1 or more Jupyter Notebooks, which seem to be regular files (e.g. Javascript, Python, and more) but are split into 'cells', which can be run independently of one another. While these cells are usually run in order, there's nothing to say that you can't run them out of order, or indeed the same cell over and over again as you prototype a graph.

The output of each cell is displayed directly below it. Be that a console.log()/print() call or a graph visualisation (see the screenshot above), it seems to work just fine. It also saves the output of a cell to disk alongside the code in the Jupyter Notebook, which can be a double-edged sword: on the one hand, it's very useful to have the plot and other output displayed to remind you what you were working on, but on the other hand if the output somehow contains sensitive data, then you need to remember to clear it before saving & committing to git each time, which is a hassle. Similarly, every time the output changes the notebook file on disk also changes, which can result in unnecessary extra changes committed to git if you're not careful.

In the same vein, I have yet to find a way to define a variable in a notebook file whose value is not saved along with the notebook file, which I'd rather like since the e.g. tweets I work with for the social media side of my PhD are considered sensitive information, and so I don't want to commit them to a git repository which will no doubt end up open-source.

You can also import functions and classes from other files. Personally, I see Jupyter notebooks to be most useful when used in conjunction with an existing codebase: while you can put absolutely everything in your Jupyter notebook, I wouldn't recommend it as you'll end up with spaghetti code that's hard to understand or maintain - just like you would in a regular codebase in any other language.

Likewise, I wouldn't recommend implementing an AI model in a Jupyter notebook directly. While you can, it makes it complicated to train it on a headless server - which you'll likely want to do if you want to train a model at any scale.

The other minor annoyance is that by using Jupyter you end up forfeiting the code intelligence of e.g. Atom or Visual Studio Code, which is a shame since a good editor can e.g. check syntax on the fly, inform you of unused variables, provide autocomplete, etc.

These issues aside, Jupyter is a great fit for plotting visualisations due to the very short improve → rerun → inspect/evaluate output loop. It's also a good fit for writing tutorials I suspect, as it apparently has support for markdown cells too. At some point, I may try writing a tutorial in Jupyter notebook, rendering it to regular markdown, and posting it here.
