Defining AI: Sequenced models || LSTMs and Transformers
Hi again! I'm back for more, and I hope you are too! First we looked at word embeddings, and then image embeddings. Another common way of framing tasks for AI models is handling sequenced data, which I'll talk about briefly today.
Sequenced data can mean a lot of things. Most commonly that's natural language, which we can split up into words or subword tokens (tokenisation) and then bung through some word embeddings before dropping the result through some model.
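To make that concrete, here's a minimal sketch of that pipeline. The tiny vocabulary and the random embedding table are made up purely for illustration - a real model learns its embeddings during training.

```python
import numpy as np

# Toy vocabulary and a random embedding table, purely for illustration -
# a real model learns its embeddings during training.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding_dim = 8
embedding_table = np.random.rand(len(vocab), embedding_dim)

sentence = "the cat sat on the mat"
tokens = sentence.split()                   # crude word-level tokenisation
token_ids = [vocab[t] for t in tokens]      # words -> integer ids
embedded = embedding_table[token_ids]       # ids -> one vector per token

print(embedded.shape)  # (6, 8): six tokens, each an 8-dimensional vector
```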
That model is usually specially designed for processing sequenced data, such as an LSTM (or its little cousin the GRU, see that same link). An LSTM is part of a class of models called Recurrent Neural Networks, or RNNs. These models work by feeding their output back in to themselves along with the next element of input, thereby processing a sequence one item at a time. The bit that feeds back in is essentially the model's memory: it's what remembers stuff across multiple items in the sequence. At each step the model decides what to add to or remove from this memory, which is how it keeps track of context (and lets go of things it no longer needs).
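Here's a bare-bones sketch of that recurrent loop. I've used a plain RNN cell rather than a full LSTM (so none of the gating machinery), and the weights are random stand-ins for what a trained model would have learned.

```python
import numpy as np

# Random weights stand in for what a trained model would have learned.
input_dim, hidden_dim = 8, 16
W_in = np.random.randn(hidden_dim, input_dim) * 0.1
W_rec = np.random.randn(hidden_dim, hidden_dim) * 0.1

sequence = np.random.randn(6, input_dim)  # e.g. six embedded tokens
hidden = np.zeros(hidden_dim)             # the "memory" carried across steps

for x in sequence:
    # The previous hidden state is fed back in alongside the next input.
    hidden = np.tanh(W_in @ x + W_rec @ hidden)

print(hidden.shape)  # (16,): a summary of everything seen so far
```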
The problem with this class of model is twofold: firstly, it processes everything serially, which means we can't compute it in parallel; and secondly, if the sequence it processes is long enough, it will forget things from earlier in the sequence.
In 2017 a model was invented that solves at least one of these issues: the Transformer. Instead of working like a recurrent network, a Transformer processes an entire sequence at once, and encodes a time signal (the 'positional embedding') into the input so it can keep track of what was where in the sequence - a downside of processing everything in parallel is that, without such a signal, the model has no idea which element of the sequence came where.
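Here's a small sketch of the sinusoidal positional encoding used in the original paper (many later models learn their positional embeddings instead):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention is All You Need'."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)             # even dimensions
    encoding[:, 1::2] = np.cos(angles)             # odd dimensions
    return encoding

# Same shape as the word embeddings, so the two can simply be added together.
print(sinusoidal_positional_encoding(seq_len=6, d_model=8).shape)  # (6, 8)
```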
The transformer model also put front and centre the concept of 'attention' (an idea that predates it, but which the Transformer relies on almost exclusively), which is where the model decides which parts of the input data are important at each step. This helps the transformer to focus on the bits of the data that are relevant to the task at hand.
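Below is a sketch of scaled dot-product attention, the specific flavour the Transformer uses. In a real model the queries, keys and values come from separate learned projections of the input; reusing the raw sequence for all three here is just to keep the example short.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Each output is a weighted mix of the values; the weights come from
    how strongly each query matches each key."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ values

seq = np.random.randn(6, 8)   # six tokens, eight dimensions each
# In a real Transformer, Q, K and V are separate learned projections of the
# input; reusing the raw sequence for all three keeps this sketch short.
out = scaled_dot_product_attention(seq, seq, seq)
print(out.shape)  # (6, 8)
```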
Since 2017, the number of variants of the original transformer has exploded, which stands testament to the popularity of this model architecture.
Regarding inputs and outputs, most transformers will take an input in the same shape as the word embeddings in the first post in this series, and will spit out an output of the same shape - one vector per element of the sequence, though the size of the embedding dimension may differ.
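A quick way to see that shape-in equals shape-out is to run some random 'token' vectors through PyTorch's built-in Transformer encoder (the sizes below are arbitrary, picked just for the demonstration):

```python
import torch
import torch.nn as nn

# Arbitrary sizes, chosen just to show the shapes going in and coming out.
d_model = 64
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randn(10, 1, d_model)  # (sequence length, batch size, embedding dim)
output = encoder(tokens)

print(tokens.shape, output.shape)  # both torch.Size([10, 1, 64])
```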
In this fashion, transformers are limited with respect to sequence length only by memory and compute (standard self-attention compares every element with every other, so its cost grows quadratically with the length of the sequence), and some model designs exploit this to convince a transformer-style model to learn to handle significantly longer sequences.
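To get a feel for that quadratic cost, here's some rough back-of-the-envelope arithmetic for the attention weight matrix alone, assuming 4-byte floats - the exact numbers depend heavily on the implementation, but the trend is the point:

```python
# Memory for a single (seq_len x seq_len) attention weight matrix,
# assuming 4-byte floats - and this is just one head of one layer.
for seq_len in (1_000, 10_000, 100_000):
    gigabytes = seq_len ** 2 * 4 / 1e9
    print(f"{seq_len:>7} tokens -> {gigabytes:.3f} GB")
```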
This is just a little look at sequenced data modelling with LSTMs and Transformers. I seem to have my ordering a bit backwards here, so in the next few posts we'll get down to basics and explain some concepts like Tensors and shapes, loss functions and backpropagation, attention, and how AI models are put together.
Are there any AI-related concepts or questions you would like answering? Leave a comment below and I'll write another post in this series to answer your question.
Sources and further reading
- The Illustrated Transformer
- Illustrated Guide to LSTMs and GRUs: A step by step explanation
- Attention is All You Need - the original paper about the transformer model architecture
- Long Short-Term Memory - the original paper about the LSTM
- Transformer models: an introduction and catalog
- The Unreasonable Effectiveness of Recurrent Neural Networks - an older post from 2015, but a very cool earlier character-based text generation model. The hallucination by that model is a great early example of the same issue that modern Transformer-based text generation models (a topic for another time) like ChatGPT still suffer from.