Defining AI: Sequenced models || LSTMs and Transformers
Hi again! I'm back for more, and I hope you are too! First we looked at word embeddings, and then image embeddings. Another common way of framing tasks for AI models is handling sequenced data, which I'll talk about briefly today.
Sequenced data can mean a lot of things. Most commonly that's natural language, which we can split up into words or subword tokens (tokenisation) and then bung through some word embeddings before dropping the result through some model.
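To make that concrete, here's a minimal sketch of that pipeline. The tiny vocabulary and the random embedding table are made up purely for illustration - a real model learns its embeddings during training.

```python
import numpy as np

# Toy vocabulary and a random embedding table, purely for illustration -
# a real model learns its embeddings during training.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding_dim = 8
embedding_table = np.random.rand(len(vocab), embedding_dim)

sentence = "the cat sat on the mat"
tokens = sentence.split()                   # crude word-level tokenisation
token_ids = [vocab[t] for t in tokens]      # words -> integer ids
embedded = embedding_table[token_ids]       # ids -> one vector per token

print(embedded.shape)  # (6, 8): six tokens, each an 8-dimensional vector
```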
That model is usually specially designed for processing sequenced data, such as an LSTM (or its little cousin the GRU, see that same link). An LSTM is part of a class of models called Recurrent Neural Networks, or RNNs. These models work by feeding their output back in to themselves along with the next element of input, thereby processing a sequence one item at a time. The bit that feeds back in is essentially the model's memory: it's what remembers stuff across multiple items in the sequence. At each step the model decides what to add to or remove from this memory, which is how it keeps track of context (and lets go of things it no longer needs).
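Here's a bare-bones sketch of that recurrent loop. I've used a plain RNN cell rather than a full LSTM (so none of the gating machinery), and the weights are random stand-ins for what a trained model would have learned.

```python
import numpy as np

# Random weights stand in for what a trained model would have learned.
input_dim, hidden_dim = 8, 16
W_in = np.random.randn(hidden_dim, input_dim) * 0.1
W_rec = np.random.randn(hidden_dim, hidden_dim) * 0.1

sequence = np.random.randn(6, input_dim)  # e.g. six embedded tokens
hidden = np.zeros(hidden_dim)             # the "memory" carried across steps

for x in sequence:
    # The previous hidden state is fed back in alongside the next input.
    hidden = np.tanh(W_in @ x + W_rec @ hidden)

print(hidden.shape)  # (16,): a summary of everything seen so far
```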
The problem with this class of model is twofold: firstly, it processes everything serially, which means we can't compute it in parallel; and secondly, if the sequence it processes is long enough, it will forget things from earlier in the sequence.
In 2017 a model was invented that solves at least one of these issues: the Transformer. Instead of working like a recurrent network, a Transformer processes an entire sequence at once, and encodes a time signal (the 'positional embedding') into the input so it can keep track of what was where in the sequence - a downside of processing everything in parallel is that, without such a signal, the model has no idea which element of the sequence came where.
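Here's a small sketch of the sinusoidal positional encoding used in the original paper (many later models learn their positional embeddings instead):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention is All You Need'."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)             # even dimensions
    encoding[:, 1::2] = np.cos(angles)             # odd dimensions
    return encoding

# Same shape as the word embeddings, so the two can simply be added together.
print(sinusoidal_positional_encoding(seq_len=6, d_model=8).shape)  # (6, 8)
```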
The transformer model also put front and centre the concept of 'attention' (an idea that predates it, but which the Transformer relies on almost exclusively), which is where the model decides which parts of the input data are important at each step. This helps the transformer to focus on the bits of the data that are relevant to the task at hand.
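Below is a sketch of scaled dot-product attention, the specific flavour the Transformer uses. In a real model the queries, keys and values come from separate learned projections of the input; reusing the raw sequence for all three here is just to keep the example short.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Each output is a weighted mix of the values; the weights come from
    how strongly each query matches each key."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ values

seq = np.random.randn(6, 8)   # six tokens, eight dimensions each
# In a real Transformer, Q, K and V are separate learned projections of the
# input; reusing the raw sequence for all three keeps this sketch short.
out = scaled_dot_product_attention(seq, seq, seq)
print(out.shape)  # (6, 8)
```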
Since 2017, the number of variants of the original transformer has exploded, which stands testament to the popularity of this model architecture.
Regarding inputs and outputs, most transformers will take an input in the same shape as the word embeddings in the first post in this series, and will spit out an output of the same shape - one vector per element of the sequence, though the size of the embedding dimension may differ.
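A quick way to see that shape-in equals shape-out is to run some random 'token' vectors through PyTorch's built-in Transformer encoder (the sizes below are arbitrary, picked just for the demonstration):

```python
import torch
import torch.nn as nn

# Arbitrary sizes, chosen just to show the shapes going in and coming out.
d_model = 64
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randn(10, 1, d_model)  # (sequence length, batch size, embedding dim)
output = encoder(tokens)

print(tokens.shape, output.shape)  # both torch.Size([10, 1, 64])
```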
In this fashion, transformers are limited with respect to sequence length only by memory and compute (standard self-attention compares every element with every other, so its cost grows quadratically with the length of the sequence), and some model designs exploit this to convince a transformer-style model to learn to handle significantly longer sequences.
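To get a feel for that quadratic cost, here's some rough back-of-the-envelope arithmetic for the attention weight matrix alone, assuming 4-byte floats - the exact numbers depend heavily on the implementation, but the trend is the point:

```python
# Memory for a single (seq_len x seq_len) attention weight matrix,
# assuming 4-byte floats - and this is just one head of one layer.
for seq_len in (1_000, 10_000, 100_000):
    gigabytes = seq_len ** 2 * 4 / 1e9
    print(f"{seq_len:>7} tokens -> {gigabytes:.3f} GB")
```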
This is just a little look at sequenced data modelling with LSTMs and Transformers. I seem to have my ordering a bit backwards here, so in the next few posts we'll get down to basics and explain some concepts like Tensors and shapes, loss functions and backpropagation, attention, and how AI models are put together.
Are there any AI-related concepts or questions you would like answering? Leave a comment below and I'll write another post in this series to answer your question.
Sources and further reading
- The Illustrated Transformer
- Illustrated Guide to LSTMs and GRUs: A step by step explanation
- Attention is All You Need - the original paper about the transformer model architecture
- Long Short-Term Memory - the original paper about the LSTM
- Transformer models: an introduction and catalog
- The Unreasonable Effectiveness of Recurrent Neural Networks - an older post from 2015, but a very cool earlier character-based text generation model. The hallucination by that model is a great early example of the same issue that modern Transformer-based text generation models (a topic for another time) like ChatGPT still suffer from.