Defining AI: Image segmentation
Welcome back to Defining AI! This series is all about defining various AI-related terms in short-and-sweet blog posts.
In the last post, we took a quick look at word embeddings, which are key to how AI models understand text. In this one, we're going to investigate image segmentation.
Image segmentation is a particular way of framing a learning task in which a model takes an image as input and, instead of classifying the whole image into a given set of categories, classifies every pixel by the category it belongs to.
The simplest form of this is what's called semantic segmentation, where we classify each pixel via a given set of categories - e.g. building, car, sky, road, etc. if we were implementing a segmentation model for some automated vehicle.
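To make this a little more concrete, here's a minimal sketch of what calling a pre-trained semantic segmentation model might look like. It assumes PyTorch and torchvision (0.13 or newer) and uses torchvision's off-the-shelf DeepLabV3 model, which is trained on its own set of classes rather than the road-scene ones above - it's purely illustrative, and the filename is a placeholder:

import torch
from torchvision.io import read_image
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

# Load an off-the-shelf pre-trained semantic segmentation model
weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()

# 'street.png' is a placeholder filename - use any image you have to hand
image = read_image("street.png")
batch = weights.transforms()(image).unsqueeze(0)   # shape: [1, 3, height, width]

with torch.no_grad():
    output = model(batch)["out"]   # shape: [1, number_of_classes, height, width]
    # ...one value per class, per pixel - more on what these values mean below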
If you've been following my PhD journey for a while, you'll know that it's not just images that can be handled this way: any data that is 2D (or can be convinced to pretend to be 2D) can be framed as an image segmentation task.
The output of an image segmentation model is essentially a 3D map. This map has a width and height matching the input image, plus an extra dimension for the channel - one channel per class. This is best explained with an image:
(Above: a diagram explaining how the output of an image segmentation model is formatted. See the explanation below. Extracted from my work-in-progress thesis!)
Essentially, each value in the 'channel' dimension is the probability that the pixel belongs to that class. So, for example, a single pixel in a model for predicting the background and foreground of an image might look like this:
[ 0.3, 0.7 ]
....if we consider these classes:
[ background, foreground ]
....then this pixel has a 30% chance of being a background pixel, and a 70% chance of being a foreground pixel - so we'd likely assume it's a foreground pixel.
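As a tiny sketch of that decision step (using NumPy here just for illustration - the shapes and values are made up for this example):

import numpy as np

# Probabilities for a single pixel: [ background, foreground ]
pixel = np.array([0.3, 0.7])
print(pixel.argmax())   # -> 1, i.e. foreground

# The same idea applied to a whole (height x width x channel) probability map
probabilities = np.random.rand(256, 256, 2)                   # stand-in for a model's output
probabilities /= probabilities.sum(axis=-1, keepdims=True)    # normalise so each pixel sums to 1
class_map = probabilities.argmax(axis=-1)                     # shape: (256, 256), values 0 or 1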
Build this up over the course of an entire image, and you end up with a classification of every pixel. This could lead to models that separate the foreground and background in live video feeds, autonomous navigation systems, defect detection in industrial processes, and more.
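Taking the first of those as an example: once you have a per-pixel class map, separating the foreground from the background is just a matter of using it as a mask. A rough sketch continuing the NumPy example above (the image here is a random stand-in, purely for illustration):

# image: (height, width, 3) RGB array; class_map: (height, width) of 0s and 1s
image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)   # stand-in for a real photo
foreground_mask = class_map == 1                                        # True where the model says 'foreground'
foreground_only = image * foreground_mask[..., np.newaxis]              # background pixels become black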
Some models, such as the Segment Anything Model (website), have even been trained to generically segment any input image - as in the above image, where we have a snail sat on top of a frog, sat on top of a turtle, which is swimming in the water.
As alluded to earlier, you can also feed in other forms of 2D data. For example, this paper predicts rainfall radar data a given number of hours into the future from a sample at the present moment. Or, my own research approximates the function of a physics-based model!
That's it for this post. If you've got something you'd like defining, please do leave a comment below!
I'm not sure what it'll be next, but it might be either staying at a high level and looking at different ways we can frame tasks for AI models, or jumping to a lower level to look at fundamentals like loss (error) functions, backpropagation, layers (AI models are made up of multiple smaller layers), etc. What would you like to see next?