There are a handful of different ways that computers can look at images, and one of the options is “stuff” versus “things.” Instance segmentation can be used to find “things”: each individual person, car, tree, etc. in an image.
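If you want to poke at that idea in code, here's a minimal sketch of instance segmentation using torchvision's pretrained Mask R-CNN (the model choice and the file name are my own picks, not from a specific demo):

```python
# A minimal instance segmentation sketch: the model returns a separate mask,
# box, and label for every individual "thing" it finds in the image.
import torch
from PIL import Image
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = to_tensor(Image.open("street.jpg").convert("RGB"))  # hypothetical local image

with torch.no_grad():
    predictions = model([image])[0]

# One entry per detected instance: e.g. three separate "person" detections
# rather than one undifferentiated blob of "person" stuff.
for label, score in zip(predictions["labels"], predictions["scores"]):
    if score > 0.8:
        print(label.item(), round(score.item(), 2))
```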
Use cases
Honestly, the most common use of image description models these days is probably creating prompt ideas to feed into image generation models! You can also use them to find similar images.
Aside from just describing an image, these models can also perform zero-shot image classification, which means they can put images into categories without having explicitly seen those categories before. I can show one a picture and ask, “are they playing sports or playing music?” and because it knows generally what sports look like and what music looks like, it can handle a classification job that I’d otherwise have to fine-tune a model for.
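As a sketch of what that looks like in code, here's the sports-vs-music question run through Hugging Face's zero-shot image classification pipeline (the specific CLIP checkpoint and the image file are my assumptions):

```python
# A minimal zero-shot classification sketch: CLIP scores the image against
# candidate labels it was never explicitly trained to recognize.
from transformers import pipeline

# Checkpoint choice is an assumption; any CLIP-style model should work here.
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

results = classifier(
    "concert.jpg",  # hypothetical local image
    candidate_labels=["playing sports", "playing music"],
)

# Highest-scoring label first, e.g. [{'score': 0.97, 'label': 'playing music'}, ...]
print(results)
```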
Try it out
Image description
This example can help you think up prompts for image generation tools like Stable Diffusion.
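The demo itself isn't reproduced here, but a rough equivalent in code might look like this, captioning an image through the image-to-text pipeline (the BLIP checkpoint is my pick, not necessarily what the demo uses):

```python
# A minimal image-description sketch: caption an image, then reuse the caption
# as a starting point for a Stable Diffusion prompt.
from transformers import pipeline

captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",  # assumed checkpoint
)

result = captioner("vacation-photo.jpg")  # hypothetical local image
print(result[0]["generated_text"])
# e.g. "a woman standing on a beach at sunset" -> tweak it and feed it to Stable Diffusion
```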
Zero-shot image classification (Chinese!)
While CLIP is based on English, you can also find models that are not, like this Chinese example. But note that if you change the labels - the Candidate Labels field, 候选分类标签 ("candidate classification labels") - to English, it still gives you the right answers!
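To sketch what that looks like outside the demo widget, here's roughly how a Chinese CLIP model can be used directly (the OFA-Sys checkpoint, the labels, and the image file are my assumptions, not taken from the demo):

```python
# A minimal sketch of zero-shot classification with a Chinese CLIP model.
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

checkpoint = "OFA-Sys/chinese-clip-vit-base-patch16"  # assumed checkpoint
model = ChineseCLIPModel.from_pretrained(checkpoint)
processor = ChineseCLIPProcessor.from_pretrained(checkpoint)

image = Image.open("concert.jpg")  # hypothetical local image
labels = ["运动", "音乐"]  # "sports", "music" - English labels often work too

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Probability of each candidate label for the image
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```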
Models
Popular models
You’re definitely using CLIP for this!
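And since CLIP also powers the "finding similar images" use case mentioned above, here's a hedged sketch of comparing two images by their CLIP embeddings (the checkpoint and file names are my assumptions):

```python
# A minimal image-similarity sketch: embed two images with CLIP and compare
# them with cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

images = [Image.open("photo-a.jpg"), Image.open("photo-b.jpg")]  # hypothetical files
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    embeddings = model.get_image_features(**inputs)

# Cosine similarity near 1.0 means the two images "look alike" to CLIP
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```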