There are a handful of different ways that computers can look at images, and one of the distinctions is “stuff” versus “things.” Semantic segmentation labels every pixel of an image by class, which makes it good at measuring “stuff”: how many pixels are dedicated to people, cars, vegetation, etc. in an image, rather than counting individual “things.”
Use cases
Semantic segmentation is great at answering “how much land is covered by XXX?” when you’re working with satellite imagery. In the example below, the model has been trained to identify vegetation.
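If you want that pixel count yourself, here’s a minimal sketch using the transformers image-segmentation pipeline. The SegFormer checkpoint is just one reasonable choice (any semantic segmentation model on the hub works), and the image filename is a placeholder.

```python
import numpy as np
from transformers import pipeline

# Any semantic segmentation checkpoint works here; this SegFormer model
# was fine-tuned on ADE20K, which includes classes like "tree" and "grass".
segmenter = pipeline("image-segmentation", model="nvidia/segformer-b0-finetuned-ade-512-512")
results = segmenter("satellite_photo.png")  # placeholder path

# Each result carries a "label" and a binary PIL mask for that class.
for r in results:
    mask = np.array(r["mask"], dtype=bool)
    print(f"{r['label']}: {mask.mean():.1%} of the image")
```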
Try it out
Here’s a great example that identifies vegetation. Scroll down and click one of the examples.
Popular models
On the Hugging Face models list, semantic segmentation is filed under “image segmentation.” The most popular models include CLIPSeg and Facebook’s MaskFormer.
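Here’s a hedged sketch of CLIPSeg in particular, since its trick is free-text prompts: you can ask for “vegetation” directly instead of relying on a fixed label set. The checkpoint name is the one published on the hub; the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("satellite_photo.png").convert("RGB")  # placeholder path
inputs = processor(text=["vegetation"], images=[image], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The logits are a low-resolution heat map; thresholding gives a rough mask.
mask = torch.sigmoid(outputs.logits) > 0.5
print(f"vegetation covers ~{mask.float().mean().item():.1%} of the (resized) image")
```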
State of the art
If you poke around Papers with Code, the score really depends on the benchmark you’re looking at: mean IoU runs from around 60% on ADE20K val to almost 90% on Cityscapes.
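Those scores are mean IoU: per-class intersection over union, averaged. A toy sketch of the computation:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Average per-class IoU, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class appears in neither map
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))

# Toy 2x2 label maps with three classes; one mislabeled pixel.
pred = np.array([[0, 1], [2, 2]])
target = np.array([[0, 1], [1, 2]])
print(f"mIoU: {mean_iou(pred, target, num_classes=3):.2f}")  # 0.67
```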
The state-of-the-art models appear to be InternImage-H and BEiT-3.