Visual Question Answering (VQA) is a way to ask questions about pictures! If we want to get technical, it isn’t actually an image model: it’s a multimodal model that understands both images and text. The text part comes in when the model figures out what your question means.
Use cases
News outlets had a nasty habit of using pictures of Asian people in stories about COVID, even if the stories were 100% about the United States. During a talk about AI editors, I showed how you could run these images through a VQA model and ask things like “is there an Asian person in this image?” or “is this a photograph of Chinatown?”
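Here’s a rough sketch of what that looks like with the Hugging Face transformers pipeline. The image filename is made up, and the default checkpoint the pipeline downloads may change over time:

```python
from transformers import pipeline

# Load a visual question answering pipeline. With no model specified,
# it downloads a default VQA checkpoint (currently a ViLT model).
vqa = pipeline("visual-question-answering")

# "covid_story_photo.jpg" is a stand-in for whatever image ran with the article
questions = [
    "Is there an Asian person in this image?",
    "Is this a photograph of Chinatown?",
]

for question in questions:
    # top_k=1 keeps only the highest-scoring answer
    result = vqa(image="covid_story_photo.jpg", question=question, top_k=1)
    print(question, result)
```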
Try it out
This example allows you to compare several different visual question answering models.
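A rough sketch of that kind of comparison, assuming a recent transformers release that can run both of the checkpoints below through the VQA pipeline:

```python
from transformers import pipeline

image = "street_scene.jpg"  # any local image path or URL; this one is a placeholder
question = "Is this a photograph of Chinatown?"

# Two popular VQA checkpoints on the Hugging Face Hub
checkpoints = [
    "dandelin/vilt-b32-finetuned-vqa",
    "Salesforce/blip-vqa-base",  # needs a transformers version with BLIP pipeline support
]

# Ask the same question of each model and compare the answers
for checkpoint in checkpoints:
    vqa = pipeline("visual-question-answering", model=checkpoint)
    answer = vqa(image=image, question=question, top_k=1)
    print(checkpoint, answer)
```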
VQA Models
Popular models
If you look at the Hugging Face models page, VQA isn’t nearly as popular as text-based tasks. The BLIP model from Salesforce is a popular choice, though, as is ViLT.
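If you’d rather skip the pipeline, BLIP can be called directly with the usual processor-and-model pattern. This is just a sketch, and the image URL is a placeholder:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load the Salesforce BLIP checkpoint fine-tuned for VQA
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Any RGB image works; this URL is just an example
url = "https://example.com/photo.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

question = "How many people are in the photo?"
inputs = processor(image, question, return_tensors="pt")

# BLIP generates its answer as free-form text
output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))
```

ViLT, by contrast, treats VQA as classification over a fixed vocabulary of answers, which is why the pipeline gives you scored answers rather than generated text.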
State of the art
If you poke around Papers with Code, the most popular evaluation is VQA v2 test-dev. The top models score a nudge above 80% accuracy.
You can see the kinds of questions the test asks on the VQA v2 teaser page.