Visual Question Answering (VQA) is a way to ask questions about pictures! If we want to get technical, it isn’t actually an image model: it’s a multimodal model that understands both images and text. The text part comes in when the model figures out what your question means.
Use cases
News outlets had a nasty habit of using pictures of Asian people in stories about COVID, even if the stories were 100% about the United States. During a talk about AI editors, I showed how you could run these images through a VQA model and ask things like “is there an Asian person in this image?” or “is this a photograph of Chinatown?”
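Here’s a rough sketch of what that looks like with the Hugging Face transformers pipeline. The image filename is made up, and the default checkpoint the pipeline downloads may change over time:

```python
from transformers import pipeline

# Load a visual question answering pipeline. With no model specified,
# it downloads a default VQA checkpoint (currently a ViLT model).
vqa = pipeline("visual-question-answering")

# "covid_story_photo.jpg" is a stand-in for whatever image ran with the article
questions = [
    "Is there an Asian person in this image?",
    "Is this a photograph of Chinatown?",
]

for question in questions:
    # top_k=1 keeps only the highest-scoring answer
    result = vqa(image="covid_story_photo.jpg", question=question, top_k=1)
    print(question, result)
```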
Try it out
This example allows you to compare several different visual question answering models.
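A rough sketch of that kind of comparison, assuming a recent transformers release that can run both of the checkpoints below through the VQA pipeline:

```python
from transformers import pipeline

image = "street_scene.jpg"  # any local image path or URL; this one is a placeholder
question = "Is this a photograph of Chinatown?"

# Two popular VQA checkpoints on the Hugging Face Hub
checkpoints = [
    "dandelin/vilt-b32-finetuned-vqa",
    "Salesforce/blip-vqa-base",  # needs a transformers version with BLIP pipeline support
]

# Ask the same question of each model and compare the answers
for checkpoint in checkpoints:
    vqa = pipeline("visual-question-answering", model=checkpoint)
    answer = vqa(image=image, question=question, top_k=1)
    print(checkpoint, answer)
```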
VQA Models
Popular models
If you look at the Hugging Face models page, VQA isn’t nearly as popular as text-based tasks. The BLIP model from Salesforce is a popular choice, though, as is ViLT.
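If you’d rather skip the pipeline, BLIP can be called directly with the usual processor-and-model pattern. This is just a sketch, and the image URL is a placeholder:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load the Salesforce BLIP checkpoint fine-tuned for VQA
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Any RGB image works; this URL is just an example
url = "https://example.com/photo.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

question = "How many people are in the photo?"
inputs = processor(image, question, return_tensors="pt")

# BLIP generates its answer as free-form text
output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))
```

ViLT, by contrast, treats VQA as classification over a fixed vocabulary of answers, which is why the pipeline gives you scored answers rather than generated text.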
State of the art
If you poke around Papers with Code, the most popular evaluation is VQA v2 test-dev. The top models score a nudge above 80% accuracy.
You can see the kinds of questions the test asks on the VQA v2 teaser page.