Image classification allows you to put images into categories. Typically these are yes/no questions (after the list is a quick sketch of how these become labels):
- Does this satellite image show illegal mining operations?
- Does this photograph have a car in it?
- Does this x-ray show a broken bone?
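Each of those questions can be rephrased as a small set of candidate labels for a classifier to pick between. The wording below is only a guess at what might work; you'd tune the labels for your own task.

```python
# Hypothetical label pairs for the yes/no questions above
questions_as_labels = {
    "illegal mining": ["a satellite image of a mining operation", "a satellite image with no mining"],
    "car": ["a photograph with a car", "a photograph without a car"],
    "broken bone": ["an x-ray of a broken bone", "an x-ray with no broken bones"],
}
```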
Custom image classifiers are simple to build yourself! You can visit the build your own image classifier page for more details.
Image classifiers are on the simpler end of the visual analysis spectrum. More advanced models can show exactly where things are in an image (e.g. vegetation, swimming pools), or segment out those individual things (e.g. cars, people). Keep reading to learn more about semantic segmentation and object detection.
You might also be interested in models that can describe images, which can often also do classification!
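If you want to try that route, here's a minimal sketch using the Hugging Face image-to-text pipeline. The model choice is just an assumption (any captioning model on the Hub works the same way), and the image is the painting we classify later on this page.

```python
from transformers import pipeline

# Assumed model choice; other image captioning models work similarly
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Returns a caption like [{'generated_text': '...'}]
captioner("https://i.etsystatic.com/5554021/r/il/31ad32/644701138/il_1588xN.644701138_6gh4.jpg")
```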
Try it out
In the intro to fine-tuning we build a detector for illegal amber mines. Click the examples below to see how it performs (although let’s be honest: right now I’m cheating by giving it images it’s already seen).
Python code
In this example we’re going to use zero-shot image classification to see whether a painting is of a woman or a man. It’s called “zero-shot” because we aren’t customizing a model specifically for the task, but relying on a pre-trained vision model to know the difference between the two.
We’ll start with our imports and pulling in the model.
```python
from PIL import Image
import requests
import numpy as np
from transformers import CLIPProcessor, CLIPModel

# Using https://huggingface.co/openai/clip-vit-large-patch14
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```
Then we read in an image from the internet and define the two choices we're going to let the model select between.
= "https://i.etsystatic.com/5554021/r/il/31ad32/644701138/il_1588xN.644701138_6gh4.jpg"
url = Image.open(requests.get(url, stream=True).raw)
image
= ["a painting of a woman", "a painting of a man"] choices
Finally, we (roughly) just use the rest of the example from the documentation. Those model pages are a delight when their code works right out of the box.
```python
inputs = processor(text=choices, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
```python
# Probabilities for each potential class
# We use [0] because we only passed one image
prediction_probs = probs.tolist()[0]
prediction_probs
```

```
[0.9841192364692688, 0.015880735591053963]
```
We're given two probabilities, in order matching the classes we offered. Below we use `np.argmax` to figure out which index (`0` or `1`) has the higher probability, and then display that choice.
```python
# Which class?
prediction = np.argmax(prediction_probs)
choices[prediction]
```

```
'a painting of a woman'
```
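If you'd rather see every choice paired with its probability (handy once you have more than two), one quick option is to zip them together. This is just a convenience sketch using the variables we already have:

```python
# Pair each choice with its probability for easier reading
for choice, prob in zip(choices, prediction_probs):
    print(f"{prob:.1%} {choice}")
```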
This is not a fantastic prompt: what about groups? What about paintings of vegetables? But it’s a start!
You aren't restricted to two choices: you're welcome to add as many as you'd like.
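For example, to take the groups-and-vegetables worry above seriously, you could extend the list of labels. The extra label wording here is just a guess; everything else mirrors the code we already ran.

```python
# A longer, hypothetical list of candidate labels
choices = [
    "a painting of a woman",
    "a painting of a man",
    "a painting of a group of people",
    "a painting of vegetables",
]

inputs = processor(text=choices, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)

# Same approach as before: the highest-probability label wins
prediction_probs = probs.tolist()[0]
choices[np.argmax(prediction_probs)]
```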