Image classification

Putting images into buckets

Image classification lets you sort images into categories. Typically these are yes/no questions.

Custom image classifiers are simple to build yourself! You can visit the build your own image classifier page for more details.

Image classifiers are on the simpler end of the visual analysis spectrum. More advanced models can show exactly where things are in an image (e.g. vegetation, swimming pools), or segment out those individual things (e.g. cars, people). Keep reading to learn more about semantic segmentation and object detection.

You might also be interested in models that can describe images, which can often also do classification!

Try it out

In the intro to fine-tuning we build a detector for illegal amber mines. Click the examples below to see how it performs (although let’s be honest: right now I’m cheating by giving it images it’s already seen).

Python code

In this example we’re going to use zero-shot image classification to see whether a painting is of a woman or a man. It’s called “zero-shot” because we aren’t customizing a model specifically for the task, but relying on a pre-trained vision model to know the difference between the two.

We’ll start by importing what we need and loading the model.

from PIL import Image
import requests
import numpy as np

from transformers import CLIPProcessor, CLIPModel

# Using the openai/clip-vit-large-patch14 checkpoint
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

Then we read in an image from the internet and define the two choices that we’re going to be allowing the model to select between.

url = ""  # URL of the painting to classify
image = Image.open(requests.get(url, stream=True).raw)

choices = ["a painting of a woman", "a painting of a man"]

Finally, we (roughly) just use the rest of the example from the documentation. Those model pages are a delight when their code works right out of the box.

inputs = processor(text=choices, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities

# Probabilities for each potential class
# We use [0] because we only passed one image
prediction_probs = probs.tolist()[0]
[0.9841192364692688, 0.015880735591053963]

We’re given two probabilities, in order matching the classes we offered.
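
Because the order is the only thing linking a probability to its class, it can help to pair them up explicitly. The snippet below is a minimal sketch that reuses the two probabilities printed above; the name `labeled_probs` is just an illustrative choice, not something from the CLIP documentation.

```python
choices = ["a painting of a woman", "a painting of a man"]
# Probabilities copied from the output shown above
prediction_probs = [0.9841192364692688, 0.015880735591053963]

# Pair each choice with its probability; zip relies on both lists sharing an order
labeled_probs = dict(zip(choices, prediction_probs))

for label, prob in labeled_probs.items():
    print(f"{label}: {prob:.1%}")
# a painting of a woman: 98.4%
# a painting of a man: 1.6%
```

A dictionary like this also makes it harder to mix up the classes if you later reorder or extend the list of choices.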

Below we use np.argmax to find the index of the higher probability (0 or 1) and then display the matching choice.

# Which class? argmax returns the index of the largest probability
prediction = np.argmax(prediction_probs)
choices[prediction]
'a painting of a woman'

This is not a fantastic prompt: what about groups? What about paintings of vegetables? But it’s a start!
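
One reason the prompt falls short is that softmax always spreads 100% across the choices you offered, so an image that matches neither still gets a "winner". The sketch below illustrates this with a small NumPy softmax and made-up scores. The numbers in `weak_scores` are hypothetical and are not real CLIP outputs.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    exps = np.exp(scores - scores.max())  # subtracting the max avoids overflow
    return exps / exps.sum()

choices = ["a painting of a woman", "a painting of a man"]

# Hypothetical scores for an image that matches neither choice well.
# Both are low, but one is slightly higher than the other.
weak_scores = [3.0, 1.0]
weak_probs = softmax(weak_scores)

# The probabilities still sum to 1, so one choice looks like a clear winner
winner = choices[int(np.argmax(weak_probs))]
print(winner, weak_probs.round(3))
```

One possible workaround, untested here, is to add a catch-all choice such as "a painting of something else" so the model has somewhere to put images that fit neither class.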

You aren’t restricted to two choices: you’re welcome to add as many as you’d like.
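
With more than two choices, a single argmax hides the runners-up, so it can be handy to rank every choice. Here is a rough sketch of that idea. The helper `rank_choices`, the extra labels, and the probabilities are all hypothetical and only there to show the shape of the code.

```python
import numpy as np

def rank_choices(choices, probs):
    """Return (choice, probability) pairs sorted from most to least likely."""
    order = np.argsort(probs)[::-1]  # indices from highest to lowest probability
    return [(choices[i], float(probs[i])) for i in order]

# Hypothetical labels and probabilities, for illustration only
choices = [
    "a painting of a woman",
    "a painting of a man",
    "a painting of a group of people",
    "a painting of vegetables",
]
probs = [0.70, 0.06, 0.20, 0.04]

for label, prob in rank_choices(choices, probs):
    print(f"{prob:.2f}  {label}")
```

In the real pipeline you would pass the longer list as `choices` to the processor and feed the resulting `prediction_probs` into a helper like this one.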