Text classification

Putting documents into buckets

Text classification is one of the most basic yet most useful machine learning methods: it’s the process of putting pieces of text into categories.

A “category” can be practically anything, from whether a document is worth a closer look to whether a comment is toxic.

Use cases

At least in journalism, the most common use case for text classification is when we don’t want to read ten billion documents and only want to see the ones we’re likely to be interested in. Maybe there was a big leak of emails, maybe we need to dig through a hefty amount of legislation, or maybe we have a bunch of public comments on a government policy. It works like this:

  1. We read a portion of the documents, sorting them into two categories: YES we’re interested in this one, or NO we’re not.
  2. We feed those labeled documents into the machine, and the machine learns the difference between the two categories.
  3. We show it the rest of the documents, and it sorts them into YES and NO piles for us.
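
The three steps above can be sketched in a few lines of Python. This is only an illustration: it uses a tiny naive Bayes word-count classifier written with the standard library, standing in for whatever model you would really use (in practice, most likely a fine-tuned language model). The example documents, the `tokenize`, `train`, and `classify` helpers, and the YES/NO labels are all made up for this sketch.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_docs):
    """Count words per label from (text, label) pairs we sorted by hand."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Prior: how common this label was among the hand-sorted documents.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words don't zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Step 1: a hand-sorted sample (hypothetical documents for illustration).
labeled = [
    ("budget vote scheduled for the contract award", "YES"),
    ("contract payment approved without a public vote", "YES"),
    ("lunch order for the office party", "NO"),
    ("reminder about the parking garage and party", "NO"),
]

# Step 2: the machine learns the difference between the two piles.
model = train(labeled)

# Step 3: it sorts the documents we have not read yet.
unread = [
    "who approved the contract payment",
    "pizza for the office lunch",
]
piles = {doc: classify(doc, model) for doc in unread}
print(piles)
```

With these four toy examples, the first unread document lands in the YES pile and the second in the NO pile. A real project would need far more hand-sorted documents than this, but the shape of the workflow is the same.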

Try it out

Below is a toxic comment detector, which classifies comments into various forms of toxicity. Be kind, be mean, and see how it reacts! You can also click one of the examples below.

Models

State of the art

State of the art isn’t really a big deal here: most of the time, our text classifiers are going to be fine-tuned versions of common large language models. The capability of our classifier isn’t going to be limited by the model we choose (they’re all pretty good!) but rather by how much of our dataset we manually categorize. The more the better!

To learn more, check out building your own text classifier or the introduction to using AutoTrain on Hugging Face (it’s about images but you’ll get the idea).