Text classification is one of the most basic yet most useful machine learning methods: it’s the process of putting pieces of text into categories.
A “category” can be practically anything:
- A tweet can be positive or negative
- A book can be romance, sci-fi, or literary fiction
- A document can be Spanish, Portuguese, or French
Use cases
At least in journalism, the most common use case for text classification is when we don't want to read ten billion documents: we only want to see the ones we're likely to be interested in. Maybe there was a big leak of emails, maybe we need to dig through a hefty amount of legislation, maybe we have a bunch of public comments on a government policy. It works like this:
- We read a portion of them, sorting them into two categories: YES, we're interested in this one, or NO, we're not.
- We feed those into the machine, and the machine learns the difference between the two.
- We show it the rest of the documents, and it sorts them into YES and NO piles for us (there's a quick sketch of this below).
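Here's a minimal sketch of that workflow in code, using scikit-learn's TfidfVectorizer and LogisticRegression as a simple stand-in for the fancier models discussed later. The documents and YES/NO labels are made up purely for illustration.

```python
# A minimal sketch of the read-some / train / sort-the-rest workflow.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: the portion we read and sorted by hand
labeled_docs = [
    "Re: campaign finance disclosures for Q3",
    "Lunch menu for the office party",
    "Follow-up on the zoning variance request",
    "Your package has shipped",
]
labels = ["YES", "NO", "YES", "NO"]

# Step 2: the machine learns the difference between the two piles
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(labeled_docs, labels)

# Step 3: it sorts the rest of the documents for us
unread_docs = [
    "Meeting notes about the finance disclosures",
    "Reminder: parking garage closed Friday",
]
print(classifier.predict(unread_docs))  # e.g. ['YES' 'NO']
```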
Try it out
Below is a toxic comment detector, which classifies comments as various forms of toxicity. Be kind, be mean, and see how it reacts! There are also examples below that you can click.
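If you'd rather poke at a toxicity classifier in code than in a widget, something like the sketch below works. It assumes the unitary/toxic-bert model from Hugging Face, which may or may not be the exact model behind the demo above.

```python
from transformers import pipeline

# "text-classification" pipelines return the highest-scoring label by default;
# top_k=None asks for a score for every label instead.
detector = pipeline("text-classification", model="unitary/toxic-bert")

print(detector("Have a wonderful day!", top_k=None))
print(detector("You are the worst person I have ever met.", top_k=None))
# Each comment gets a score for each kind of toxicity the model knows about
# (toxic, insult, threat, and so on).
```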
Models
Popular models
Any modern language model works fine for this. You can find plenty on Hugging Face; they'll typically have BERT or LM somewhere in the name. You might even find one pre-trained for the exact classification task you're looking for!
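When a pre-trained model for your task already exists, the whole classifier can be a couple of lines. The sketch below assumes the distilbert-base-uncased-finetuned-sst-2-english sentiment model from Hugging Face; swap in whatever model fits your own task.

```python
from transformers import pipeline

# Load a classifier someone has already fine-tuned for sentiment.
sentiment = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(sentiment("I loved every page of this book."))
# Something like [{'label': 'POSITIVE', 'score': 0.99...}]
```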
State of the art
State of the art isn't really a big deal here: the majority of the time our text classifiers are going to be fine-tuned versions of common large language models. The capability of our classifier isn't going to be constrained by the model we choose (they're all pretty good!) but rather by how much of our dataset we manually categorize. The more the better!
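To give a feel for what that fine-tuning looks like, here's a bare-bones sketch using the Hugging Face Trainer. The base model (distilbert-base-uncased), the hand-labeled examples, and the training settings are all placeholder assumptions; a real project would want far more labeled documents.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# The documents we categorized by hand: 1 = interesting, 0 = not
data = Dataset.from_dict({
    "text": [
        "Memo on undisclosed lobbying payments",
        "Office birthday cake schedule",
        "Draft amendment to the procurement bill",
        "Weekly parking reminder",
    ],
    "label": [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = data.map(tokenize, batched=True)

# Fine-tune the base model on our hand-labeled examples.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="classifier", num_train_epochs=3),
    train_dataset=tokenized,
)
trainer.train()
```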
To learn more, check out building your own text classifier or the introduction to using AutoTrain on Hugging Face (it’s about images but you’ll get the idea).