First things first: we need to learn about models. Models are the foundation of every single artificial intelligence or machine learning system out there!
This is by far the longest, most detailed page on the site! I’m biased, but I do think it’s absolutely worth reading through.
The sections below provide the foundation to understand the rest of the site: buzzwords like models, training, and fine-tuning, along with how you might leverage (or even customize!) AI/ML tools in your own work.
Generally speaking, a model takes the wild, complicated systems of the world and reduces them to something simpler. Machine learning and AI models learn to recognize patterns in the data given to them, which they then use to make predictions.
It’s tough to pin down what we mean by “data” and “predictions,” because they could be pretty much anything:
The data could be almost anything that exists on a computer: a document, a photograph, a web page, an audio file, a video…
The prediction could be as simple as a yes/no answer or a category, or as complicated as coordinates where it found a bunny rabbit or an image of a shoe!
A model might learn that the word “love” is positive, what a coffee cup looks like, or where to find the total on a receipt. It’s practically limitless!
In this walkthrough we’re going to use a simple, relatable model to explore what AI or machine learning models can do, how they are made, and how they might disagree with one another (even if they’re doing the same thing!). Along the way you’ll learn all about sentiment analysis, the ability of a computer to detect human emotion in pieces of text.
Introduction to sentiment analysis
Let’s say we’re Coca-Cola, and we release a new product called New Coke. How can we tell whether people like it or not?
Well, we could go interview a lot of people about their thoughts on New Coke, but that doesn’t seem very modern! Instead, we’ll just download 1,000,000 tweets that mention New Coke and see if most of the tweets are positive or negative.
But how are we going to tell whether those tweets are positive or negative?
Well, we could pay an army of interns to review all of those tweets, but we’d rather spend that money on donuts. Instead, we’ll just have a computer tell us whether each tweet is positive or negative.
That’s sentiment analysis: a computer looking at some text and telling us the emotion in it!
In this case, we’re going to be using a sentiment analysis model. While we could pay a person with a fully functioning understanding of language, irony, criticism, culture, etc., to rate the tweets, that’s just too much work. Instead we’ll use a simplified model that a computer can operate.
Performing sentiment analysis
We’re going to be using a tool called Hugging Face to do sentiment analysis for us. If you can’t code, no worries! It’s super simple and we only care about the output.
We’ll start by performing sentiment analysis on a very simple, very easy statement: I love you (we borrowed the code from a blog post about how to do sentiment analysis in Python).
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you"]
sentiment_pipeline(data)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
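Below the warning comes the actual answer: a list with a label and a score for each piece of text we sent in. It looks something like this (the exact trailing digits may vary from run to run):

[{'label': 'POSITIVE', 'score': 0.9999}]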
According to this tool, “I love you” is POSITIVE with a score of 0.9999! That seems pretty accurate and pretty fantastic.
One thing that doesn’t seem fantastic, though, is the warning we got:
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english). Using a pipeline without specifying a model name and revision in production is not recommended.
We don’t understand the details, but it seems like we should probably specify a model name.
Browsing Hugging Face’s list of sentiment analysis models, we go ahead and pick twitter-roberta-base-sentiment to test out. The website says it’s been downloaded over a million times in the past month, and based on the name it seems to know something about Twitter (remember, we’re supposedly analyzing tweets!). It’s a perfect match!
Let’s replace the distilbert-base-uncased-finetuned-sst-2-english model with this new model and see what happens:
sentiment_pipeline = pipeline("sentiment-analysis", model='cardiffnlp/twitter-roberta-base-sentiment')
data = ["I love you"]
sentiment_pipeline(data)
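This model phrases its answer a little differently. Instead of spelling out POSITIVE, it uses numbered labels: for this model, LABEL_0 is negative, LABEL_1 is neutral, and LABEL_2 is positive. The output looks something like:

[{'label': 'LABEL_2', 'score': 0.9557}]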
A positive label sounds good, but there’s a catch: our last model said “I love you” was positive with a 0.9999 probability, while this one scores it at 0.9557 instead.
Is one of those responses more correct than the other?
One person might say “I love you” is just about as positive as you can possibly get. It’s the ultimate positive phrase, and deserves a nice high 0.9999 score.
Another person might say that even though “I love you” is supposed to be the ultimate positive phrase, what about “I really love you,” or “I really really really love you?” They’re even more positive than a simple “I love you” and deserve a higher score! Therefore 0.9557 is perfectly fine for a basic boring love proclamation.
A third person says well, I’m not here to rank anyone’s love, let’s just all agree that they’re positive and ignore the score completely!
An important lesson
Just like people can have different opinions, different models can disagree with each other.
While some things are reasonably objective (love vs. hate being positive vs. negative, for example), other situations involving edge cases, nuance, or degree aren’t always so easy.
Why did they disagree? We’ll get to that soon!
Limits of knowledge
While models will often disagree about details, there’s another situation that might come up: sometimes models are just plain ignorant!
Let’s bring twitter-roberta some international French romance with Je t’adore.
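The code is a minimal sketch reusing the pipeline setup from earlier, just with the French sentence swapped in:

# Same twitter-roberta model as before, now judging a French sentence
sentiment_pipeline = pipeline("sentiment-analysis", model='cardiffnlp/twitter-roberta-base-sentiment')
data = ["Je t'adore"]
sentiment_pipeline(data)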
A proclamation of love in the most romantic language of them all, and twitter-roberta gives it a LABEL_1: neutral! We might not know anything about French, but we know that’s just plain wrong.
Finding more models
Luckily there are many, many models out there, so chances are we can find one that speaks French!
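Based on the name alone, cardiffnlp/twitter-xlm-roberta-base-sentiment looks promising: it’s the multilingual cousin of the model we just used (XLM models are trained on many languages at once). Here’s a sketch of trying it on both of our sentences:

# A multilingual model: same idea, but trained on many languages
sentiment_pipeline = pipeline(model="cardiffnlp/twitter-xlm-roberta-base-sentiment")
data = ["I love you", "Je t'adore"]
sentiment_pipeline(data)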
The positive score isn’t as high as we might have hoped. But at least the model knows “I love you” in English is positive! Before we tracked down this multilingual model, our old model thought “Je t’adore” was neutral.
An important lesson
Models only know what they’ve been taught: just like my cat can’t load the dishwasher, some models only know English and can’t judge French sentiment. When you start using a model, you’ll want to double-check that it understands what you’re asking it to work on.
How models learn
Why does one model know French and another doesn’t? Or how do two models end up disagreeing about the same sentence? Just like people, models have different backgrounds.
There are two major ways that models learn: training and fine-tuning. We’ll talk about them more in the next chapter, but for now here’s a short overview.
Training
Training is the process of teaching a model from the very beginning. Think of the model as a little baby who needs to learn to crawl and walk and do all those things starting from absolutely zero experience!
In the case of sentiment analysis, we might train the model by showing it examples of positive tweets and saying “this is a positive tweet!”, then showing it examples of negative tweets and saying “this is a negative tweet!”. Over time it learns that the word “love” shows up more often in positive tweets, while “hate” shows up more often in negative tweets.
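To make that concrete, here’s a toy sketch using scikit-learn. The example tweets and labels are made up, and the Hugging Face models above are far more sophisticated neural networks, but the core idea of learning word/label associations from examples is the same:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# A handful of made-up labeled tweets: 1 = positive, 0 = negative
tweets = [
    "I love New Coke", "This is great, love it", "What a fantastic day",
    "I hate New Coke", "This is terrible, hate it", "What an awful day",
]
labels = [1, 1, 1, 0, 0, 0]

# Count the words in each tweet, then learn which words show up
# more often in positive examples vs. negative ones
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(tweets)
classifier = LogisticRegression().fit(features, labels)

# The trained model now judges new tweets using those learned word weights
classifier.predict(vectorizer.transform(["I love this"]))  # should be 1: positive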
In our case: the model that understands French was also trained on French data instead of just English. And the two models that disagree about exactly how positive a sentence is might have just seen different examples of positive and negative tweets over their “lifetime,” which makes them come to different score conclusions.
A problem with training is it can require a lot of data to teach a model effectively. Another option is fine-tuning.
Fine-tuning
Fine-tuning is taking an existing model and teaching it something new. It’s like finding someone who knows French and teaching them Spanish or German: they might not be perfect immediately, but some of the skills they know already probably transfer over (and they’re definitely better than a baby!).
For the trained-from-scratch model, once it learned that “love” was positive, that was the end of things. But not for the fine-tuned model! Since the model understands relationships between words, the model might also understand that “loving,” “like” and other similar words should increase the positive score.
If you look at the model names we’ve used so far, they’re twitter-xlm-roberta-base-sentiment and distilbert-base-uncased-finetuned-sst-2-english. The first one is actually a fine-tuned version of a model called RoBERTa and the second is a fine-tuned version of a model called DistilBERT!
These original models are general-purpose language models that understand how language and words work, and the ones we’re using have been tweaked specifically to learn sentiment analysis.
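For a rough sense of what that looks like in code, here’s a hedged sketch (not the exact recipe used to build these particular models) of loading a general-purpose base model with a brand-new classification head, ready for fine-tuning:

from transformers import AutoModelForSequenceClassification

# Start from the general-purpose DistilBERT base model and attach a fresh,
# untrained two-label head: the typical starting point for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)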
How do you find positive or negative tweets to train your model on in the first place? Finding data is one of the big challenges of training (or fine-tuning) machine learning and AI models.
One dataset I find amazing is Sentiment140, a dataset of 1.6 million tweets tagged as positive or negative. Did they pay hundreds of people to read through each tweet, marking it as positive or negative? No, they just looked for tweets with :) and :( and marked them as positive or negative!
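In code, the trick is almost embarrassingly simple. A sketch, with made-up example tweets:

# Label tweets automatically based on their emoticons, Sentiment140-style
def auto_label(tweet):
    if ":)" in tweet:
        return "positive"
    if ":(" in tweet:
        return "negative"
    return None  # no emoticon, so skip this tweet

auto_label("New Coke is great :)")   # positive
auto_label("New Coke is awful :(")   # negative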
You’ll often find sentiment models trained on movie reviews from IMDB or product reviews from Amazon. That’s because, just like the last example, you don’t need to ask anyone whether each review is positive or negative before you feed it to the model: every review automatically comes with a score! You can be confident that one star is negative and five stars is positive (although what to make of the ratings in between is more of an editorial decision).
We’ll discuss this much more in upcoming chapters.
How different are each of these models? Let’s take three of them and compare their results across a handful of sentences.
import pandas as pd
from transformers import pipeline

pd.options.display.max_colwidth = None
pd.options.display.float_format = '{:.3f}'.format

df = pd.DataFrame({'content': [
    "I love love love love this kitten",
    "I hate hate hate hate this keyboard",
    "I'm not sure how I feel about toast",
    "Sales of bad candy are up 200% in the third quarter",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
]})

# Twitter sentiment
sentiment_pipeline = pipeline(model="cardiffnlp/twitter-xlm-roberta-base-sentiment")
results = sentiment_pipeline(df.content.tolist())
results = pd.DataFrame(results).add_prefix('twitter-')
df = df.join(results)

# Amazon review sentiment
sentiment_pipeline = pipeline(model="LiYuan/amazon-review-sentiment-analysis")
results = sentiment_pipeline(df.content.tolist())
results = pd.DataFrame(results).add_prefix('amazon-')
df = df.join(results)

# Financial news sentiment
sentiment_pipeline = pipeline(model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")
results = sentiment_pipeline(df.content.tolist())
results = pd.DataFrame(results).add_prefix('finance-')
df = df.join(results)

df
   content                                                                   twitter-label  twitter-score  amazon-label  amazon-score  finance-label  finance-score
0  I love love love love this kitten                                         positive       0.889          5 stars       0.931         neutral        1.000
1  I hate hate hate hate this keyboard                                       negative       0.937          1 star        0.619         neutral        1.000
2  I'm not sure how I feel about toast                                       negative       0.605          3 stars       0.378         neutral        1.000
3  Sales of bad candy are up 200% in the third quarter                       neutral        0.524          1 star        0.454         positive       1.000
4  Did you see the baseball game yesterday?                                  neutral        0.849          5 stars       0.504         neutral        1.000
5  The package was delivered late and the contents were broken               negative       0.816          1 star        0.609         neutral        0.996
6  Trashy television shows are some of my favorites                          positive       0.770          5 stars       0.815         neutral        1.000
7  I'm seeing a Kubrick film tomorrow, I hear not so great things about it.  negative       0.418          3 stars       0.439         neutral        0.995
8  I find chirping birds irritating, but I know I'm not the only one         negative       0.897          3 stars       0.462         neutral        0.992
The first thing we’ll notice is that the three models don’t all stick to a simple positive/negative split: the Twitter model chooses between positive, neutral, and negative; the Amazon review model hands out one to five stars; and the finance model also uses positive, neutral, and negative.
We might make an assumption that 1-2 stars is negative, 3 is neutral, and 4-5 is positive. Even then, the sentiment scores between the three models are rarely the same!
The most interesting row is “Sales of bad candy are up 200% in the third quarter.” This sentence is neutral to the Twitter sentiment model, but it’s the only positive sentence for the financial model! Thanks to the power of fine-tuning, everything the finance model might have once known about the beauty of the world has been discarded in favor of sales, sales, sales!
An important lesson
Whoever built the model is in control of what the model thinks, and the model only knows what it’s been shown. As much as we as humans might think loving a kitten is a positive thing, it isn’t the case for a model that only knows about finance!
Labels and text classification
The Amazon review model showed us something interesting: sentiment analysis doesn’t need to be explicitly about sentiment. In the same way that a model can learn to associate words with a positive or negative tweet, it can also learn to rate a piece of text as one, two, three, four, or five stars.
And to be honest with you: it goes far beyond that. You can train a model to put a piece of text into any sort of categories, from scanning for toxic comments online to determining whether a legislative bill is about gun control.
Sentiment analysis is just one tiny subset of text classification models, which are (predictably) about classifying pieces of text into categories. Some examples of text classification models from Hugging Face:
While sentiment analysis has pretty limited use cases for us (sorry Coca-Cola!), text classification as a whole opens up a wide, wide world of opportunities. In the next section we’ll look at how to fine-tune our own text classifier and how the modern methods contrast with the hand-crafted machine learning techniques of the past.