Document similarity (or sentence similarity) can be used to find pieces of text that are related to one another. Through a bit of magic, each document is transformed into a list of numbers called an embedding (384 numbers, or dimensions, for a model like all-MiniLM-L6-v2), and documents about the same topics end up with similar numbers. The individual dimensions don’t mean anything on their own, but you can read more about word embeddings here.
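To make "similar numbers" a little more concrete: one common way to compare two embeddings is cosine similarity, which is close to 1 when the vectors point the same way and close to 0 when they are unrelated. Here is a minimal sketch in plain Python. The three tiny vectors and their names are made up for illustration; a real model would produce hundreds of numbers per document.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# ASSUMPTION: made-up 4-number "embeddings", not real model output.
budget_memo = [0.9, 0.1, 0.0, 0.2]
finance_email = [0.8, 0.2, 0.1, 0.3]
picnic_invite = [0.0, 0.9, 0.7, 0.1]

# Similar topics score high, unrelated topics score low.
print(cosine_similarity(budget_memo, finance_email))
print(cosine_similarity(budget_memo, picnic_invite))
```

The first score comes out much higher than the second, which is all "related" means here.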
Use cases
When searching across email dumps or pieces of legislation, document similarity can make it easy to find related texts. In some ways it’s a simple alternative to training and using a classifier.
The idea of searching for similar concepts or documents as opposed to specific phrases is called “semantic search.”
Modern document similarity models can even match across languages! If you search for “I am embezzling funds please don’t tell anyone” in English, a multilingual model can pull up matching documents written in other languages.
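Semantic search boils down to: embed the query, embed every document, and sort the documents by how close they are to the query. The sketch below shows that loop. The `semantic_search` and `cosine` functions are hypothetical helpers, and `FAKE_EMBEDDINGS` holds hand-written vectors standing in for what a real multilingual model would produce, so the scores only illustrate the mechanics.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def semantic_search(query, documents, embed, top_k=2):
    """Return the top_k (score, document) pairs most similar to the query.

    `embed` is any function that maps a string to a vector of floats.
    """
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# ASSUMPTION: hand-written 3-number vectors, not real model output.
FAKE_EMBEDDINGS = {
    "I am embezzling funds please don't tell anyone": [0.9, 0.1, 0.1],
    "Estoy desviando dinero de la empresa, guarda el secreto": [0.8, 0.2, 0.1],
    "The quarterly picnic is moved to Friday": [0.1, 0.9, 0.2],
    "Meeting notes: office plant watering schedule": [0.1, 0.3, 0.9],
}

query = "I am embezzling funds please don't tell anyone"
documents = [text for text in FAKE_EMBEDDINGS if text != query]

results = semantic_search(query, documents, FAKE_EMBEDDINGS.__getitem__, top_k=2)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

Passing `embed` in as a function means you can swap the fake lookup for a real model later without touching the ranking code.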
Try it out
Couldn’t find a good example yet! For now you can use this Colab notebook.
Models
Popular models
You can use sentence-transformers/all-MiniLM-L6-v2 for sentence similarity, but it only works with English (similarly, shibing624/text2vec-base-chinese only works with Chinese). If you scroll around on the sentence similarity page you’ll eventually find sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, a multilingual model.
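If you want to try these models from Python, a rough sketch with the `sentence-transformers` library might look like the following. The `pick_model` and `embed_texts` helpers and the `"en"`/`"zh"` language codes are made up for this example, and it assumes all three models load through `SentenceTransformer`; only the model names come from the text above.

```python
# Model names taken from the text above; the language codes are illustrative.
MODEL_BY_LANGUAGE = {
    "en": "sentence-transformers/all-MiniLM-L6-v2",
    "zh": "shibing624/text2vec-base-chinese",
}
MULTILINGUAL_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

def pick_model(language=None):
    """Hypothetical helper: pick a model name, defaulting to the multilingual one."""
    return MODEL_BY_LANGUAGE.get(language, MULTILINGUAL_MODEL)

def embed_texts(texts, language=None):
    """Encode a list of strings into embedding vectors with the chosen model.

    Requires `pip install sentence-transformers`; the model is downloaded
    the first time it is used.
    """
    # Imported here so pick_model works even without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(pick_model(language))
    return model.encode(texts)  # one vector per input text
```

Something like `embed_texts(["first document", "second document"], language="en")` would then give you one vector per text, ready to compare.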