Transcription is one of the most useful techniques to apply to audio. It’s also fiendishly difficult, although models have gotten much better in recent years.
Use cases
If speech can be accurately converted into text, everything we can do with text we can now do with speech.
Try it out
Here we will try the large version of the Whisper model, which should give excellent results across many languages.
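A minimal sketch of what trying it out could look like, assuming the Hugging Face `transformers` library (plus a backend such as PyTorch, and ffmpeg for decoding audio files) is installed. The checkpoint name `openai/whisper-large-v3` is an assumption, since the text above only says "the large version", and `transcribe` is a hypothetical helper name chosen for illustration.

```python
import sys

# Assumption: the text only says "the large version" of Whisper;
# this particular checkpoint is one possible choice.
MODEL_ID = "openai/whisper-large-v3"


def transcribe(audio_path, model_id=MODEL_ID):
    """Return the transcribed text for a single audio file."""
    # Imported lazily so this file still loads where transformers is missing.
    from transformers import pipeline

    # chunk_length_s splits long recordings into 30-second windows.
    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,
    )
    result = asr(audio_path)
    return result["text"]


# Only runs when a file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(transcribe(sys.argv[1]))
```

Run it with the path to an audio file as the only argument. The first run downloads the model weights, which are several gigabytes for the large checkpoint.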
Models
Popular models
Whisper is without a doubt the most popular transcription model at the moment. And not only that: it’s really, really good. It supports a plethora of languages, too.
One thing to note is that Whisper comes in different sizes: the larger the model, the more accurate it tends to be. But larger models are also slower and need more resources to run, which can mean they cost more. You’ll need to figure out what’s best for you based on your use case and the quality of your source audio.
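To make the size trade-off concrete, here is a rough sketch that picks the largest Whisper size fitting a memory budget. The parameter counts and memory figures are approximate values as listed in the Whisper project README, so treat them as assumptions to verify for your own setup. `WHISPER_SIZES` and `pick_model` are hypothetical names used only for illustration.

```python
# Approximate figures: (name, parameters in millions, memory needed in GB).
# Assumption: taken from the Whisper README; verify before relying on them.
WHISPER_SIZES = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def pick_model(memory_budget_gb):
    """Return the largest size that fits the budget, or None if none fits."""
    best = None
    # The list is ordered smallest to largest, so the last fit wins.
    for name, _params_m, needed_gb in WHISPER_SIZES:
        if needed_gb <= memory_budget_gb:
            best = name
    return best


print(pick_model(6))  # a 6 GB budget fits "medium" but not "large"
```

Memory is only one constraint. In practice you would also weigh latency and how clean your source audio is, since a smaller model may be good enough on clear recordings.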
You might also look at noise removal before sending your audio to be transcribed.
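As a very rough illustration of the idea, the sketch below implements a simple energy-based noise gate: it silences frames whose loudness falls below a threshold. This is a stand-in technique, not a proper denoiser, and the function name, frame size, and threshold are all assumptions chosen for the example. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def noise_gate(samples, frame_size=400, threshold=0.01):
    """Zero out frames whose RMS energy is below `threshold`.

    `samples` is a list of floats in [-1.0, 1.0]. The defaults are
    illustrative assumptions, not tuned values.
    """
    cleaned = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        # Root-mean-square energy of this frame.
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            cleaned.extend(frame)  # loud enough: keep as-is
        else:
            cleaned.extend([0.0] * len(frame))  # too quiet: silence it
    return cleaned


quiet_hiss = [0.001] * 400
speech_like = [0.5, -0.5] * 200
gated = noise_gate(quiet_hiss + speech_like)
```

Here the first 400 samples (the quiet hiss) are replaced with silence while the louder segment passes through unchanged. A gate like this only removes noise between louder passages; noise that overlaps with speech needs a real denoising method.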
State of the art
I should look this up once I’m not on a plane.