Transcription APIs - Michele Ong

The disadvantage of these services is that the data you get out is raw. To do anything with it, you need to transform it into another format. If you need to make corrections, you have three options. You can use the full transcript text, which is usually included but not diarised. You can transform the data into un-timed, diarised text. Or you can use an editor that also handles the timing data, since timecodes are attached to each word for synchronisation.
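The un-timed diarised option can be sketched as collapsing word-level results into one line per speaker turn. The `Word` shape and the `to_diarised_text` name are hypothetical. Each service returns its own JSON schema, so the raw response would first need mapping onto these fields.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    # Hypothetical normalised shape; each service uses its own JSON field names.
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    speaker: int


def to_diarised_text(words: List[Word]) -> str:
    """Collapse timed, speaker-tagged words into un-timed 'Speaker N: ...' lines."""
    lines: List[str] = []
    current_speaker: Optional[int] = None
    current_words: List[str] = []
    for word in words:
        # Flush the previous turn whenever the speaker changes.
        if word.speaker != current_speaker and current_words:
            lines.append(f"Speaker {current_speaker}: {' '.join(current_words)}")
            current_words = []
        current_speaker = word.speaker
        current_words.append(word.text)
    # Flush the final turn.
    if current_words:
        lines.append(f"Speaker {current_speaker}: {' '.join(current_words)}")
    return "\n".join(lines)


words = [Word("Hi", 0.0, 0.3, 0), Word("there.", 0.3, 0.6, 0), Word("Hello!", 0.9, 1.2, 1)]
print(to_diarised_text(words))
```

The timings are simply dropped here. That is fine for editing text, but it means corrections can no longer be synced back to the audio.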

Google Cloud Speech to Text

Accuracy isn’t bad for short clips of audio. Transcripts need only minor corrections, though it stumbles over more technical terms.

The API offers 60 minutes of audio processing per month for free before billing kicks in, which makes it ideal for generating SRTs for short clips. Its billable rate appears to be the highest of the services I’ve looked at.
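Each word in the output carries its own timecode. An SRT file can therefore be built by grouping consecutive words into numbered cues. This is a minimal sketch that assumes the words have already been extracted from the API response as `(text, start_seconds, end_seconds)` tuples. The grouping rule (a fixed `max_words` per cue) and the function names are illustrative choices, not part of any service’s API.

```python
from typing import List, Tuple

# Hypothetical input shape: (text, start_seconds, end_seconds) for each word.
Word = Tuple[str, float, float]


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group consecutive words into numbered SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w[0] for w in chunk)
        start, end = chunk[0][1], chunk[-1][2]  # cue spans first word start to last word end
        index = i // max_words + 1              # SRT cues are numbered from 1
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


words = [("Hello", 0.0, 0.4), ("world", 0.5, 0.9), ("again", 1.2, 1.6)]
print(words_to_srt(words, max_words=2))
```

A fixed word count is the simplest grouping rule. Splitting on pauses between words or on sentence punctuation would likely give more natural subtitle breaks.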

DeepGram

Accuracy isn’t bad for long audio; I have used it for full episodes. The transcripts still need correction, but it handles some of the more technical terms I expected it to struggle with. Speaker diarisation is mostly okay. It occasionally detects an extra speaker or misattributes a line, but it isn’t terrible to work with, given that I need to make corrections anyway.

The free trial includes ‘12,000 free minutes’, which I think works out to around USD$150 in credit. Once that runs out, the billable rate including speaker diarisation is around USD$1.50/hour.

AssemblyAI

I haven’t done a side-by-side comparison with DeepGram, but accuracy for both content and speaker diarisation seems about the same. Pretty usable.

When I signed up, they had a free trial plan with 3 hrs/month. The copy for this isn’t on the site anymore, so I’m not sure what the free trial currently includes. Given my release schedule and average episode lengths, though, it currently works for me. The billable rate with speaker diarisation is around USD$3.00/hour.
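A rough monthly estimate for comparing these rates against a release schedule is billable hours multiplied by the hourly rate. The sketch below plugs in the approximate figures quoted in this post. `monthly_cost` is a hypothetical helper, the 4 hours of audio per month is an assumed example workload, and actual pricing may have changed since this was written.

```python
def monthly_cost(hours: float, rate_per_hour: float, free_hours: float = 0.0) -> float:
    """Estimate a month's bill in USD: only hours beyond the free allowance are charged."""
    billable_hours = max(0.0, hours - free_hours)
    return round(billable_hours * rate_per_hour, 2)


# Assumed workload: 4 hours of audio per month.
deepgram = monthly_cost(4, 1.50)                   # after trial credit runs out, ~USD$1.50/hour
assemblyai = monthly_cost(4, 3.00, free_hours=3)   # ~USD$3.00/hour, 3 free hrs/month
print(deepgram, assemblyai)
```

With a monthly free allowance, the service with the higher hourly rate can still come out cheaper for a small, regular workload.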

Mozilla DeepSpeech

With the pre-trained models, accuracy is really awful and not worth the time needed to make corrections. It could be better with some fine-tuning and training, but I don’t have the time to look into it.

The only benefit of this one is that it runs entirely offline, which means you’re not sending your data anywhere.