While I would love to offer captions and transcripts on full episodes in a variety of languages for accessibility, this isn't presently possible with my availability[1] or budget. I would, however, be incredibly appreciative if a kind patron or sponsor would like to donate towards a subscription to Descript, AssemblyAI, or DeepGram so that this podcast and YouTube channel can reach more prospective and current #WomenInSTEM and other STEAM enthusiasts. 🙏

Find me on Patreon.

I've trialled a few different services for STT and transcription of the episodes. None of these experiments are at a stage where I can fully incorporate them into my workflow.

Update: I now have Descript, which handles the transcripts for me. Accuracy isn't perfect, but it's good enough, and the editing interface makes corrections quite straightforward. Crosstalk and laughter can cause problems. You can resolve this by replacing the offending part with noise floor, but the interface resists attempts to forcibly change the transcript, and doing so can introduce additional errors.

Transcription Only

The disadvantage of these services is that the data you get out is raw. If you need to do anything with it, you need to transform it into another format. If you need to make corrections, you have three options: use the full transcription text, which is usually included but not diarised; transform the data into un-timed diarised text; or use an editor that can also manage the timing, since timecodes are attached per word for synchronisation.
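As a rough illustration of the second option, here is a minimal Python sketch that collapses per-word timed output into un-timed diarised text. The field names (`word`, `start`, `end`, `speaker`) are assumptions for the example. Each service uses its own response schema, so the real output would need mapping into this shape first.

```python
from itertools import groupby

# Hypothetical word-level data: the field names are assumed for this sketch
# and do not match any particular service's response format.
words = [
    {"word": "Welcome", "start": 0.00, "end": 0.42, "speaker": 0},
    {"word": "back.", "start": 0.42, "end": 0.80, "speaker": 0},
    {"word": "Thanks", "start": 1.10, "end": 1.45, "speaker": 1},
    {"word": "for", "start": 1.45, "end": 1.60, "speaker": 1},
    {"word": "having", "start": 1.60, "end": 1.95, "speaker": 1},
    {"word": "me.", "start": 1.95, "end": 2.20, "speaker": 1},
]


def to_diarised_text(words, names=None):
    """Collapse timed words into un-timed 'Speaker: text' turns."""
    names = names or {}
    turns = []
    # groupby starts a new group whenever the speaker changes between
    # consecutive words, so each group is one uninterrupted turn.
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        label = names.get(speaker, f"Speaker {speaker}")
        text = " ".join(w["word"] for w in group)
        turns.append(f"{label}: {text}")
    return "\n\n".join(turns)


print(to_diarised_text(words, names={0: "Host", 1: "Guest"}))
```

Note that this throws the timecodes away, so text corrected this way can't be re-synchronised with the audio without re-aligning it.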

Google Cloud Speech to Text

Accuracy isn't bad for short lengths of audio. It requires only minor corrections, but stumbles over more technical terms.

The API offers 60 minutes of audio processing per month for free before billing kicks in, which makes it ideal for generating SRTs for short clips. The billable rate looks to be the highest of the services I've looked at.
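For anyone curious what generating an SRT from word-level timings involves, here is a minimal Python sketch. It assumes the words have already been pulled out of the API response into a simple list with hypothetical `word`, `start` and `end` fields (times in seconds). It doesn't call the Google API itself, and the eight-words-per-cue limit is an arbitrary choice.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timecode, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=8):
    """Chunk timed words into numbered SRT cues of at most max_words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        # An SRT cue is: index, time range, text, then a blank line.
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical sample data, not real API output.
sample = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "and", "start": 0.4, "end": 0.55},
    {"word": "welcome.", "start": 0.55, "end": 1.2},
]

print(words_to_srt(sample, max_words=2))
```

A real subtitle file would probably break cues on sentence boundaries or pauses rather than a fixed word count, but the file structure is the same.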

Mozilla DeepSpeech

With the pre-trained models, really awful. Not worth the time needed to make corrections. Could be better with some finessing and training, but I don't have the time to look into it.


DeepGram

Accuracy isn't bad for long audio. I have used it for full episodes. These still need correction, but it performs well on some sections of more technical terms which I thought it would struggle with. Speaker diarisation is mostly okay. It occasionally detects an extra speaker or misattributes a line, but that's not terrible to work with given I need to make corrections anyway.

The free trial includes '12,000 free minutes' which is around USD$150 in credit, I think. Once that runs out, the billable rate including speaker diarisation is around USD$1.50/hour.


AssemblyAI

I haven't done a side-by-side comparison with DeepGram, but accuracy seems about the same for both content and speaker diarisation. Pretty usable.

At the time I signed up they had a free trial plan with 3 hrs/month. The copy for this isn't on the site anymore, so I'm not sure what the current entitlements of the free trial are, but given my release schedule and average episode lengths, this currently works for me. The billable rate with speaker diarisation is around USD$3.00/hour.
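To put those numbers in context, here is a back-of-the-envelope Python sketch of a monthly bill. The rate and free allowance are the approximate figures quoted above, the episode count and length are made-up examples, and it assumes the free hours still apply once billing starts, which may not be how the plan actually works.

```python
def monthly_cost(hours_of_audio, rate_per_hour, free_hours=0.0):
    """Estimate a monthly transcription bill in USD.

    Assumes the free allowance is deducted before billing, which is an
    assumption made for illustration, not a confirmed plan detail.
    """
    billable_hours = max(0.0, hours_of_audio - free_hours)
    return billable_hours * rate_per_hour


# Hypothetical release schedule: 4 episodes a month at 1.25 hours each.
hours = 4 * 1.25

print(monthly_cost(hours, rate_per_hour=3.00, free_hours=3))
print(monthly_cost(hours, rate_per_hour=3.00))
```

With these example figures the allowance covers three of the five hours, so only two are billed.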



Descript

This service is richly featured. It offers transcription and video editing functionality. Accuracy is about the same as DeepGram and AssemblyAI, but Descript also has an excellent editor that allows you to correct the transcripts and maintain time synchronisation.

The free plan gives you 3 hrs, and the priced tiers have fixed hours per tier, but the service offers other video/audio editing features as well. Even if you just consider cost per hour, it's cheaper than the straight transcription services.

And just for reference, you can BYO transcription and that won't count towards your transcription hours but you'll still be able to use the other plan entitlements.


autoEdit

I plan to write a longer post about this later. Even though I can't use it for my purposes, I think it's wonderful, and I have no doubt that this work is what a lot of the popular video transcript editing services are based on. In fact, I've seen an article by one service that says they dissected it as part of their development process.

autoEdit is an open source application with built-in integration with DeepSpeech (sketchy) and AssemblyAI for transcription but also provides a paper-editing and qualitative analysis workflow for crafting interviews. The paper-edits can also be exported as an EDL (edit decision list) and used in other video editing software.

Conceptually, it's very good. I very much like the paper-editing and qualitative analysis integration side of it. Unfortunately, the transcript editor isn't very good at managing corrections with time synchronisation, and fixing diarisation attributions also causes it to fall out of time.

The paper-editing side is great and allows you to tag sections of the transcript, which you can use to identify themes or key quotes that are linkable to other recordings in a project. However, due to the way projects and transcripts are stored in the backend, the app runs very slowly once you've loaded a bunch of transcripts into a single project, and the paper-edit feature fails to load at all.

If you have small projects with only a few interviews in each, this is great. If you have corrected transcripts, with a little scripting, you can also manually import them into the app so that you can use the paper-editing functionality.

To be Investigated

Some cool things I'll look into later when time and resources allow.

  • BBC GitHub

    BBC has a great open source initiative, and they have a bunch of utilities related to video and audio editing.

  • BBC Kaldi

    BBC's Kaldi STT engine that I think they run BBC Archives through. It's free to access if you are using it for not-for-profit community projects. ❤️

  1. I record in my dining room, and my guests are recorded via video call wherever they are, using headsets and mics (some onboard), which can make for unpredictable audio quality. Add to that the use of technical and scientific terminology and we have a melting pot of factors that can reduce transcription accuracy, which means I have to manually correct the transcripts. This is time-consuming.


Published November 09, 2022