Transcribing Podcasts with Microsoft Speech API
Introduction
I enjoy listening to podcasts in a wide range of topics that include but are not limited to politics, software development, history, comedy and true crime. While some of the more entertainment related podcasts are best enjoyed through audio, those that are related to software development or other type of hands-on topics would benefit greatly from having a transcript. Having a transcript allows me to go back after having listened to an interesting discussion and look directly at the content I am interested in without having to listen to the podcast again. This however is not always feasible given that it costs both time and money to produce a transcript. Fortunately, there are tools out there such as Google Cloud Speech and Microsoft Speech API which allow users to convert speech to text. For this writeup, I will be focusing on the Microsoft Speech API. Because podcasts tend to be long-form, I will be using the C# client library because it allows for long audio (greater than 15 seconds) to be transcribed. The purpose of this exercise is to create a console application that takes audio segments, converts them to text and stores the results in a text file with the goal of evaluating how well the Microsoft Speech API works. The source code for the console application can be found on GitHub
Prerequisites
Install/Enable Windows Subsystem for Linux
- Open Powershell as Administrator and input
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
- Restart your computer
- Open the Microsoft Store and install Ubuntu Linux distribution
- Once installed, click
Launch
- Create your LINUX user account (Keep in mind that this is not the same as your Windows account therefore it can be different).
Install ffmpeg
Once Ubuntu is installed on your computer and you have created your LINUX user account, it's time to install ffmpeg
. This will allow us to convert our file from mp3
which is usually the format podcasts are in to wav
which is the format accepted by the Microsoft Speech API
sudo apt install -y ffmpeg
Get Bing Speech API Key
In order to use the Microsoft Speech API, an API key is required. This can be obtained using the following steps.
- Navigate to the Azure Cognitive Services page.
- Select the
Speech
tab - Click
Get API Key
and follow the instructions - Once you have an API key, make sure to store it somewhere like a text file in your computer for future use.
File Prep
Before transcribing the files, they need to be converted to wav
format if they are not already in it. In order to ease processing, they should be split into segments as opposed to having the API process an entire multi megabyte file. To do all this, inside the Ubuntu shell, first navigate to your Documents
folder and create a new directory for the project.
cd /mnt/c/Users/<USERNAME>/Documents
mkdir testspeechapi
USERNAME
is your Windows user name. This can be found by typing the following command into the Windows CMD
prompt
echo %USERNAME%
Download Audio Files
One of the podcasts I listen to is Talk Python To Me by Michael Kennedy. For the purpose of this exercise, aside from having great content, all of the episodes are transcribed and a link is provided to the mp3
file.
The file I will be using is from episode #149, but it can easily be any of the episodes.
In the Ubuntu shell, download the file into the testspeechapi
directory that was recently created using wget
.
wget https://talkpython.fm/episodes/download/149/4-python-web-frameworks-compared.mp3 -O originalinput.mp3
Download Transcripts
The transcripts can be found in GitHub at this link. The original transcript will allow me to compare it to the output of the Microsoft Speech API and evaluate the accuracy.
We can download this file into the testspeechapi
directory just like the audio file
wget https://raw.githubusercontent.com/mikeckennedy/talk-python-transcripts/master/transcripts/149.txt -O originaltranscript.txt
Convert MP3 to WAV
Now that we have both the original audio file and transcript, it's time to convert the format from mp3
to wav
. To do this, we can use ffmpeg
. In the Ubuntu shell, enter the following command.
ffmpeg -i originalinput.mp3 -f wav originalinput.wav
Create Audio Segments
Once we have the proper format, it's time to make the files more manageable for processing. This can be done by splitting them up into equal segments using ffmpeg
. In this case I'll be splitting it up into sixty second segments and storing them into a directory called input
with the name filexxxx.wav
. The %04d
indicates that there will be 4 digits in the file name. Inside the Ubuntu shell and testspeechapi
directory enter the following commands.
mkdir input
ffmpeg -i originalinput.wav -f segment -segment_time 60 -c copy input/file%04d.wav
Building The Console Application
To get started building the console application, we can leverage the Cognitive-Speech-STT-ServiceLibrary sample program.
Download The Sample Program
The first step will be to clone the sample program from GitHub onto your computer. In the Windows CMD
prompt navigate to the testspeechapi
folder and enter the following command.
git clone https://github.com/Azure-Samples/Cognitive-Speech-STT-ServiceLibrary
Once the project is on your computer, open the directory and launch the SpeechClient.sln
solution in the sample
directory
When the solution launches, open the Program.cs
file and begin making modifications.
Reading Files
To read files, we can create a function that will return the list of files in the specified directory.
private static string[] GetFiles(string directory)
{
string[] files = Directory.GetFiles(directory);
return files;
}
Process Files
Once we have the list of files, we can then process each file individually using the Run
method. To do so, we need to make a slight modification to our Main
method so that it iterates over each file and calls the Run
method on it. To store the responses from the API, we'll also need a StringBuilder
object which is declared at the top of our Program.cs
file.
finalResponse = new StringBuilder();
string[] files = GetFiles(args[0]);
foreach(var file in files)
{
p.Run(file,"en-us",LongDictationUrl,args[1]).Wait();
Console.WriteLine("File {0} processed",file);
}
Transcribe Audio
The Run
method can be left intact. However, the Run
method uses the OnRecognitionResult
method to handle the result of API responses. In the OnRecognitionResult
method, we can remove almost everything that is originally there and replace it. The response from the API returns various results of potential phrases as well as a confidence value. Generally, most of the phrases are alike and the first value is good enough for our purposes. The code for this part will take the response from the API, append it to a StringBuilder
object and return when completed.
public Task OnRecognitionResult(RecognitionResult args)
{
var response = args;
if(response.Phrases != null)
{
finalResponse.Append(response.Phrases[0].DisplayText);
finalResponse.Append("\n");
}
return CompletedTask;
}
Output Transcribed Speech
When all the audio files have been processed, we can save the final output to a text file in the testspeechapi
directory. This can be done with the SaveOutput
function to which we pass in a file name and the StringBuilder
object that captured responses from the API.
private static void SaveOutput(string filename,StringBuilder content)
{
StreamWriter writer = new StreamWriter(filename);
writer.Write(content.ToString());
writer.Close();
}
SaveOutput
can then be called from our Main
method like so.
string username = Environment.GetEnvironmentVariable("USERNAME", EnvironmentVariableTarget.Process);
SaveOutput(String.Format(@"C:\\Users\\{0}\\Documents\\testspeechapi\\apitranscript.txt",username), finalResponse);
The final Main
method should look similar to the code below
public static void Main(string[] args)
{
// Send a speech recognition request for the audio.
finalResponse = new StringBuilder();
string[] files = GetFiles(args[0]);
var p = new Program();
foreach (var file in files)
{
p.Run(file, "en-us", LongDictationUrl, args[1]).Wait();
Console.WriteLine("File {0} processed", file);
}
string username = Environment.GetEnvironmentVariable("USERNAME", EnvironmentVariableTarget.Process);
SaveOutput(String.Format(@"C:\\Users\\{0}\\Documents\\testspeechapi\\apitranscript.txt",username), finalResponse);
}
Output
The program will save the output of the StringBuilder
object finalResponse
to the file apitranscript.txt
.
Prior to running the program it needs to be built. In Visual Studio change the Solutions Configurations option from Debug
to Release
and build the application.
To run the program, navigate to the C:\Users\%USERNAME%\Documents\testspeechapi\Cognitive-Speech-STT-ServiceLibrary\sample\SpeechClientSample\bin\Release
directory in the Windows CMD
prompt and enter the following command and pass in the input
directory where all the audio segments are stored and your API Key.
SpeechClientSample.exe C:\\Users\\%USERNAME%\\Documents\\testspeechapi\\input <YOUR-API-KEY>
This process may take a while due to the number of files being processed, so be patient.
Sample Output
Original
Are you considering getting into web programming? Choosing a web framework like Pyramid, Flask, or Django can be daunting. It would be great to see them all build out the same application and compare the results side-by-side. That's why when I heard what Nicholas Hunt-Walker was up to, I had to have him on the podcast. He and I chat about four web frameworks compared. He built a data-driven web app with Flask, Tornado, Pyramid, and Django and then put it all together in a presentation. We're going to dive into that right now. This is Talk Python To Me, Episode 149, recorded January 30th, 2018. Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy, follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. Nick, welcome to Talk Python.
API Response
Are you considering getting into web programming cheese in the web framework like plaster. Django can be daunting. It would be great to see them. Although that the same application and compare the results side by side. That's Why? When I heard. But Nicholas Hunt. Walker was up to, I had him on the podcast. He night chat about 4 web frameworks, compared he built a data driven web app with last tornado. Peermade Angangueo and then put it all together in a presentation. We have dive into that right now. This is talk by the to me, I was 149 recorded January 30th 2018. Welcome to talk python to me a weekly podcast on Python Language the library ecosystem in the personalities. Did you host Michael Kennedy? Follow me on Twitter where I'm at in Kennedy keep up with the show in the past episodes at talk python. Damn info, the show on Twitter via at Popeyes on. Nick Welcome to talk by phone.
Conclusion
In this writeup, I converted an mp3
podcast file to wav
format and split it into sixty second segments using ffmpeg
. The segments were then processed by a console application which uses the Microsoft Speech API to convert speech to text and the results were saved to a text file. When compared to the original transcript, the result produced by the API is inconsistent and at times incomprehensible. This does not mean that overall the API is unusable or inaccurate as many things could have contributed to the inaccuracy. However, the results were not as good as I would have hoped for.