Speech transcription is a problem that’s commonly solved with expensive human workers. With machine learning though, computers have caught up, and AWS’s AI-powered Speech Recognition Toolkit is now available as a service for your application to use.
AWS Transcribe Converts Audio Files in S3
Transcribe is simple: give it an audio file stored in S3, and it churns through it and gives you a transcript. You are charged based on the length of the audio, at a rate of $0.0004 per second. A two-hour boardroom meeting would cost $2.88 to transcribe, while a quick two-minute video costs only about $0.05.
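The arithmetic is just audio length multiplied by the per-second rate. Here is a minimal sketch, assuming the $0.0004 rate quoted above still applies; the function name transcription_cost is made up for illustration, and any minimum charges or free-tier allowances are ignored.

```python
from decimal import Decimal

# Rate quoted in this article; check the current pricing page before relying on it.
RATE_PER_SECOND = Decimal("0.0004")


def transcription_cost(seconds):
    """Return the estimated cost in dollars for `seconds` of audio."""
    return RATE_PER_SECOND * Decimal(seconds)


print(transcription_cost(2 * 60 * 60))  # two-hour meeting: 2.8800
print(transcription_cost(2 * 60))       # two-minute video: 0.0480
```

Decimal is used instead of float so the results come out as exact currency amounts rather than values like 2.8800000000000003.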
Transcribe is pretty fast, but it’s not latency optimized. It’s well suited for after-the-fact transcription, such as transcribing customer calls and subtitling uploaded video. If you need real-time speech-to-text for a conversational interface, you can use Amazon Lex, a service for building interactive chat bots on the same technology that powers Alexa.
To get started, head over to the AWS Transcribe Console. You can press “Start Streaming” to record from your device’s microphone and to test the service. It’s pretty neat, but you’re likely after more than this.
From the sidebar, select “Transcription Jobs” and click “Create Job.” The job serves as a method of automating transcription. Each job works on one file at a time; to automate the transcription of multiple files, you need to create a separate job for each one from the command line.
Give Transcribe the S3 path to the audio file you’d like to convert. You can optionally set the format and sample rate by hand, though it should automatically recognize most common ones.
Once you click create, the transcription begins. The newly created job appears in the list, and once it’s done, you can download the transcribed text.
You probably also want to know how to work with Transcribe from the command line, as creating jobs by hand in the console is tedious and only suitable if you’re processing one large audio file at a time.
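With the AWS CLI, the start-transcription-job command takes a job name, a language code, and the S3 location of the audio file. It starts the job and outputs some JSON telling you whether it was created successfully. You can check the status of a job programmatically with get-transcription-job.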
If it’s finished, TranscriptionJob.TranscriptionJobStatus is set to “COMPLETED,” and TranscriptionJob.Transcript.TranscriptFileUri holds the URL of the transcript. You can download the file directly with curl and a little jq processing.
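The same start-then-poll flow can also be scripted from Python with boto3, whose Transcribe client exposes start_transcription_job and get_transcription_job. The following is a rough sketch, not a definitive implementation: the helper names start_job and wait_for_job, the polling interval, and the retry limit are all assumptions made for illustration. The client is passed in as an argument, so you would create it yourself with boto3.client("transcribe").

```python
import time


def start_job(client, job_name, media_uri, language_code="en-US"):
    """Start a transcription job and return its initial status."""
    response = client.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode=language_code,
        Media={"MediaFileUri": media_uri},  # e.g. an s3:// URI
    )
    return response["TranscriptionJob"]["TranscriptionJobStatus"]


def wait_for_job(client, job_name, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll until the job completes, then return the transcript file URL."""
    for _ in range(max_polls):
        job = client.get_transcription_job(
            TranscriptionJobName=job_name
        )["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription failed"))
        sleep(poll_seconds)  # still QUEUED or IN_PROGRESS
    raise TimeoutError(f"job {job_name} did not finish after {max_polls} polls")
```

Job names must be unique within your account, so a script that processes many files needs to generate a distinct name for each one.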
Note that the transcript file is JSON, and it contains the full transcript plus a confidence score for each word and its alternatives. Unless you want all the confidence values, you can filter them out by piping the output through jq '.results.transcripts' as the final step.
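If you want to use the confidence values rather than discard them, the transcript JSON is easy to walk in Python. This sketch assumes the usual layout of the file, with results.transcripts holding the full text and results.items holding one entry per word or punctuation mark; the function name summarize_transcript and the 0.8 threshold are arbitrary choices for illustration.

```python
import json


def summarize_transcript(raw_json, min_confidence=0.8):
    """Return (full_text, uncertain_words) from a Transcribe result file."""
    results = json.loads(raw_json)["results"]
    text = " ".join(t["transcript"] for t in results["transcripts"])

    uncertain = []
    for item in results.get("items", []):
        if item.get("type") != "pronunciation":
            continue  # skip punctuation entries
        best = item["alternatives"][0]
        confidence = float(best["confidence"])  # stored as a string in the file
        if confidence < min_confidence:
            uncertain.append((best["content"], confidence))
    return text, uncertain
```

The list of low-confidence words is a reasonable starting point for deciding which parts of a transcript deserve a human review.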
You can also automatically transcribe audio files using Lambda functions. Lambda is a service that can run code in response to AWS events, such as new items being uploaded to S3. It’s serverless, and you only pay for execution time; because Lambda isn’t doing the actual processing, just creating a new job on upload, the cost should be trivial.
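To give a sense of how little code is involved, here is a rough sketch of such a handler. It is not the code of any published application: the environment variable names OUTPUT_BUCKET and LANGUAGE_CODE and the helper build_job_request are hypothetical, and the function assumes it is triggered by S3 "object created" events on the input bucket with a role that is allowed to call Transcribe.

```python
import os
import re
import uuid
from urllib.parse import unquote_plus


def build_job_request(bucket, key, language_code, output_bucket, suffix):
    """Build the keyword arguments for start_transcription_job."""
    # Job names may only contain letters, digits, '.', '_' and '-'.
    safe_name = re.sub(r"[^0-9a-zA-Z._-]", "_", key)[:150]
    return {
        "TranscriptionJobName": f"{safe_name}-{suffix}",
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "OutputBucketName": output_bucket,
    }


def lambda_handler(event, context, client=None):
    """Start one Transcribe job per object in the S3 event."""
    if client is None:
        import boto3  # bundled with the Lambda Python runtime
        client = boto3.client("transcribe")

    started = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        request = build_job_request(
            bucket,
            key,
            os.environ.get("LANGUAGE_CODE", "en-US"),
            os.environ["OUTPUT_BUCKET"],
            uuid.uuid4().hex[:8],  # keeps job names unique across uploads
        )
        client.start_transcription_job(**request)
        started.append(request["TranscriptionJobName"])
    return {"started": started}
```

The handler returns as soon as the jobs are created, which is why its execution time, and therefore its cost, stays small regardless of how long the audio is.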
You can code it yourself if you’ve used Lambda before, but luckily there’s a prebuilt application in the AWS Serverless Application Repository that can handle this exact job for you. It’s called s3-lambda-transcribe-audio-to-text-s3, and you may have to click “Show apps that create custom IAM roles” to find it.
Create a new app from this template, and specify the input bucket and output bucket. Make sure the output bucket exists and that the input bucket doesn’t, as the app will create the input bucket for you.
You’ll also want to enter the language of the audio file. en-US is US English; for anything else, you can find the code in AWS’s docs.
Deploy the application, and you should see a newly created bucket. When you drop an audio file in this bucket, Lambda creates a new Transcribe job for you.
If the app doesn’t work, make sure you allowed it to create its IAM role, and check that the role has permission to use Transcribe and the S3 buckets it needs.