Speech to Text Software vs. Human Transcriber: Fight!



We’ve been told countless times that the end of human transcription services is near—that voice recognition devices and speech to text software are going to take our jobs and take over the world!


Not so fast, Alexa!


There’s no denying that voice recognition technology has evolved into a super AI that can recognize human speech and convert it into written form. However, it has not yet reached human mode, much less God mode. Although accuracy has improved by leaps and bounds, voice recognition apps still can’t quite crack human language.


You see, human language is too complex for any voice recognition app to fully understand. Don’t get us wrong: speech to text apps are very impressive, and they do their job quite well when dealing with common, everyday words. The more sophisticated apps are fantastic at picking up complex words when trained to do so. But they still don’t match up to human transcribers.


How Speech Recognition Software Works


The funny thing about speech recognition software is that you can have an actual conversation with a non-living object and people won’t think you’re crazy. So it’s totally fine if you have developed a friendship with Siri, Alexa, or even Google Translate.


Isn’t it fascinating that computers and devices can now “understand” human sounds and convert them to words and phrases that actually make sense? Thanks to the power of Artificial Intelligence, we can now interact with devices and use them as a tool to improve our productivity in ways we couldn’t have imagined decades ago.


With speech recognition technology, the audio is broken down into individual sounds and converted into digital format. The software uses complex algorithms and sophisticated models to find the most probable word fit in the particular language, drawing the best match from its massive built-in dictionary.
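As a very rough illustration of that “best fit” step (the candidate words and scores below are entirely made up), a recognizer weighs how well the sounds match each word against how likely that word is in context:

```python
# Toy illustration (not a real recognizer): picking the most probable word
# for one chunk of audio from a list of candidates.

# Hypothetical candidates the acoustic side thinks it heard, each with a
# made-up score for how well the sounds match the word.
acoustic_candidates = {"their": 0.40, "there": 0.35, "they're": 0.25}

# Made-up "dictionary" scores standing in for a language model: how likely
# each word is given the words already recognized.
language_scores = {"their": 0.10, "there": 0.60, "they're": 0.30}

def best_fit(acoustic, language):
    """Combine both scores and return the most probable word."""
    return max(acoustic, key=lambda w: acoustic[w] * language.get(w, 0.0))

print(best_fit(acoustic_candidates, language_scores))  # -> "there"
```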

Of course, this is an oversimplification of a complex process. Let’s take Alexa, for instance. How on earth is she able to recognize sounds, understand the words, and respond with a meaningful and valuable output?


With the way Alexa responds to your command, you would think that there’s a real person named Alexa inside your Amazon device. Everything is made possible by machine learning and neural networks, with the help of natural language processing.


Alexa responds to a trigger word, which is “Alexa”. When she detects this, she knows that a command is coming and eagerly waits for it.


Using a complex algorithm, Alexa is programmed to detect whether what you said matches the word “Alexa”. Using speech recognition, she then picks up the command that follows, and the words in audio form are translated into a text transcript.


What’s even more amazing is that Alexa can recognize intent: the algorithm matches the input (words and phrases) against a pre-programmed list. Of course, as Alexa’s human, you have to stick to the list of commands that Alexa understands. If you deviate from the list, Alexa will not be able to understand you.
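Here’s a toy sketch of those trigger-word and intent-matching steps. The wake word, phrases, and intent names are all invented for illustration; Amazon’s real pipeline is far more elaborate:

```python
# Toy sketch of trigger-word detection followed by intent matching.
WAKE_WORD = "alexa"
COMMANDS = {
    "play music": "PlayMusicIntent",
    "what's the weather": "GetWeatherIntent",
    "set a timer": "SetTimerIntent",
}

def handle(transcript: str):
    """Ignore speech without the wake word; otherwise look up the intent."""
    text = transcript.lower()
    if WAKE_WORD not in text:
        return None                      # no trigger word, no command expected
    for phrase, intent in COMMANDS.items():
        if phrase in text:
            return intent
    return "unrecognized"                # deviate from the list and she's lost

print(handle("Alexa, play music in the kitchen"))   # -> PlayMusicIntent
print(handle("Alexa, serenade me softly"))          # -> unrecognized
print(handle("Just talking to myself"))             # -> None
```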


Alexa then executes your command by formulating a response in text, breaking it apart into individual sounds, and playing it back through the speaker (your smart device). This is essentially the same process used by software that transcribes audio to text.
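If you’d like to see the recognition half of that process for yourself, here’s a minimal sketch using the open-source SpeechRecognition package for Python. The file name is a placeholder, and this particular recognizer hands the audio off to Google’s free web speech API:

```python
# Minimal sketch using the SpeechRecognition package
# (pip install SpeechRecognition).
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("meeting.wav") as source:   # placeholder audio file
    audio = recognizer.record(source)         # read the whole file

try:
    # Sends the audio to Google's free web speech API and returns text.
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("The software couldn't make sense of the audio.")
```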



How Accurate Is Speech to Text Software?


While we are in awe of what speech to text software can do, there’s still the question of accuracy.


Accuracy is hit or miss when it comes to speech to text software. It depends largely on how clearly the speakers say the words. When speakers deliberately slow down and enunciate every word, you can expect the software to pick up the words accurately.


But this is not how people normally speak when they are in a meeting, in an interview, at a lecture, or on a podcast. They don’t adjust to the limitations of the software, because if they did, they would sound strange.

Accuracy starts to take a hit when factors like slang, accents, and mispronunciations enter the picture. The diversity and complexity of human language can send the software into a technical meltdown, resulting in inaccurate output that oftentimes doesn’t make sense. In short, the product is not client-ready.
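To put a rough number on “accuracy,” the industry usually talks about word error rate (WER): the number of word substitutions, insertions, and deletions divided by the length of the correct transcript. Here’s a toy calculation with made-up sentences:

```python
# Toy word error rate (WER) calculation, the standard transcription
# accuracy metric. Sentences below are invented for illustration.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

human   = "we will circle back on the budget after the client call"
machine = "we will circle back on the budget after the climb at call"
print(f"WER: {wer(human, machine):.0%}")  # roughly 18% on this made-up example
```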


When this happens, we turn to the editing skills of transcribers. Human transcribers.


The accuracy of even the most sophisticated speech to text software is not on par with the accuracy of the human ear. Unlike human transcriptionists, the software doesn’t know context; it doesn’t analyze the words as they relate to the topic or discussion. Sad to say, it is only as good as its algorithm.


Although speech recognition software uses models such as the Hidden Markov Model to handle different pronunciations and accents, it still doesn’t quite match human accuracy.
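For the curious, here’s a toy Viterbi decode over a hand-made hidden Markov model. The phoneme states, sound classes, and probabilities are all invented to show the mechanics, not how a production recognizer is actually built:

```python
# Toy Viterbi decode: hidden states are phonemes, observations are crude
# "sound classes". Every number here is made up.
import math

states = ["ae", "ey"]                       # the same vowel in two accents
start  = {"ae": 0.5, "ey": 0.5}
trans  = {"ae": {"ae": 0.8, "ey": 0.2},
          "ey": {"ae": 0.2, "ey": 0.8}}
emit   = {"ae": {"low": 0.7, "high": 0.3},  # chance each phoneme produces each sound class
          "ey": {"low": 0.2, "high": 0.8}}

def viterbi(observations):
    """Return the most probable sequence of hidden phonemes."""
    best = [{s: math.log(start[s]) + math.log(emit[s][observations[0]])
             for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans[p][s])) for p in states),
                key=lambda x: x[1])
            scores[s] = score + math.log(emit[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards.
    path = [max(best[-1], key=best[-1].get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["low", "low", "high"]))  # -> ['ae', 'ae', 'ae'] for this toy model
```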


We can appreciate that speech to text technology exists to make work processes faster. There’s no doubt that in the future, technology will find a way to narrow the gap. But until then, human transcription is still the most accurate way to convert audio to text.


If you need help with audio file transcription, contact us and we'll be more than happy to assist you.

