Speech Recognition has been evolving over the last few decades. The recent interest in Machine Learning technologies has also given a new boost to the speech recognition domain. There are a few Speech Recognition solutions available in the market; the major contenders are still big companies such as Microsoft, Google, and IBM.
The need for custom models for Speech Recognition comes from the fact that the available models are intentionally made for general, broad-domain use and are trained for a near-perfect environment (meaning almost no background noise, regular speech patterns, and optimum audio quality). General usage is fantastic and completely acceptable for broad-domain applications, such as transcribing casual conversations that aren't related to a specific domain, but it sometimes results in a drop in accuracy for specialized use cases.
Generally, there are two different types of custom model:

The first, a custom language model, requires corpora related to the domain of the conversations to be transcribed. The model then trains itself on these sentences, extracts the vocabulary, and identifies the domain they relate to.

The second, a custom acoustic model, trains on provided audio recorded in a real-world environment (with background or microphone noise and non-native accents).
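Both model types are created through Watson's Speech to Text REST API. As a minimal sketch, the helpers below just build the HTTP requests for creating each type of custom model; `SERVICE_URL` is a placeholder for your own service instance URL, and the function names are ours, not part of any SDK.

```python
# Placeholder: replace with the URL of your own Speech to Text instance.
SERVICE_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com"

def create_language_model_request(name, base_model="en-US_BroadbandModel"):
    """Build the request for creating a new custom language model.

    Watson's language-model customization endpoint is POST /v1/customizations,
    with the model's name and the base model it extends in the JSON body.
    """
    return {
        "method": "POST",
        "url": f"{SERVICE_URL}/v1/customizations",
        "json": {"name": name, "base_model_name": base_model},
    }

def create_acoustic_model_request(name, base_model="en-US_BroadbandModel"):
    """Build the request for creating a new custom acoustic model.

    The acoustic counterpart lives at POST /v1/acoustic_customizations.
    """
    return {
        "method": "POST",
        "url": f"{SERVICE_URL}/v1/acoustic_customizations",
        "json": {"name": name, "base_model_name": base_model},
    }
```

In real use you would send each request with an HTTP client and your API key (for example, `requests.request(**req, auth=("apikey", API_KEY))`), then add corpora or audio to the returned customization ID.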
The drop in accuracy can be better explained with a hypothetical human scenario: an IT expert finds himself among a group of cardiac specialists, and one of them cracks a joke about a defibrillator. All of them chuckle while he doesn't even know what a defibrillator is. Context matters a lot, and the same goes for speech recognition services.
If the target is to achieve higher accuracy with your solution, it is preferable to identify the domain to which the conversation relates. This narrows down the vocabulary the engine has to consider when transcribing, and in return we achieve greater efficiency and improved accuracy.
In order to accomplish higher accuracy, we tried IBM's Watson, Microsoft's Bing, and Google's speech to text APIs, and decided to move forward with IBM Watson's Speech to Text due to its accuracy and other features. Although Bing's Speech API also provides support for custom language and acoustic models, Watson's general accuracy is what made it stand out for us.
A detailed comparison is available in the speech to text services comparison chart.
IBM's Watson supports both of these custom model types and provides an interface for adding custom models to your recognition service.
As mentioned earlier, using a custom acoustic model trained on an environment similar to yours will increase the accuracy of your transcriptions. The cons that one can think of are having to provide the initial environment samples, which should be at least 10 minutes of audio (50 minutes recommended), and inaccurate transcriptions of domain-specific vocabulary.
Similarly, utilizing a custom language model can increase accuracy in terms of vocabulary by narrowing the corpora for the conversation. But this model does not take the environment or the accent into account: the resulting custom language model would work great with conversations in native accents and a perfect environment, but would show decreased accuracy with non-native pronunciations and bad audio quality.
So, based on this, both models can be utilized in combination with each other. Let's say we have an audio conversation for which the correct transcription is available: the audio can train the acoustic model while the transcription feeds the language model. This combination of acoustic and language models would result in greater transcription accuracy for conversations in relatively noisy environments, together with the narrowed-down corpora of the language model.
Coming back to Watson, Speech to Text provides an interface to add custom models to your recognition service. The engine is trained on these models; after training is complete, recognition requests that use these models will produce more accurate transcriptions. As described above, Watson provides both types of custom model.
Which one to use? Both. The custom acoustic model needs at least 10 minutes of audio to start with. The transcriptions of those audio files can be added to a custom language model, and the two can then be trained in combination. The closer the speech is to the environment and domain the models were trained on, the greater the transcription accuracy.
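One practical detail of this workflow: after a corpus or audio resource is added and training is started, a custom model passes through states such as `pending` and `training` before Watson reports it as `available`, and it cannot be used for recognition until then. A hedged polling sketch (the helper is ours; in real use `get_status` would wrap a GET on `/v1/customizations/{id}` or `/v1/acoustic_customizations/{id}`):

```python
import time

def wait_until_available(get_status, poll_seconds=10, max_polls=60):
    """Poll a custom model's status until it becomes 'available'.

    get_status is any callable returning the model's current status string
    (e.g. 'pending', 'training', 'available'). Returns True once the model
    is available, or False if max_polls is exhausted first.
    """
    for _ in range(max_polls):
        if get_status() == "available":
            return True
        time.sleep(poll_seconds)
    return False
```

Training the language model first, waiting for it to become available, and then doing the same for the acoustic model keeps the two steps from stepping on each other.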