Human-robot vocal interaction is composed of three functional subsets: Speech to text, Text processing, and Text to speech.
The speech to text subset converts the voice signal into text according to a given language (dictionary and grammar) and additional constraints (a list of keywords or sentences). In the current video we use the PocketSphinx speech to text library. Its performance is good but degrades quickly in a noisy environment. We plan to use a commercial product from Nuance to improve our voice recognition performance.
The Text processing consists of extracting relevant information from the text (such as the command and the specific object) in order to build a machine-readable order. To do so, the Text processing block needs to interact with the human and ask for confirmation of each step.
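As an illustration of this step, here is a minimal sketch in Python of how a recognized sentence could be turned into a machine-readable order and a confirmation question. It is not the project's actual implementation: the vocabularies ACTIONS and OBJECTS, the Order structure, and the function names are all hypothetical and chosen only for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical vocabularies (assumption): the real lists would come from
# the tasks and objects the robot actually knows about.
ACTIONS = {"bring", "take", "find", "go"}
OBJECTS = {"bottle", "cup", "book", "kitchen"}
AFFIRMATIVE = {"yes", "ok", "okay", "correct"}


@dataclass
class Order:
    """Machine-readable order: one action applied to one target."""
    action: str
    target: str


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic words only (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())


def parse_order(text: str) -> Optional[Order]:
    """Return the first known action and object found, or None if either is missing."""
    words = tokenize(text)
    action = next((w for w in words if w in ACTIONS), None)
    target = next((w for w in words if w in OBJECTS), None)
    if action is None or target is None:
        return None
    return Order(action, target)


def confirmation_prompt(order: Order) -> str:
    """Build the question the robot asks before executing the order."""
    return f"Do you want me to {order.action} the {order.target}?"


def is_confirmed(reply: str) -> bool:
    """Treat the reply as a confirmation if it contains an affirmative word."""
    return any(w in AFFIRMATIVE for w in tokenize(reply))
```

In such a design, a sentence for which parse_order returns None would lead the robot to ask the human to repeat or rephrase, and the order would only be executed once is_confirmed returns True for the human's reply.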
To interact with the human, the Text to speech subset transforms textual information into vocal data. We currently use two existing libraries, eSpeak and MBROLA, which offer good text to speech conversion.
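For illustration, the sketch below shows one way the eSpeak command-line tool could be called from Python with an MBROLA voice. It is an assumption about the setup, not a description of the project's code: the voice name mb-en1 and the speaking rate are example values, the helper names are hypothetical, and the espeak and mbrola programs (with the matching voice data) are assumed to be installed.

```python
import shutil
import subprocess
from typing import List


def build_espeak_command(text: str, voice: str = "mb-en1", speed_wpm: int = 140) -> List[str]:
    """Build the espeak command line: -v selects the voice, -s the speed in words per minute."""
    return ["espeak", "-v", voice, "-s", str(speed_wpm), text]


def speak(text: str) -> bool:
    """Speak the text aloud; return False if espeak is not installed on this machine."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(build_espeak_command(text), check=True)
    return True
```

A call such as speak("Do you want me to bring the cup?") would then voice the confirmation question produced by the Text processing block.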