Fourth Workshop on Child, Computer and Interaction (WOCCI 2014)
Adult-targeted automatic speech recognition (ASR) has advanced significantly in recent years and can produce speech-to-text output with a very low word error rate (WER) for multiple languages and in various noisy environments, e.g. car noise, living-room noise, and outdoor noise. For child speech, however, little is available that matches the performance of adult-targeted ASR, because building an ASR system for naturally spoken, spontaneous, continuous child speech requires a considerable amount of data. In this study, we show that, using a minimal amount of data, multiple components of a state-of-the-art adult-centric large vocabulary continuous speech recognition (LVCSR) system can be adapted to form a child-specific LVCSR system. The resulting ASR system improves accuracy for children speaking US English to living-room electronic devices (LRED), e.g. a voice-operated TV or computer. The techniques we explore in this paper include vocal tract length normalization, acoustic model adaptation, language model adaptation with child-specific content lists and grammars, and a neural-network-based approach to automatically classifying child data. Combined, these techniques yield a child-specific ASR system for the LRED domain with a relative WER improvement of 27.2% compared to adult-targeted models.
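The relative WER improvement quoted in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions: WER is the word-level edit distance divided by the number of reference words, and the relative improvement is the WER reduction divided by the baseline WER. The function names, the example utterance, and the WER values below are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + substitution_cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by the adapted system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical utterance: one deleted word out of four reference words.
example_wer = word_error_rate("play the next episode", "play next episode")

# Hypothetical WERs in percent (illustrative only, not the paper's results).
example_gain = relative_wer_improvement(20.0, 15.0)
```

Under this definition, a relative improvement is measured against the baseline error rate rather than as a difference in absolute WER points, which is why it can be large even when the absolute change is modest.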
Index Terms: children's speech, automatic speech recognition, acoustic adaptation, language model adaptation, large vocabulary continuous speech recognition.
Bibliographic reference. Gray, Sharmistha S. / Willett, Daniel / Lu, Jianhua / Pinto, Joel / Maergner, Paul / Bodenstab, Nathan (2014): "Child automatic speech recognition for US English: child interaction with living-room-electronic-devices", In WOCCI-2014, 21-26.