Fourth Workshop on Child, Computer and Interaction (WOCCI 2014)

September 19, 2014

Child Automatic Speech Recognition for US English: Child Interaction with Living-Room-Electronic-Devices

Sharmistha S. Gray (1), Daniel Willett (2), Jianhua Lu (1), Joel Pinto (2), Paul Maergner (2), Nathan Bodenstab (1)

(1) Nuance Communications Inc., Burlington, USA
(2) Nuance Communications GmbH, Aachen, Germany

Adult-targeted automatic speech recognition (ASR) has advanced significantly in recent years and can produce speech-to-text output with a very low word error rate (WER) for multiple languages and in various noisy environments, e.g. car noise, living-room noise, and outdoor noise. For child speech, however, little is available at the performance level of adult-targeted ASR. Building an ASR system for naturally spoken, spontaneous, continuous child speech requires a considerable amount of data. In this study, we show that, using a minimal amount of data, multiple components of a state-of-the-art adult-centric large vocabulary continuous speech recognition (LVCSR) system can be adapted to form a child-specific LVCSR system. The resulting ASR system improves accuracy for children speaking US English to living room electronic devices (LRED), e.g. a voice-operated TV or computer. The techniques we explore include vocal tract length normalization, acoustic model adaptation, language model adaptation with child-specific content lists and grammars, and a neural network based approach to automatically classify child data. Combined, these steps towards a child-specific ASR system for the LRED domain yield a relative WER improvement of 27.2% compared to adult-targeted models.

Index Terms: children’s speech, automatic speech recognition, acoustic adaptation, language model adaptation, large vocabulary continuous speech recognition.
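
The abstract reports its headline result as a relative WER improvement. As a reading aid, the following is a minimal sketch of how WER and a relative improvement are conventionally computed. It assumes the standard word-level edit-distance definition of WER; the function names and the example utterances are hypothetical illustrations and are not taken from the paper's data or scoring tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes whitespace tokenization; real scoring tools usually also
    normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction by which the adapted system lowers the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical utterance: one deleted word out of four reference words.
    print(word_error_rate("play the next cartoon", "play next cartoon"))
    # Hypothetical WERs, chosen only to illustrate the formula.
    print(relative_wer_improvement(0.20, 0.15))
```

Note that a relative improvement is measured against the baseline WER rather than as an absolute difference in percentage points, so the same relative figure can correspond to very different absolute WER changes.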

