UNSW System Description for the Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech

Mostafa Shahin, Renée Lu, Julien Epps, Beena Ahmed


In this paper, we describe our children’s Automatic Speech Recognition (ASR) system for the first shared task on ASR for English non-native children’s speech. The acoustic model comprises 6 Convolutional Neural Network (CNN) layers and 12 Factored Time-Delay Neural Network (TDNN-F) layers, trained on data from 5 different children’s speech corpora. Speed perturbation, Room Impulse Response (RIR), babble noise and non-speech noise data augmentation methods were used to improve model robustness. Three Language Models (LMs) were employed: an in-domain LM trained on written data and speech transcriptions of non-native children, an LM trained on non-native written data and transcriptions of both native and non-native children’s speech, and a TEDLIUM LM trained on transcriptions of adult TED talks. Lattices produced by the different ASR systems were combined and decoded using the Minimum Bayes-Risk (MBR) decoding algorithm to obtain the final output. Our system achieved a final Word Error Rate (WER) of 17.55% and 16.59% on the development and test sets, respectively, and ranked second among the 10 teams participating in the task.
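
For readers unfamiliar with the metric reported above, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal illustrative sketch only. It is not the scoring tool used in the shared task, the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by whitespace splitting; real scoring
    pipelines usually normalize case, punctuation and fillers first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(f"WER = {word_error_rate(ref, hyp):.2%}")
```

Because insertions are counted against the reference length, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.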


DOI: 10.21437/Interspeech.2020-3111

Cite as: Shahin, M., Lu, R., Epps, J., Ahmed, B. (2020) UNSW System Description for the Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech. Proc. Interspeech 2020, 265-268, DOI: 10.21437/Interspeech.2020-3111.


@inproceedings{Shahin2020,
  author={Mostafa Shahin and Renée Lu and Julien Epps and Beena Ahmed},
  title={{UNSW System Description for the Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={265--268},
  doi={10.21437/Interspeech.2020-3111},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3111}
}