Improving Mispronunciation Detection for Non-Native Learners with Multisource Information and LSTM-Based Deep Models

Wei Li, Nancy F. Chen, Sabato Marco Siniscalchi, Chin-Hui Lee


In this paper, we utilize manner and place of articulation features together with deep neural network (DNN) models equipped with long short-term memory (LSTM) to improve the detection of phonetic mispronunciations produced by second-language learners. First, we show that speech attribute scores are complementary to conventional phone scores and can therefore be concatenated with them as features to improve a baseline system based on phone information alone. Next, the pronunciation representation, usually obtained by frame-level averaging in a DNN, is instead learned by an LSTM, which exploits sequential context directly to embed a sequence of pronunciation scores into a pronunciation vector, improving the performance of the subsequent mispronunciation detectors. Finally, when both proposed techniques are incorporated into the baseline phone-based goodness-of-pronunciation (GOP) classifier system trained on the same data, the integrated system reduces the false acceptance rate (FAR) and false rejection rate (FRR) by 37.90% and 38.44% (relative), respectively, compared with the baseline system.
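The two core ingredients named above, segment-level GOP-style scoring and embedding a sequence of pronunciation scores into a fixed-length vector with an LSTM, can be sketched in a few lines of NumPy. The log-posterior-ratio form of GOP, the tiny dimensions, and all variable names here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gop_score(posteriors, canonical_phone, eps=1e-10):
    """Duration-normalized log-posterior ratio for the canonical phone.

    posteriors: (num_frames, num_phones) frame-level phone posteriors.
    Near 0 when the canonical phone dominates every frame; increasingly
    negative as competing phones win frames.
    """
    canonical = np.log(posteriors[:, canonical_phone] + eps)
    best = np.log(posteriors.max(axis=1) + eps)
    return float(np.mean(canonical - best))

def lstm_embed(score_seq, Wx, Wh, b, hidden=4):
    """Embed a sequence of per-phone score vectors into one pronunciation
    vector: a single forward LSTM pass, returning the final hidden state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in score_seq:
        z = Wx @ x + Wh @ h + b          # stacked pre-activations [i, f, o, g]
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)       # update cell state
        h = o * np.tanh(c)               # new hidden state
    return h

# Toy phone segment: 4 frames, 3 phones; phone 0 is the canonical target.
post = np.array([
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.30, 0.60, 0.10],   # a frame where a competing phone wins
    [0.90, 0.05, 0.05],
])
print(gop_score(post, canonical_phone=0))   # < 0: some evidence of mispronunciation

# Toy utterance: 5 phone segments, each with a 3-dim score vector
# (e.g., a phone GOP score plus two speech-attribute scores).
rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 3))
Wx = 0.1 * rng.standard_normal((16, 3))     # 4 gates x hidden=4 rows
Wh = 0.1 * rng.standard_normal((16, 4))
b = np.zeros(16)
vec = lstm_embed(scores, Wx, Wh, b)         # fixed-length pronunciation vector
```

A frame-averaged baseline in this sketch would replace `lstm_embed` with `scores.mean(axis=0)`; the LSTM keeps the ordering information that averaging discards, which is the paper's motivation for the sequential embedding.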


DOI: 10.21437/Interspeech.2017-464

Cite as: Li, W., Chen, N.F., Siniscalchi, S.M., Lee, C. (2017) Improving Mispronunciation Detection for Non-Native Learners with Multisource Information and LSTM-Based Deep Models. Proc. Interspeech 2017, 2759-2763, DOI: 10.21437/Interspeech.2017-464.


@inproceedings{Li2017,
  author={Wei Li and Nancy F. Chen and Sabato Marco Siniscalchi and Chin-Hui Lee},
  title={Improving Mispronunciation Detection for Non-Native Learners with Multisource Information and LSTM-Based Deep Models},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2759--2763},
  doi={10.21437/Interspeech.2017-464},
  url={http://dx.doi.org/10.21437/Interspeech.2017-464}
}