Automatic Scoring at Multi-Granularity for L2 Pronunciation

Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang

Automatic pronunciation assessment and error detection play an important role in Computer-Assisted Pronunciation Training (CAPT). Traditional approaches typically score sentences or words, or detect phoneme mispronunciations, independently, without considering the hierarchical and contextual relationships among them. In this paper, we develop a hierarchical network that jointly performs scoring at the phoneme, word and sentence levels. Specifically, we obtain the phoneme scores with a semi-supervised phoneme mispronunciation detection method, the word scores with an attention mechanism, and the sentence scores with a non-linear regression method. To further model the correlation between the sentence and phoneme levels, we optimize the network within a multitask learning (MTL) framework. The proposed framework relies on a small amount of sentence-level labeled data and a much larger amount of unlabeled data. We evaluate the network on a multi-granular dataset of sentences, words and phonemes, recorded by 1,000 Chinese speakers and labeled by three experts. Experimental results show that the scores produced by the proposed method correlate well with those of human raters, with a Pearson correlation coefficient (PCC) of 0.88 at the sentence level and 0.77 at the word level. Furthermore, the semi-supervised phoneme mispronunciation detection achieves an F1-measure comparable to that of our supervised baseline.
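
As a rough illustration of two ideas named in the abstract, the sketch below pools phoneme-level scores into a word-level score with softmax attention weights and then computes a Pearson correlation coefficient between machine and human scores. It is not the authors' implementation: the function names attention_pool and pearson, the per-phoneme logits input, and all numbers are hypothetical, and the attention mechanism in the paper presumably operates on learned representations rather than on raw scores.

```python
import math


def attention_pool(scores, logits):
    """Softmax-weighted average of phoneme scores (hypothetical pooling step).

    scores: per-phoneme scores for one word.
    logits: per-phoneme attention logits (assumed to come from some model).
    """
    if not scores or len(scores) != len(logits):
        raise ValueError("scores and logits must be non-empty and equal length")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return sum(s * e / total for s, e in zip(scores, exps))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: three phonemes in one word, made-up logits.
word_score = attention_pool([0.9, 0.4, 0.8], [0.1, 1.2, 0.3])

# Made-up machine and human scores for four utterances.
pcc = pearson([3.1, 4.0, 2.2, 4.6], [3.0, 4.5, 2.0, 4.4])
```

With equal logits the pooling reduces to a plain average, so the attention weights only matter when the model judges some phonemes to be more informative than others.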

DOI: 10.21437/Interspeech.2020-1282

Cite as: Lin, B., Wang, L., Feng, X., Zhang, J. (2020) Automatic Scoring at Multi-Granularity for L2 Pronunciation. Proc. Interspeech 2020, 3022-3026, DOI: 10.21437/Interspeech.2020-1282.

  author={Binghuai Lin and Liyuan Wang and Xiaoli Feng and Jinsong Zhang},
  title={{Automatic Scoring at Multi-Granularity for L2 Pronunciation}},
  booktitle={Proc. Interspeech 2020},