Automatic Miscue Detection Using RNN Based Models with Data Augmentation

Yoon Seok Hong, Kyung Seo Ki, Gahgene Gweon

This study proposes using data augmentation to address the shortage of data in miscue detection tasks. The method has three main steps. First, a phoneme classifier was developed to obtain force-aligned data, which was then used for miscue classification and data augmentation. To build the phoneme classifier, phonetic features of the “Seoul Reading Speech” (SRS) corpus were extracted using grapheme-to-phoneme (G2P) conversion and used to train CNN-based models. Second, to obtain a miscue-labeled corpus, we performed data augmentation on the phoneme classifier output, producing an artificially generated miscue corpus from SRS (modified-SRS). This miscue corpus was created by randomly deleting or modifying sound sections according to three miscue categories: extension (EXT), pause (PAU), and pre-correction (PRE). Third, three types of RNN-based models (LSTM, BiLSTM, BiGRU) were trained on the modified-SRS corpus, and the resulting miscue classifiers were evaluated. The results show that the BiGRU model achieved the best F1-score on augmented data (0.819), while the BiLSTM model achieved the best F1-score on real data (0.512).
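
As a rough illustration of the augmentation step, the following minimal sketch inserts one synthetic miscue into a waveform, given segment boundaries from forced alignment. The abstract does not specify how each category is synthesized, so everything here is an assumption made for illustration: the function name `add_miscue`, the list-of-floats waveform, and the mapping of EXT to a crude stretch, PAU to inserted silence, and PRE to a repeated onset (a false start) are all hypothetical and are not the authors' implementation.

```python
import random

def add_miscue(wave, segments, kind, rng, sample_rate=16000):
    """Return (new_wave, (start, end)) with one synthetic miscue inserted.

    wave:     list of float samples
    segments: list of (start, end) sample indices, e.g. from forced alignment
    kind:     "EXT", "PAU", or "PRE"
    The operation chosen for each category is an illustrative assumption.
    """
    start, end = rng.choice(segments)  # pick a random aligned section
    seg = wave[start:end]

    if kind == "EXT":
        # Assumed extension: lengthen the section by repeating each sample.
        stretched = [s for s in seg for _ in range(2)]
        return wave[:start] + stretched + wave[end:], (start, start + len(stretched))

    if kind == "PAU":
        # Assumed pause: insert half a second of silence after the section.
        gap = [0.0] * (sample_rate // 2)
        return wave[:end] + gap + wave[end:], (end, end + len(gap))

    if kind == "PRE":
        # Assumed pre-correction: play the first half of the section
        # as a false start, then the full section.
        onset = seg[: len(seg) // 2]
        return wave[:start] + onset + wave[start:], (start, start + len(onset))

    raise ValueError(f"unknown miscue kind: {kind}")

# Toy usage with a 10-sample "waveform" and one aligned section.
wave = [0.1 * i for i in range(10)]
segments = [(2, 6)]
rng = random.Random(0)
ext_wave, ext_span = add_miscue(wave, segments, "EXT", rng)
pau_wave, pau_span = add_miscue(wave, segments, "PAU", rng, sample_rate=4)
pre_wave, pre_span = add_miscue(wave, segments, "PRE", rng)
```

The returned span marks where the synthetic miscue sits in the new waveform, which is the kind of label a frame-level or segment-level RNN classifier could be trained on.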

 DOI: 10.21437/Interspeech.2018-1644

Cite as: Hong, Y.S., Ki, K.S., Gweon, G. (2018) Automatic Miscue Detection Using RNN Based Models with Data Augmentation. Proc. Interspeech 2018, 1646-1650, DOI: 10.21437/Interspeech.2018-1644.
