Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition

Shansong Liu, Xurong Xie, Jianwei Yu, Shoukang Hu, Mengzhe Geng, Rongfeng Su, Shi-Xiong Zhang, Xunying Liu, Helen Meng

Audio-visual speech recognition (AVSR) technologies have been successfully applied to a wide range of tasks. When developing AVSR systems for disordered speech, which is characterized by severe degradation of voice quality and a large mismatch against normal speech, it is difficult to record large amounts of high-quality audio-visual data. In order to address this issue, a cross-domain visual feature generation approach is proposed in this paper. An audio-to-visual inversion DNN system constructed using widely available out-of-domain audio-visual data was used to generate visual features for disordered speakers for whom video data is either very limited or unavailable. Experiments conducted on the UASpeech corpus suggest that the proposed cross-domain visual feature generation based AVSR system consistently outperformed both the baseline ASR system and the AVSR system using original visual features. An overall word error rate reduction of 3.6% absolute (14% relative) was obtained over the previously published best system on the 8 UASpeech dysarthric speakers with audio-visual data of the same task.

DOI: 10.21437/Interspeech.2020-2282

Cite as: Liu, S., Xie, X., Yu, J., Hu, S., Geng, M., Su, R., Zhang, S., Liu, X., Meng, H. (2020) Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition. Proc. Interspeech 2020, 711-715, DOI: 10.21437/Interspeech.2020-2282.

@inproceedings{liu20_interspeech,
  author={Shansong Liu and Xurong Xie and Jianwei Yu and Shoukang Hu and Mengzhe Geng and Rongfeng Su and Shi-Xiong Zhang and Xunying Liu and Helen Meng},
  title={{Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition}},
  booktitle={Proc. Interspeech 2020},
  year={2020},
  pages={711--715},
  doi={10.21437/Interspeech.2020-2282}
}