Noise Robust Acoustic to Articulatory Speech Inversion

Nadee Seneviratne, Ganesh Sivaraman, Vikramjit Mitra, Carol Espy-Wilson

In previous work, we showed that articulatory features derived from a speech inversion system trained on synthetic data can significantly improve the robustness of an automatic speech recognition (ASR) system. This paper presents results from the first of two steps needed to explore whether the same holds for a speech inversion system trained on natural speech. Specifically, we developed a noise-robust, multi-speaker acoustic-to-articulatory speech inversion system. A feedforward neural network was trained with contextualized mel-frequency cepstral coefficients (MFCCs) as the input acoustic features and six tract-variable (TV) trajectories as the output articulatory features. Experiments were performed on the University of Wisconsin X-ray Microbeam (XRMB) database with 8 noise types artificially added at 5 different signal-to-noise ratios (SNRs). System performance was measured as the correlation between estimated and actual TVs. The multi-condition trained system was compared with the clean-speech trained system, and the effect of speech enhancement on TV estimation was also evaluated. Experiments showed a 10% relative improvement in correlation over the baseline clean-speech trained system.
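Two operations named in the abstract, adding noise at a target SNR and scoring estimated TVs by correlation, can be illustrated with a minimal sketch. This is not the authors' code: the function names `add_noise_at_snr` and `tv_correlation` are hypothetical, the noise is assumed to be tiled or trimmed to the utterance length, and the correlation is assumed to be the Pearson coefficient computed per TV.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` at the requested SNR in dB (illustrative helper)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Assumption: tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain g so that 10*log10(speech_power / (g**2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


def tv_correlation(estimated, actual):
    """Pearson correlation per TV; both inputs have shape (frames, n_tvs)."""
    est = np.asarray(estimated, dtype=float)
    act = np.asarray(actual, dtype=float)
    # Remove the per-TV mean before correlating.
    est_c = est - est.mean(axis=0)
    act_c = act - act.mean(axis=0)
    num = (est_c * act_c).sum(axis=0)
    den = np.sqrt((est_c ** 2).sum(axis=0) * (act_c ** 2).sum(axis=0))
    return num / den
```

Averaging the values returned by `tv_correlation` over the six TVs would give a single score per condition. Whether the paper averages in this way is not stated in the abstract.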

DOI: 10.21437/Interspeech.2018-1509

Cite as: Seneviratne, N., Sivaraman, G., Mitra, V., Espy-Wilson, C. (2018) Noise Robust Acoustic to Articulatory Speech Inversion. Proc. Interspeech 2018, 3137-3141, DOI: 10.21437/Interspeech.2018-1509.

  author={Nadee Seneviratne and Ganesh Sivaraman and Vikramjit Mitra and Carol Espy-Wilson},
  title={Noise Robust Acoustic to Articulatory Speech Inversion},
  booktitle={Proc. Interspeech 2018},