Acoustic-to-Articulatory Inversion with Deep Autoregressive Articulatory-WaveNet

Narjes Bozorg, Michael T. Johnson


This paper presents a novel deep autoregressive method for Acoustic-to-Articulatory Inversion called Articulatory-WaveNet. Traditional methods such as the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) do not model the frame-level interdependency of observations. We address this problem by introducing Articulatory-WaveNet, which uses dilated causal convolutional layers to predict articulatory trajectories from acoustic feature sequences. The new model achieves an average Root Mean Square Error (RMSE) of 1.08 mm and a correlation of 0.82 on the English speaker subset of the ElectroMagnetic Articulography-Mandarin Accented English (EMA-MAE) corpus. Articulatory-WaveNet represents an improvement of 59% in RMSE and 30% in correlation over the previous GMM-HMM based inversion model. To the best of our knowledge, this paper introduces the first application of a WaveNet synthesis approach to the problem of Acoustic-to-Articulatory Inversion, and its results are comparable to or better than the best currently published systems.
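The abstract's two key ingredients, dilated causal convolution (so each output frame depends only on current and past frames, with an exponentially growing receptive field when layers are stacked) and the RMSE/correlation evaluation metrics, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the filter values, sequence length, and single-channel setup are illustrative assumptions only.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Causal 1-D convolution: the output at frame t depends only on
    frames <= t, sampled `dilation` steps apart (WaveNet-style)."""
    k = len(w)
    pad = (k - 1) * dilation                 # left-pad so no future frames leak in
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

def rmse(pred, true):
    """Root Mean Square Error between predicted and measured trajectories."""
    return np.sqrt(np.mean((pred - true) ** 2))

def correlation(pred, true):
    """Pearson correlation between predicted and measured trajectories."""
    return np.corrcoef(pred, true)[0, 1]

# Toy acoustic feature sequence; stacking layers with dilations 1, 2, 4, ...
# doubles the receptive field at each layer, which is how WaveNet-style
# models capture long-range frame-level context.
x = np.sin(np.linspace(0, 4 * np.pi, 64))
h = causal_dilated_conv1d(x, np.array([0.5, 0.5]), dilation=1)
h = causal_dilated_conv1d(h, np.array([0.5, 0.5]), dilation=2)
```

Stacking two 2-tap layers with dilations 1 and 2 already gives each output frame a receptive field of 4 past frames; the actual model stacks many such layers and conditions on acoustic features rather than a toy sine wave.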


 DOI: 10.21437/Interspeech.2020-1875

Cite as: Bozorg, N., Johnson, M.T. (2020) Acoustic-to-Articulatory Inversion with Deep Autoregressive Articulatory-WaveNet. Proc. Interspeech 2020, 3725-3729, DOI: 10.21437/Interspeech.2020-1875.


@inproceedings{Bozorg2020,
  author={Narjes Bozorg and Michael T. Johnson},
  title={{Acoustic-to-Articulatory Inversion with Deep Autoregressive Articulatory-WaveNet}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3725--3729},
  doi={10.21437/Interspeech.2020-1875},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1875}
}