Articulatory-to-speech Conversion Using Bi-directional Long Short-term Memory

Fumiaki Taguchi, Tokihiko Kaburagi

Methods for synthesizing speech sounds from the motion of articulatory organs can be used to produce substitute speech for people who have undergone laryngectomy. To achieve this goal, feature parameters representing the spectral envelope of speech, which is directly related to the acoustic characteristics of the vocal tract, have been estimated from articulatory movements. Within this framework, speech can be synthesized by driving a filter derived from the spectral envelope with noise signals. In the current study, we examined an alternative method that generates speech sounds directly from the motion pattern of articulatory organs based on the implicit relationships between articulatory movements and the source signal of speech. These implicit relationships were estimated by considering that articulatory movements are involved in phonological representations of speech, which are also related to sound source information such as the temporal pattern of pitch and the voiced/unvoiced flag. We developed a method for simultaneously estimating the spectral envelope and sound source parameters from articulatory data obtained with electromagnetic articulography (EMA) sensors. Furthermore, an objective evaluation of the estimated speech parameters and a subjective evaluation of the word error rate were performed to examine the effectiveness of our method.
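The frame-wise mapping described above can be sketched as a bidirectional LSTM that reads EMA trajectories and emits spectral-envelope and source parameters for each frame. The abstract does not specify the network configuration, so the dimensions below (a 12-dimensional EMA input, and 25 mel-cepstral coefficients plus log-F0 and a voiced/unvoiced flag as the 27-dimensional output) are illustrative assumptions; this minimal NumPy sketch shows only the forward pass, not training or waveform synthesis.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(x, W, U, b):
    """Single-direction LSTM over x of shape (T, D); returns hidden states (T, H)."""
    T = x.shape[0]
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    hs = np.empty((T, H))
    for t in range(T):
        z = W @ x[t] + U @ h + b          # all four gate pre-activations at once
        i = sigmoid(z[:H])                # input gate
        f = sigmoid(z[H:2 * H])           # forget gate
        g = np.tanh(z[2 * H:3 * H])       # candidate cell state
        o = sigmoid(z[3 * H:])            # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        hs[t] = h
    return hs

def init_params(rng, d_in, d_hid, d_out):
    """Random weights for a one-layer BiLSTM plus a linear output projection."""
    def lstm_weights():
        return (rng.standard_normal((4 * d_hid, d_in)) * 0.1,
                rng.standard_normal((4 * d_hid, d_hid)) * 0.1,
                np.zeros(4 * d_hid))
    return {"fwd": lstm_weights(), "bwd": lstm_weights(),
            "Wo": rng.standard_normal((2 * d_hid, d_out)) * 0.1,
            "bo": np.zeros(d_out)}

def bilstm_articulatory_to_speech(ema, p):
    """Map EMA features (T, d_in) to speech parameters (T, d_out)."""
    hf = lstm_pass(ema, *p["fwd"])               # left-to-right context
    hb = lstm_pass(ema[::-1], *p["bwd"])[::-1]   # right-to-left context
    h = np.concatenate([hf, hb], axis=1)         # (T, 2 * d_hid)
    return h @ p["Wo"] + p["bo"]

rng = np.random.default_rng(0)
params = init_params(rng, d_in=12, d_hid=64, d_out=27)
ema = rng.standard_normal((100, 12))   # 100 frames of assumed 12-dim EMA features
speech_params = bilstm_articulatory_to_speech(ema, params)
print(speech_params.shape)  # (100, 27)
```

Because both temporal directions are concatenated before the output projection, each frame's spectral and source estimates can draw on future as well as past articulatory context, which is the motivation for choosing a bidirectional rather than unidirectional LSTM.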

DOI: 10.21437/Interspeech.2018-999

Cite as: Taguchi, F., Kaburagi, T. (2018) Articulatory-to-speech Conversion Using Bi-directional Long Short-term Memory. Proc. Interspeech 2018, 2499-2503, DOI: 10.21437/Interspeech.2018-999.

@inproceedings{taguchi2018articulatory,
  author={Fumiaki Taguchi and Tokihiko Kaburagi},
  title={Articulatory-to-speech Conversion Using Bi-directional Long Short-term Memory},
  booktitle={Proc. Interspeech 2018},
  year={2018},
  pages={2499--2503},
  doi={10.21437/Interspeech.2018-999}
}