Speech Rate Task-Specific Representation Learning from Acoustic-Articulatory Data

Renuka Mannem, Hima Jyothi R., Aravind Illa, Prasanta Kumar Ghosh

In this work, speech rate is estimated using task-specific representations learned from acoustic-articulatory data, in contrast to generic representations, which may not be optimal for speech rate estimation. 1-D convolutional filters are used to learn speech-rate-specific acoustic representations from raw speech. A convolutional dense neural network (CDNN) is used to estimate the speech rate from the learned representations. In practice, articulatory data is not directly available; thus, we use Acoustic-to-Articulatory Inversion (AAI) to derive articulatory representations from acoustics. However, these pseudo-articulatory representations are also generic and not optimized for any task. To learn speech-rate-specific pseudo-articulatory representations, we propose joint training of a BLSTM-based AAI model and the CDNN using a weighted loss function that combines the losses for speech rate estimation and articulatory prediction. The proposed model improves speech rate estimation by ~18.5% in terms of Pearson correlation coefficient (CC) compared to the baseline CDNN model with generic articulatory representations as inputs. To utilize complementary information from articulatory features, we further perform experiments by concatenating task-specific acoustic and pseudo-articulatory representations, which yields an improvement in CC of ~2.5% compared to the baseline CDNN model.
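The weighted loss and the CC metric mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so several things here are assumptions: mean squared error for both terms, a convex combination controlled by a hypothetical weight `alpha`, and the helper names `mse`, `joint_loss`, and `pearson_cc`. It is not the authors' implementation.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(rate_pred, rate_true, artic_pred, artic_true, alpha=0.5):
    """Weighted sum of a speech rate loss and an articulatory prediction loss.

    Assumption: `alpha` in [0, 1] weights the speech rate term and
    (1 - alpha) weights the articulatory term; the paper's actual
    weighting scheme may differ.
    """
    rate_loss = mse(rate_pred, rate_true)
    artic_loss = mse(artic_pred, artic_true)
    return alpha * rate_loss + (1.0 - alpha) * artic_loss


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)
```

With `alpha=1.0` the articulatory term vanishes and only the speech rate error drives training, while smaller values keep the pseudo-articulatory outputs closer to the measured articulatory trajectories.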

DOI: 10.21437/Interspeech.2020-2259

Cite as: Mannem, R., R., H.J., Illa, A., Ghosh, P.K. (2020) Speech Rate Task-Specific Representation Learning from Acoustic-Articulatory Data. Proc. Interspeech 2020, 2892-2896, DOI: 10.21437/Interspeech.2020-2259.

  author={Renuka Mannem and Hima Jyothi R. and Aravind Illa and Prasanta Kumar Ghosh},
  title={{Speech Rate Task-Specific Representation Learning from Acoustic-Articulatory Data}},
  booktitle={Proc. Interspeech 2020},