Investigating Effective Additional Contextual Factors in DNN-Based Spontaneous Speech Synthesis

Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari


In this paper, we investigate the effectiveness of using rich annotations in deep neural network (DNN)-based statistical speech synthesis. General text-to-speech synthesis frameworks for reading-style speech use text-dependent information referred to as context. However, to achieve more human-like speech synthesis, paralinguistic and nonlinguistic features should also be taken into account. We focus on adding contextual features to the input features of DNN-based speech synthesis, using a spontaneous speech corpus with rich tags that include paralinguistic and nonlinguistic features such as prosody, disfluency, and morphological features. Through experimental evaluations, we investigate the effectiveness of these additional contextual factors and show which factors make the synthesized speech more natural as spontaneous speech. The results can serve as a guide to data collection for speech synthesis.


DOI: 10.21437/Interspeech.2020-2469

Cite as: Yamashita, Y., Koriyama, T., Saito, Y., Takamichi, S., Ijima, Y., Masumura, R., Saruwatari, H. (2020) Investigating Effective Additional Contextual Factors in DNN-Based Spontaneous Speech Synthesis. Proc. Interspeech 2020, 3201-3205, DOI: 10.21437/Interspeech.2020-2469.


@inproceedings{Yamashita2020,
  author={Yuki Yamashita and Tomoki Koriyama and Yuki Saito and Shinnosuke Takamichi and Yusuke Ijima and Ryo Masumura and Hiroshi Saruwatari},
  title={{Investigating Effective Additional Contextual Factors in DNN-Based Spontaneous Speech Synthesis}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3201--3205},
  doi={10.21437/Interspeech.2020-2469},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2469}
}