Incremental TTS for Japanese Language

Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura

Simultaneous lecture translation requires speech to be translated in real time, before the speaker has finished an entire sentence, because a long delay makes it difficult for listeners to follow the lecture. The challenge is to construct a full-fledged system with speech recognition, machine translation, and text-to-speech synthesis (TTS) components that can produce high-quality speech translations on the fly. This poses a particular problem for TTS, because a conventional framework commonly requires the language-dependent linguistic context of a full sentence to produce a natural-sounding speech waveform. Several studies have proposed approaches to incremental TTS (ITTS), which estimates the target prosody from only partial knowledge of the sentence. However, most investigations have been conducted only for French, English, and German. French is a syllable-timed language, and the other two are stress-timed languages. Japanese, which is a mora-timed language, has not been investigated so far. In this paper, we evaluate the quality of Japanese synthesized speech based on various linguistic and temporal incremental units. Experimental results reveal that an accent phrase incremental unit (a group of moras) is essential for a Japanese ITTS as a trade-off between quality and synthesis units.

DOI: 10.21437/Interspeech.2018-1561

Cite as: Yanagita, T., Sakti, S., Nakamura, S. (2018) Incremental TTS for Japanese Language. Proc. Interspeech 2018, 902-906, DOI: 10.21437/Interspeech.2018-1561.

  author={Tomoya Yanagita and Sakriani Sakti and Satoshi Nakamura},
  title={Incremental TTS for Japanese Language},
  booktitle={Proc. Interspeech 2018},