End-to-End Text-to-Speech Synthesis with Unaligned Multiple Language Units Based on Attention

Masashi Aso, Shinnosuke Takamichi, Hiroshi Saruwatari


This paper presents the use of unaligned multiple language units for end-to-end text-to-speech (TTS). End-to-end TTS is a promising technology in that it does not require intermediate representations such as prosodic contexts. However, it tends to cause mispronunciations and unnatural prosody. To alleviate this problem, previous methods have used multiple language units, e.g., phonemes and characters, but required these units to be hard-aligned. In this paper, we propose a multi-input attention structure that simultaneously accepts multiple language units without alignments among them. We consider using not only traditional phonemes and characters but also subwords tokenized in a language-independent manner. We also propose a progressive training strategy to deal with the unaligned multiple language units. The experimental results demonstrated that our model and training strategy improve speech quality.
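
The core idea, a separate attention module per language-unit sequence whose context vectors the decoder combines at every step, can be sketched as follows. This is a minimal illustration in PyTorch, assuming additive (Bahdanau-style) attention per stream and concatenation of the per-stream contexts; the paper's exact attention mechanism and combination rule may differ, and all class, function, and parameter names here are hypothetical.

# Minimal sketch of a multi-input attention step (PyTorch).
# Assumptions (not from the paper): additive attention per unit stream
# and concatenation of per-stream context vectors.
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """Attention over a single language-unit encoder memory."""

    def __init__(self, query_dim, memory_dim, attn_dim):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_proj = nn.Linear(memory_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory):
        # query: (B, query_dim), memory: (B, T, memory_dim)
        scores = self.v(torch.tanh(
            self.query_proj(query).unsqueeze(1) + self.memory_proj(memory)
        )).squeeze(-1)                       # (B, T)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
        return context, weights


class MultiInputAttention(nn.Module):
    """Attends to phoneme, character, and subword memories in parallel,
    so no alignment among the unit sequences is required."""

    def __init__(self, query_dim, memory_dims, attn_dim):
        super().__init__()
        self.heads = nn.ModuleList(
            AdditiveAttention(query_dim, d, attn_dim) for d in memory_dims
        )

    def forward(self, query, memories):
        # One context vector per unit stream, combined by concatenation.
        contexts = [head(query, m)[0] for head, m in zip(self.heads, memories)]
        return torch.cat(contexts, dim=-1)   # fed to the decoder each step


# Example: one decoder step attends to three unaligned unit memories
# (note the unequal sequence lengths, which need no alignment).
attn = MultiInputAttention(query_dim=256, memory_dims=[512, 512, 512], attn_dim=128)
query = torch.zeros(2, 256)                                  # decoder state
memories = [torch.zeros(2, T, 512) for T in (40, 95, 60)]    # phn / char / subword
context = attn(query, memories)                              # (2, 1536)

Because each stream has its own attention weights, the phoneme, character, and subword sequences may have different lengths, which is what removes the need for the hard alignment required by previous multi-unit methods.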


DOI: 10.21437/Interspeech.2020-2347

Cite as: Aso, M., Takamichi, S., Saruwatari, H. (2020) End-to-End Text-to-Speech Synthesis with Unaligned Multiple Language Units Based on Attention. Proc. Interspeech 2020, 4009-4013, DOI: 10.21437/Interspeech.2020-2347.


@inproceedings{Aso2020,
  author={Masashi Aso and Shinnosuke Takamichi and Hiroshi Saruwatari},
  title={{End-to-End Text-to-Speech Synthesis with Unaligned Multiple Language Units Based on Attention}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={4009--4013},
  doi={10.21437/Interspeech.2020-2347},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2347}
}