Reformer-TTS: Neural Speech Synthesis with Reformer Network

Hyeong Rae Ihm, Joun Yeop Lee, Byoung Jin Choi, Sung Jun Cheon, Nam Soo Kim

Recent end-to-end text-to-speech (TTS) systems based on deep neural networks (DNNs) have shown state-of-the-art performance in speech synthesis. In particular, attention-based sequence-to-sequence models have successfully improved the quality of the alignment between text and spectrogram. Leveraging this improvement, speech synthesis using a Transformer network was reported to generate humanlike speech audio. However, such sequence-to-sequence models require intensive computing power and memory during training. The attention scores are calculated over the entire key sequence for every query, which increases memory usage. To mitigate this issue, we propose Reformer-TTS, a model using a Reformer network, which utilizes locality-sensitive hashing attention and a reversible residual network. As a result, we show that the Reformer network consumes almost half the memory of the Transformer, which leads to fast convergence when training an end-to-end TTS system. We demonstrate these advantages with memory usage measurements and with objective and subjective performance evaluations.
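
The two memory-saving ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a simplified sign-based random-projection hash in place of the Reformer's actual hashing scheme, shared query/key vectors, and scalar stand-ins `f` and `g` for the attention and feed-forward sublayers. All function names are hypothetical. The first part counts how many attention scores must be held when every query attends to every key versus only to keys in the same hash bucket. The second part shows that a reversible residual block lets its inputs be recomputed from its outputs, so intermediate activations need not be stored.

```python
import random


def full_attention_score_count(n_queries, n_keys):
    """Score entries when every query attends to every key."""
    return n_queries * n_keys


def lsh_bucket(vec, hyperplanes):
    """Bucket id built from the signs of random projections (simplified LSH)."""
    bucket = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket


def lsh_attention_score_count(vectors, hyperplanes):
    """Score entries when attention is restricted to same-bucket pairs."""
    sizes = {}
    for vec in vectors:
        b = lsh_bucket(vec, hyperplanes)
        sizes[b] = sizes.get(b, 0) + 1
    return sum(s * s for s in sizes.values())


def reversible_forward(x1, x2, f, g):
    """Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2


def reversible_inverse(y1, y2, f, g):
    """Recover the block inputs from its outputs instead of storing them."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2


# Illustrative sizes only (assumptions, not values from the paper).
rng = random.Random(0)
dim, seq_len, n_planes = 8, 64, 3
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

full_count = full_attention_score_count(seq_len, seq_len)
lsh_count = lsh_attention_score_count(vectors, planes)
print(f"full attention scores: {full_count}, bucketed scores: {lsh_count}")
```

Because the bucket sizes sum to the sequence length, the bucketed count can never exceed the full count, and it shrinks as the vectors spread over more buckets. The actual saving in a trained model depends on the hashing scheme, chunking, and number of hash rounds, none of which this sketch models.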

DOI: 10.21437/Interspeech.2020-2189

Cite as: Ihm, H.R., Lee, J.Y., Choi, B.J., Cheon, S.J., Kim, N.S. (2020) Reformer-TTS: Neural Speech Synthesis with Reformer Network. Proc. Interspeech 2020, 2012-2016, DOI: 10.21437/Interspeech.2020-2189.

  author={Hyeong Rae Ihm and Joun Yeop Lee and Byoung Jin Choi and Sung Jun Cheon and Nam Soo Kim},
  title={{Reformer-TTS: Neural Speech Synthesis with Reformer Network}},
  booktitle={Proc. Interspeech 2020},