MultiSpeech: Multi-Speaker Text to Speech with Transformer

Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin

Transformer-based text to speech (TTS) models (e.g., Transformer TTS [1], FastSpeech [2]) have shown advantages in training and inference efficiency over RNN-based models (e.g., Tacotron [3]) due to their parallel computation in training and/or inference. However, parallel computation makes it harder for the Transformer to learn the alignment between text and speech, a difficulty that is further magnified in the multi-speaker scenario with noisy data and diverse speakers, and that hinders the applicability of the Transformer to multi-speaker TTS. In this paper, we develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment: 1) a diagonal constraint on the weight matrix of encoder-decoder attention during both training and inference; 2) layer normalization on the phoneme embedding in the encoder to better preserve position information; 3) a bottleneck in the decoder pre-net to prevent copying between consecutive speech frames. Experiments on the VCTK and LibriTTS multi-speaker datasets demonstrate the effectiveness of MultiSpeech: 1) it synthesizes more robust and higher-quality multi-speaker voice than a naive Transformer-based TTS model; 2) with a MultiSpeech model as the teacher, we obtain a strong multi-speaker FastSpeech model with almost zero quality degradation while enjoying extremely fast inference speed.
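To illustrate the first technique, here is a minimal sketch of one common way to impose a diagonal constraint on encoder-decoder attention: a guided-attention-style penalty matrix that is near zero close to the diagonal (decoder step t aligned with encoder step s where t/T ≈ s/S) and near one far from it, multiplied element-wise with the attention weights to form an auxiliary loss. The function names and the sharpness parameter `g` are illustrative assumptions; the exact formulation in the paper may differ.

```python
import math

def diagonal_penalty(T, S, g=0.2):
    """T x S penalty matrix: ~0 near the diagonal (t/T ~ s/S), ~1 far away.
    T is the number of decoder (speech) steps, S the number of encoder
    (phoneme) steps; g controls how wide the low-penalty band is."""
    return [
        [1.0 - math.exp(-((s / S - t / T) ** 2) / (2.0 * g * g)) for s in range(S)]
        for t in range(T)
    ]

def diagonal_loss(attn, penalty):
    """Mean element-wise product of attention weights and penalty.
    Attention mass far from the diagonal is penalized, which encourages
    a monotonic, near-diagonal text-to-speech alignment."""
    T, S = len(attn), len(attn[0])
    return sum(attn[t][s] * penalty[t][s]
               for t in range(T) for s in range(S)) / (T * S)

# A perfectly diagonal attention matrix incurs a lower loss than a
# uniform (unaligned) one.
P = diagonal_penalty(4, 4)
diag_attn = [[1.0 if s == t else 0.0 for s in range(4)] for t in range(4)]
unif_attn = [[0.25] * 4 for _ in range(4)]
```

In training this term would be added to the reconstruction loss; at inference the same idea can be applied by masking attention weights outside the near-diagonal band.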

DOI: 10.21437/Interspeech.2020-3139

Cite as: Chen, M., Tan, X., Ren, Y., Xu, J., Sun, H., Zhao, S., Qin, T. (2020) MultiSpeech: Multi-Speaker Text to Speech with Transformer. Proc. Interspeech 2020, 4024-4028, DOI: 10.21437/Interspeech.2020-3139.

@inproceedings{chen2020multispeech,
  author={Mingjian Chen and Xu Tan and Yi Ren and Jin Xu and Hao Sun and Sheng Zhao and Tao Qin},
  title={{MultiSpeech: Multi-Speaker Text to Speech with Transformer}},
  booktitle={Proc. Interspeech 2020},
  year={2020},
  pages={4024--4028},
  doi={10.21437/Interspeech.2020-3139}
}