Google’s Next-Generation Real-Time Unit-Selection Synthesizer Using Sequence-to-Sequence LSTM-Based Autoencoders

Vincent Wan, Yannis Agiomyrgiannakis, Hanna Silen, Jakub Vít


A neural network model that significantly improves unit-selection-based Text-To-Speech synthesis is presented. The model employs a sequence-to-sequence LSTM-based autoencoder that compresses the acoustic and linguistic features of each unit to a fixed-size vector referred to as an embedding. Unit selection is facilitated by formulating the target cost as an L2 distance in the embedding space. In open-domain speech synthesis the method achieves a 0.2-point improvement in mean opinion score (MOS), while for limited-domain synthesis it reaches the cap of 4.5 MOS. Furthermore, the new TTS system halves the gap in quality between the previous unit-selection system and WaveNet while retaining low computational cost and latency.
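The target-cost formulation described above can be sketched in a few lines of NumPy. This is a hedged illustration only: the random vectors below stand in for embeddings that, in the paper, are produced by the sequence-to-sequence LSTM autoencoder from each unit's acoustic and linguistic features; the dimensions and the `target_cost` helper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64    # fixed embedding size (illustrative value)
NUM_UNITS = 1000  # candidate units in the database (illustrative value)

# Stand-in unit embeddings; the paper derives these from an LSTM autoencoder.
unit_embeddings = rng.standard_normal((NUM_UNITS, EMBED_DIM))

def target_cost(target_emb, unit_embs):
    """L2 distance between a target embedding and every candidate unit."""
    return np.linalg.norm(unit_embs - target_emb, axis=1)

# A stand-in target embedding for one unit position in the utterance.
target_emb = rng.standard_normal(EMBED_DIM)
costs = target_cost(target_emb, unit_embeddings)
best_unit = int(np.argmin(costs))  # candidate with the lowest target cost
```

In a full unit-selection system this target cost would be combined with a join cost inside a Viterbi-style search over candidate sequences; only the target-cost term is shown here.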


DOI: 10.21437/Interspeech.2017-1107

Cite as: Wan, V., Agiomyrgiannakis, Y., Silen, H., Vít, J. (2017) Google’s Next-Generation Real-Time Unit-Selection Synthesizer Using Sequence-to-Sequence LSTM-Based Autoencoders. Proc. Interspeech 2017, 1143-1147, DOI: 10.21437/Interspeech.2017-1107.


@inproceedings{Wan2017,
  author={Vincent Wan and Yannis Agiomyrgiannakis and Hanna Silen and Jakub Vít},
  title={Google’s Next-Generation Real-Time Unit-Selection Synthesizer Using Sequence-to-Sequence LSTM-Based Autoencoders},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1143--1147},
  doi={10.21437/Interspeech.2017-1107},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1107}
}