Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS

Alexander Sorin, Slava Shechtman, Ron Hoory


We propose a novel semi-supervised technique that enables expressive style control and cross-speaker transfer in neural text-to-speech (TTS) when the available training data contains only a limited amount of labeled expressive speech from a single speaker. The technique is based on unsupervised learning of a style-related latent space, generated by a previously proposed reference audio encoding technique, and on transforming that space by means of Principal Component Analysis (PCA) into another, low-dimensional space. The latter space represents style information in a purified form, disentangled from text and speaker-related information. Encodings for the expressive styles that are present in the training data are easily constructed in this space. Furthermore, the technique provides control over speech rate, pitch level, and articulation type, which can be used for TTS voice transformation.

We present the results of subjective crowd evaluations confirming that the synthesized speech convincingly conveys the desired expressive styles and preserves a high level of quality.
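The PCA step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are synthetic stand-ins for reference-encoder outputs, the function names (`fit_pca`, `project`, `reconstruct`, `style_centroids`) are hypothetical, and building a style encoding as the per-label centroid in the PCA space is an assumption, since the abstract does not say how the encodings are constructed.

```python
import numpy as np

def fit_pca(embeddings, n_components):
    """Fit PCA via SVD; return (mean, components) with components of shape (k, dim)."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(embeddings, mean, components):
    """Map latent vectors to low-dimensional principal-component coordinates."""
    return (embeddings - mean) @ components.T

def reconstruct(coords, mean, components):
    """Map principal-component coordinates back to the original latent space."""
    return coords @ components + mean

def style_centroids(coords, labels):
    """Assumed style encoding: mean PCA coordinates of the utterances with each label."""
    return {lab: coords[labels == lab].mean(axis=0) for lab in np.unique(labels)}

# Synthetic stand-in for style embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
dim, n_per_style = 16, 200
labels = np.repeat(np.array(["neutral", "happy"]), n_per_style)
embeddings = rng.normal(size=(2 * n_per_style, dim))
offset = np.zeros(dim)
offset[0] = 5.0  # the "happy" utterances are shifted along one latent axis
embeddings[labels == "happy"] += offset

mean, components = fit_pca(embeddings, n_components=4)
coords = project(embeddings, mean, components)
centroids = style_centroids(coords, labels)

# A vector in the original latent space that could condition a synthesizer
# on the "happy" style (this conditioning step is an assumption).
style_vector = reconstruct(centroids["happy"], mean, components)
```

In this sketch, moving along a single principal axis in `coords` before calling `reconstruct` would correspond to adjusting one style attribute in isolation. Which axes relate to speech rate, pitch level, or articulation type depends on the trained model and is not something the sketch can determine.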


DOI: 10.21437/Interspeech.2020-1854

Cite as: Sorin, A., Shechtman, S., Hoory, R. (2020) Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS. Proc. Interspeech 2020, 3411-3415, DOI: 10.21437/Interspeech.2020-1854.


@inproceedings{Sorin2020,
  author={Alexander Sorin and Slava Shechtman and Ron Hoory},
  title={{Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3411--3415},
  doi={10.21437/Interspeech.2020-1854},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1854}
}