Bi-Level Speaker Supervision for One-Shot Speech Synthesis

Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Chunyu Qiang

The gap between the speaker characteristics of reference speech and synthesized speech remains a challenging problem in one-shot speech synthesis. In this paper, we propose a bi-level speaker supervision framework that closes this gap by supervising the synthesized speech at both the speaker feature level and the speaker identity level. Speaker feature extraction and speaker identity reconstruction are integrated into an end-to-end speech synthesis network: the feature-level supervision closes the gap in speaker characteristics, while the identity-level supervision preserves identity information. This framework guarantees that the synthesized speech has speaker characteristics similar to the original speech, and it also ensures distinguishability between different speakers. Additionally, to mitigate the influence of speech content on the speaker feature extraction task, we propose a text-independent reference encoder (ti-reference encoder) module for extracting speaker features. Experiments on the LibriTTS dataset show that our model is able to generate speech similar to the target speaker. Furthermore, we demonstrate that the model learns meaningful speaker representations through bi-level speaker supervision and the ti-reference encoder module.

 DOI: 10.21437/Interspeech.2020-1737

Cite as: Wang, T., Tao, J., Fu, R., Yi, J., Wen, Z., Qiang, C. (2020) Bi-Level Speaker Supervision for One-Shot Speech Synthesis. Proc. Interspeech 2020, 3989-3993, DOI: 10.21437/Interspeech.2020-1737.

@inproceedings{wang2020bilevel,
  author={Tao Wang and Jianhua Tao and Ruibo Fu and Jiangyan Yi and Zhengqi Wen and Chunyu Qiang},
  title={{Bi-Level Speaker Supervision for One-Shot Speech Synthesis}},
  booktitle={Proc. Interspeech 2020},
  pages={3989--3993},
  doi={10.21437/Interspeech.2020-1737},
  year={2020}
}