Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation

Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Rongxiu Zhong

The low similarity and naturalness of synthesized speech remain a challenging problem for speaker adaptation with few resources. Because the acoustic model is too complex to interpret, it overfits when trained on little data. To prevent overfitting, this paper proposes a novel speaker adaptation framework that decomposes the parameter space of the end-to-end acoustic model into two parts: one predicting spoken content and the other modeling the speaker’s voice. Spoken content is represented by a phone posteriorgram (PPG), which is speaker independent. By adapting the two sub-modules separately, overfitting can be alleviated effectively. Moreover, we propose two different adaptation strategies, depending on whether the adaptation data has text annotations; in this way, speaker adaptation can also be performed without text annotations. Experimental results confirm the adaptability of our proposed method of factorizing spoken content and voice. Listening tests demonstrate that, with just 10 sentences, the proposed method outperforms speaker adaptation conducted on Tacotron in terms of both naturalness and speaker similarity.
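The factorization idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the module names, shapes, and the adaptation policy (freeze the speaker-independent content module when no transcripts are available, since PPGs can then be extracted directly from the adaptation audio by a recognizer) are one plausible reading of the abstract, written here as toy code.

```python
# Toy sketch (hypothetical, not the paper's code) of splitting an acoustic
# model's parameters into a speaker-independent content module and a
# speaker-dependent voice module, then choosing which part to adapt.
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """A single affine layer standing in for each sub-module's network."""
    def __init__(self, d_in, d_out):
        self.W = rng.standard_normal((d_in, d_out)) * 0.01
        self.trainable = True

    def __call__(self, x):
        return x @ self.W

content_module = Linear(8, 16)   # linguistic features -> PPG (speaker independent)
voice_module = Linear(16, 80)    # PPG -> acoustic features (speaker's voice)

def modules_to_adapt(has_text_annotation):
    # With transcripts, both sub-modules can be fine-tuned on the new
    # speaker; without them, only the voice module is adapted and the
    # content module stays frozen.
    content_module.trainable = has_text_annotation
    voice_module.trainable = True
    return [m for m in (content_module, voice_module) if m.trainable]
```

A full forward pass chains the two parts, `voice_module(content_module(x))`, so adapting only `voice_module` leaves the content representation untouched.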

DOI: 10.21437/Interspeech.2020-1745

Cite as: Wang, T., Tao, J., Fu, R., Yi, J., Wen, Z., Zhong, R. (2020) Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation. Proc. Interspeech 2020, 796-800, DOI: 10.21437/Interspeech.2020-1745.

@inproceedings{wang20_interspeech,
  author={Tao Wang and Jianhua Tao and Ruibo Fu and Jiangyan Yi and Zhengqi Wen and Rongxiu Zhong},
  title={{Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={796--800},
  doi={10.21437/Interspeech.2020-1745}
}