Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis

Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Tao Wang, Chunyu Qiang


End-to-end speech synthesis can reach high quality and naturalness with low-resource adaptation data. However, generalizing to out-of-domain texts and modeling speaker representations accurately remain challenging. Limited adaptation data leads to unacceptable errors and low similarity in the synthetic speech. In this paper, both the speaker representation modeling and the acoustic model structure are improved for the speaker adaptation task. On the one hand, in contrast to conventional methods that use fixed global speaker representations, an attention gating mechanism is proposed to adjust speaker representations dynamically based on the attended context and prosody information, which can describe more pronunciation characteristics at the phoneme level. On the other hand, to improve robustness and avoid over-fitting, the decoder is factored into an average-net and an adaptation-net, which are designed to learn speaker-independent acoustic features and to imitate the target speaker's timbre, respectively. In addition, a context discriminator is pre-trained on large ASR data to supervise the average-net so that it generates proper speaker-independent acoustic features for different phonemes. Experimental results on a Mandarin dataset show that the proposed methods improve intelligibility, naturalness, and similarity.
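The abstract does not give the gating equations, so the following is only a minimal sketch of one common way such a gate could work, not the authors' formulation. It assumes a sigmoid gate computed from the concatenated context and prosody vectors and applied element-wise to a fixed speaker embedding. The function name `gate_speaker_embedding`, the parameters `weights` and `bias`, and all numeric values are hypothetical and chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_speaker_embedding(speaker, context, prosody, weights, bias):
    """Scale each speaker-embedding dimension by a gate in (0, 1).

    Assumed form (not taken from the paper):
        gate = sigmoid(W @ [context; prosody] + b)
        adjusted = gate * speaker   (element-wise)
    """
    features = context + prosody  # list concatenation of the two vectors
    adjusted = []
    for i, s in enumerate(speaker):
        z = bias[i] + sum(w * f for w, f in zip(weights[i], features))
        adjusted.append(sigmoid(z) * s)
    return adjusted


# Hypothetical toy values: a 3-dim speaker embedding, 2-dim context, 2-dim prosody.
speaker = [0.5, -1.0, 2.0]
weights = [
    [0.1, -0.2, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # zero weights: this gate is always sigmoid(0) = 0.5
    [1.0, 1.0, 1.0, 1.0],
]
bias = [0.0, 0.0, 0.0]

# The same global speaker embedding yields different adjusted vectors
# when the attended context and prosody differ between decoder steps.
frame_a = gate_speaker_embedding(speaker, [1.0, 0.0], [0.5, 0.5], weights, bias)
frame_b = gate_speaker_embedding(speaker, [0.0, 1.0], [-0.5, 0.2], weights, bias)
```

The point of the sketch is the contrast with a fixed global representation: the speaker vector itself is unchanged, but what the decoder receives varies per step with the context and prosody inputs.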


DOI: 10.21437/Interspeech.2020-1623

Cite as: Fu, R., Tao, J., Wen, Z., Yi, J., Wang, T., Qiang, C. (2020) Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis. Proc. Interspeech 2020, 4701-4705, DOI: 10.21437/Interspeech.2020-1623.


@inproceedings{Fu2020,
  author={Ruibo Fu and Jianhua Tao and Zhengqi Wen and Jiangyan Yi and Tao Wang and Chunyu Qiang},
  title={{Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4701--4705},
  doi={10.21437/Interspeech.2020-1623},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1623}
}