Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space

Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari


We present a method for improving the performance of cross-lingual text-to-speech synthesis. Previous works can model speaker individuality in a speaker space via a speaker encoder, but their performance degrades when synthesizing cross-lingual speech. This is because the speaker space formed by all speaker embeddings is completely language-dependent. To construct a language-independent speaker space, we regard cross-lingual speech synthesis as a domain adaptation problem and propose a training method that lets the speaker encoder adapt speaker embeddings of different languages into the same space. Furthermore, to improve speaker individuality and construct a human-interpretable speaker space, we propose a regression method that constructs a perceptually correlated speaker space. Experimental results demonstrate that our method not only improves the performance of both cross-lingual and intra-lingual speech synthesis but also finds perceptually similar speakers across languages.


DOI: 10.21437/Interspeech.2020-2070

Cite as: Xin, D., Saito, Y., Takamichi, S., Koriyama, T., Saruwatari, H. (2020) Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space. Proc. Interspeech 2020, 2947-2951, DOI: 10.21437/Interspeech.2020-2070.


@inproceedings{Xin2020,
  author={Detai Xin and Yuki Saito and Shinnosuke Takamichi and Tomoki Koriyama and Hiroshi Saruwatari},
  title={{Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2947--2951},
  doi={10.21437/Interspeech.2020-2070},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2070}
}