Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image

Shunsuke Goto, Kotaro Onishi, Yuki Saito, Kentaro Tachibana, Koichiro Mori


Humans can readily imagine the voice characteristics of a speaker from his or her appearance, especially the face. In this paper, we propose Face2Speech, a framework that generates speech whose characteristics are predicted from a face image. The framework consists of three separately trained modules: a speech encoder, a multi-speaker text-to-speech (TTS) model, and a face encoder. The speech encoder outputs a speaker embedding vector that distinguishes one speaker from others, the multi-speaker TTS model synthesizes speech conditioned on this embedding vector, and the face encoder predicts the speaker embedding vector from the speaker's face image. Experimental results of matching and naturalness tests demonstrate that synthetic speech generated with the face-derived embedding vector is comparable to speech generated with the speech-derived embedding vector.
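To make the data flow described above concrete (face image → speaker embedding → multi-speaker TTS → mel-spectrogram), here is a minimal, hypothetical PyTorch sketch of how the three modules could be wired together at inference time. It is not the authors' implementation: the module names (FaceEncoder, MultiSpeakerTTS), layer sizes, GRU decoder, and dummy tensors are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class FaceEncoder(nn.Module):
    """Illustrative face encoder: maps a face image to a speaker embedding."""

    def __init__(self, embedding_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embedding_dim)

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        h = self.conv(face).flatten(1)
        # L2-normalize so the face-derived embedding lies in the same space
        # as the speech-derived embedding it is trained to approximate.
        return nn.functional.normalize(self.proj(h), dim=-1)


class MultiSpeakerTTS(nn.Module):
    """Illustrative stand-in for a multi-speaker TTS model conditioned on a speaker embedding."""

    def __init__(self, vocab_size: int = 100, embedding_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, embedding_dim)
        self.decoder = nn.GRU(embedding_dim * 2, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, text_ids: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        t = self.text_emb(text_ids)                              # (B, T, D)
        s = speaker_emb.unsqueeze(1).expand(-1, t.size(1), -1)   # broadcast embedding per frame
        out, _ = self.decoder(torch.cat([t, s], dim=-1))
        return self.to_mel(out)                                  # (B, T, n_mels) mel-spectrogram


# Inference: the face encoder supplies the embedding that the speech encoder
# would produce at training time; the TTS model is conditioned on it.
face_encoder, tts = FaceEncoder(), MultiSpeakerTTS()
face = torch.randn(1, 3, 128, 128)          # dummy face image tensor
text_ids = torch.randint(0, 100, (1, 50))   # dummy tokenized input text
mel = tts(text_ids, face_encoder(face))     # mel-spectrogram to be passed to a vocoder
print(mel.shape)                            # torch.Size([1, 50, 80])
```

In the paper's setup the speech encoder is used only during training to provide target embeddings; at synthesis time the face encoder replaces it, which is the substitution the sketch illustrates.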


 DOI: 10.21437/Interspeech.2020-2136

Cite as: Goto, S., Onishi, K., Saito, Y., Tachibana, K., Mori, K. (2020) Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image. Proc. Interspeech 2020, 1321-1325, DOI: 10.21437/Interspeech.2020-2136.


@inproceedings{Goto2020,
  author={Shunsuke Goto and Kotaro Onishi and Yuki Saito and Kentaro Tachibana and Koichiro Mori},
  title={{Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1321--1325},
  doi={10.21437/Interspeech.2020-2136},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2136}
}