Speech-to-Singing Conversion Based on Boundary Equilibrium GAN

Da-Yi Wu, Yi-Hsuan Yang


This paper investigates the use of generative adversarial network (GAN)-based models for converting a speech signal into a singing one, without reference to the phoneme sequence underlying the speech. This is achieved by viewing speech-to-singing conversion as a style transfer problem. Specifically, given a speech input and the F0 contour of the target singing output, the proposed model generates the spectrogram of a singing signal with a progressive-growing encoder/decoder architecture. Moreover, the model uses a boundary equilibrium GAN (BEGAN) loss term so that it can learn from both paired and unpaired data. The spectrogram is finally converted into a waveform with a separate GAN-based vocoder. Our quantitative and qualitative analyses show that the proposed model generates singing voices with much higher naturalness than an existing, non-adversarially trained baseline.
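For context, the boundary equilibrium GAN loss referred to in the abstract is, in its standard formulation (Berthelot et al., 2017), built around an autoencoder discriminator $D$ with per-sample reconstruction loss $\mathcal{L}(v) = |v - D(v)|$. A sketch of that standard objective (the paper's exact loss terms may differ):

```latex
\begin{aligned}
\mathcal{L}_D &= \mathcal{L}(x) - k_t\, \mathcal{L}(G(z)) \\
\mathcal{L}_G &= \mathcal{L}(G(z)) \\
k_{t+1} &= k_t + \lambda_k \bigl( \gamma\, \mathcal{L}(x) - \mathcal{L}(G(z)) \bigr)
\end{aligned}
```

Here $x$ is a real sample, $G(z)$ a generated one, $\gamma$ trades off diversity against quality, and the control variable $k_t$ maintains the boundary equilibrium $\gamma\, \mathbb{E}[\mathcal{L}(x)] = \mathbb{E}[\mathcal{L}(G(z))]$ during training.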


DOI: 10.21437/Interspeech.2020-1984

Cite as: Wu, D., Yang, Y. (2020) Speech-to-Singing Conversion Based on Boundary Equilibrium GAN. Proc. Interspeech 2020, 1316-1320, DOI: 10.21437/Interspeech.2020-1984.


@inproceedings{Wu2020,
  author={Da-Yi Wu and Yi-Hsuan Yang},
  title={{Speech-to-Singing Conversion Based on Boundary Equilibrium GAN}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1316--1320},
  doi={10.21437/Interspeech.2020-1984},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1984}
}