Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling

Patrick Lumban Tobing, Tomoki Hayashi, Yi-Chiao Wu, Kazuhiro Kobayashi, Tomoki Toda


We present a novel cyclic spectral modeling approach that extends unsupervised speech unit discovery to voice conversion with excitation and waveform modeling. Specifically, we propose two spectral modeling techniques: 1) cyclic vector-quantized autoencoder (CycleVQVAE), and 2) cyclic variational autoencoder (CycleVAE). CycleVQVAE represents the speech units in a discrete latent space, whereas CycleVAE uses a continuous latent space. The cyclic structure combines two flows of spectral features: the reconstruction flow and the cyclic reconstruction flow, where the latter is obtained by feeding the converted spectral features back into the model. This structure is intended to yield a speaker-independent latent space, because training marginalizes over all possible speaker conversion pairs. The speaker-dependent space, on the other hand, is conditioned on a one-hot speaker code. Excitation is modeled separately from the spectrum in CycleVQVAE and jointly with it in CycleVAE. The speech waveform is generated with WaveNet-based waveform modeling. The proposed framework was entered in the ZeroSpeech Challenge 2020 and, on the 2019 voice conversion task, achieves a character error rate of 0.21, a speaker similarity score of 3.91, and a mean opinion score of 3.84 for the naturalness of the converted speech.
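The two flows described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear encoder and decoder, the class `ToyCyclicAutoencoder`, the helper `pairwise_average_loss`, the squared-error loss, and all dimensions are hypothetical stand-ins, and the sketch omits vector quantization, the variational sampling and its regularization term, excitation modeling, and the WaveNet vocoder. It shows only how a converted feature vector is recycled through the encoder and decoded back with the source speaker code.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained network parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def one_hot(index, size):
    """One-hot speaker code, as mentioned in the abstract."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def mse(a, b):
    """Mean squared error (assumed loss, chosen only for illustration)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class ToyCyclicAutoencoder:
    """Hypothetical linear autoencoder with a speaker-conditioned decoder."""

    def __init__(self, feat_dim, latent_dim, n_speakers, seed=0):
        rng = random.Random(seed)
        self.n_speakers = n_speakers
        self.W_enc = rand_matrix(latent_dim, feat_dim, rng)
        self.W_dec = rand_matrix(feat_dim, latent_dim + n_speakers, rng)

    def encode(self, x):
        # The encoder sees only spectral features, never the speaker code.
        return matvec(self.W_enc, x)

    def decode(self, z, speaker):
        # The decoder is conditioned on the latent plus a one-hot speaker code.
        return matvec(self.W_dec, z + one_hot(speaker, self.n_speakers))

    def flow_losses(self, x, src, tgt):
        z = self.encode(x)
        x_rec = self.decode(z, src)       # reconstruction flow
        x_conv = self.decode(z, tgt)      # converted features (target speaker)
        z_cyc = self.encode(x_conv)       # recycle the converted features
        x_cyc = self.decode(z_cyc, src)   # cyclic reconstruction flow
        return mse(x, x_rec), mse(x, x_cyc)


def pairwise_average_loss(model, x, src):
    """Average the cyclic term over every other speaker as conversion target."""
    targets = [t for t in range(model.n_speakers) if t != src]
    losses = [model.flow_losses(x, src, t) for t in targets]
    rec = losses[0][0]  # the reconstruction term does not depend on the target
    cyc = sum(c for _, c in losses) / len(losses)
    return rec + cyc


model = ToyCyclicAutoencoder(feat_dim=4, latent_dim=2, n_speakers=3)
frame = [0.1, -0.2, 0.3, 0.4]  # made-up spectral feature vector
print(pairwise_average_loss(model, frame, src=0))
```

Averaging the cyclic term over every target speaker is one simple way to approximate the marginalization over conversion pairs that the abstract mentions; the weighting actually used in training may differ.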


 DOI: 10.21437/Interspeech.2020-2559

Cite as: Tobing, P.L., Hayashi, T., Wu, Y., Kobayashi, K., Toda, T. (2020) Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling. Proc. Interspeech 2020, 4861-4865, DOI: 10.21437/Interspeech.2020-2559.


@inproceedings{Tobing2020,
  author={Patrick Lumban Tobing and Tomoki Hayashi and Yi-Chiao Wu and Kazuhiro Kobayashi and Tomoki Toda},
  title={{Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4861--4865},
  doi={10.21437/Interspeech.2020-2559},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2559}
}