Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams

Guanlong Zhao, Shaojin Ding, Ricardo Gutierrez-Osuna

Methods for foreign accent conversion (FAC) aim to generate speech that sounds similar to a given non-native speaker but with the accent of a native speaker. Conventional FAC methods borrow excitation information (F0 and aperiodicity, produced by a conventional vocoder) from a reference (i.e., native) utterance at synthesis time. As such, the generated speech retains some aspects of the voice quality of the native speaker. We present a framework for FAC that eliminates the need for conventional vocoders (e.g., STRAIGHT, World) and therefore the need to use the native speaker’s excitation. Our approach uses an acoustic model trained on a native speech corpus to extract speaker-independent phonetic posteriorgrams (PPGs), and then trains a speech synthesizer to map PPGs from the non-native speaker into the corresponding spectral features, which in turn are converted into the audio waveform using a high-quality neural vocoder. At runtime, we drive the synthesizer with the PPG extracted from a native reference utterance. Listening tests show that the proposed system produces speech that sounds clearer, more natural, and more similar to the non-native speaker than a baseline system, while significantly reducing the perceived foreign accent of non-native utterances.
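The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the acoustic model is replaced by a hypothetical `toy_ppg` function that builds near one-hot posteriors from known phone labels, the neural speech synthesizer is replaced by a least-squares linear map (`fit_synthesizer`), the neural vocoder stage is omitted, and all dimensions and data are invented for illustration. It only shows that a PPG-to-spectrum mapping trained on the non-native speaker, when driven by a native PPG, yields the native phonetic content rendered with the non-native speaker's spectral features.

```python
import numpy as np

N_PHONES = 4  # toy number of phonetic classes (assumption)
N_MEL = 3     # toy spectral feature dimension (assumption)


def toy_ppg(phone_ids, n_phones=N_PHONES, sharpness=0.9):
    """Hypothetical stand-in for the acoustic model.

    Returns one posterior vector per frame: `sharpness` on the labelled
    phone, with the remaining probability mass spread over the others.
    """
    n_frames = len(phone_ids)
    ppg = np.full((n_frames, n_phones), (1.0 - sharpness) / (n_phones - 1))
    ppg[np.arange(n_frames), phone_ids] = sharpness
    return ppg


def fit_synthesizer(ppg, spectra):
    """Hypothetical stand-in for the speech synthesizer.

    Fits a linear map from PPG frames to spectral frames by least squares;
    the paper trains a neural synthesizer instead.
    """
    weights, *_ = np.linalg.lstsq(ppg, spectra, rcond=None)
    return weights


def convert(reference_ppg, weights):
    """Drive the trained synthesizer with a (native) reference PPG."""
    return reference_ppg @ weights


# Training: PPGs and spectral features both come from the non-native speaker.
rng = np.random.default_rng(0)
speaker_templates = rng.normal(size=(N_PHONES, N_MEL))  # invented per-phone spectra
l2_phones = np.array([0, 1, 2, 3, 2, 1, 0, 3])
l2_ppg = toy_ppg(l2_phones)
l2_spectra = speaker_templates[l2_phones]
weights = fit_synthesizer(l2_ppg, l2_spectra)

# Runtime: the PPG comes from a native reference utterance.
native_phones = np.array([3, 3, 0, 2, 1])
native_ppg = toy_ppg(native_phones)
converted_spectra = convert(native_ppg, weights)
# A neural vocoder would turn converted_spectra into a waveform (omitted here).
```

In this toy setting the converted frames follow the native phone sequence but take their spectral values from the non-native speaker's templates, which mirrors the separation between phonetic content and speaker identity that the PPG representation is meant to provide.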

DOI: 10.21437/Interspeech.2019-1778

Cite as: Zhao, G., Ding, S., Gutierrez-Osuna, R. (2019) Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams. Proc. Interspeech 2019, 2843-2847, DOI: 10.21437/Interspeech.2019-1778.

  author={Guanlong Zhao and Shaojin Ding and Ricardo Gutierrez-Osuna},
  title={{Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams}},
  booktitle={Proc. Interspeech 2019},