Can Auditory Nerve Models Tell us What’s Different About WaveNet Vocoded Speech?

Sébastien Le Maguer, Naomi Harte


Nowadays, synthetic speech is almost indistinguishable from human speech. The remarkable quality is mainly due to the displacement of signal-processing-based vocoders in favour of neural vocoders and, in particular, the WaveNet architecture. At the same time, speech synthesis evaluation is still struggling to adapt to these improvements. The difficulties are even more prevalent for objective evaluation methodologies, which do not correlate well with human perception. Yet an often forgotten use of objective evaluation is to uncover prominent differences between speech signals. Such differences are crucial to decipher the improvement introduced by the use of WaveNet. Abandoning objective evaluation could therefore be a serious mistake. In this paper, we analyze vocoded synthetic speech re-rendered using WaveNet, comparing it to standard vocoded speech. To do so, we objectively compare spectrograms and neurograms, the latter being the output of auditory nerve (AN) models. The spectrograms allow us to look at the speech production side, while the neurograms relate to the speech perception path. While we were not yet able to pinpoint how WaveNet and WORLD differ, our results suggest that the Mean-Rate (MR) neurograms in particular warrant further investigation.


 DOI: 10.21437/Interspeech.2020-2596

Cite as: Le Maguer, S., Harte, N. (2020) Can Auditory Nerve Models Tell us What’s Different About WaveNet Vocoded Speech?. Proc. Interspeech 2020, 230-234, DOI: 10.21437/Interspeech.2020-2596.


@inproceedings{Maguer2020,
  author={S\'ebastien Le Maguer and Naomi Harte},
  title={{Can Auditory Nerve Models Tell us What’s Different About WaveNet Vocoded Speech?}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={230--234},
  doi={10.21437/Interspeech.2020-2596},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2596}
}