Perception of Concatenative vs. Neural Text-To-Speech (TTS): Differences in Intelligibility in Noise and Language Attitudes

Michelle Cohn, Georgia Zellou


This study tests speech-in-noise perception and social ratings of speech produced by different text-to-speech (TTS) synthesis methods. We used identical speaker training datasets for a set of 4 voices (using AWS Polly TTS), generated using neural and concatenative TTS. In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences produced with concatenative and neural TTS at two noise levels (-3 dB and -6 dB SNR). Correct word identification was lower for neural TTS than for concatenative TTS, at the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on 4 social attributes. Neural TTS was rated as more human-like, natural, likeable, and familiar than concatenative TTS. Furthermore, listeners' naturalness ratings of the neural TTS voices were positively related to their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech, and that these patterns are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception.


DOI: 10.21437/Interspeech.2020-1336

Cite as: Cohn, M., Zellou, G. (2020) Perception of Concatenative vs. Neural Text-To-Speech (TTS): Differences in Intelligibility in Noise and Language Attitudes. Proc. Interspeech 2020, 1733-1737, DOI: 10.21437/Interspeech.2020-1336.


@inproceedings{Cohn2020,
  author={Michelle Cohn and Georgia Zellou},
  title={{Perception of Concatenative vs. Neural Text-To-Speech (TTS): Differences in Intelligibility in Noise and Language Attitudes}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={1733--1737},
  doi={10.21437/Interspeech.2020-1336},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1336}
}