Differences in Gradient Emotion Perception: Human vs. Alexa Voices

Michelle Cohn, Eran Raveh, Kristin Predeck, Iona Gessinger, Bernd Möbius, Georgia Zellou

The present study compares how individuals perceive gradient acoustic realizations of emotion produced by a human voice versus an Amazon Alexa text-to-speech (TTS) voice. Using identical emotional synthesis methods, we manipulated semantically neutral sentences spoken by both talkers at three levels of increasing ‘happiness’ (0%, 33%, 66% ‘happier’). On each trial, listeners (native speakers of American English, n=99) rated a given sentence on two scales to assess dimensions of emotion: valence (negative-positive) and arousal (calm-excited). Participants also rated the Alexa voice on several parameters to assess anthropomorphism (e.g., naturalness, human-likeness). Results showed that the emotion manipulations led to increases in perceived positive valence and excitement. Yet, the effect differed by interlocutor: increasing ‘happiness’ manipulations led to larger changes for the human voice than for the Alexa voice. Additionally, we observed individual differences in perceived valence/arousal based on participants’ anthropomorphism scores. Overall, this line of research can speak to theories of computer personification and elucidate our changing relationship with voice-AI technology.

DOI: 10.21437/Interspeech.2020-1938

Cite as: Cohn, M., Raveh, E., Predeck, K., Gessinger, I., Möbius, B., Zellou, G. (2020) Differences in Gradient Emotion Perception: Human vs. Alexa Voices. Proc. Interspeech 2020, 1818-1822, DOI: 10.21437/Interspeech.2020-1938.

@inproceedings{cohn20_interspeech,
  author={Michelle Cohn and Eran Raveh and Kristin Predeck and Iona Gessinger and Bernd Möbius and Georgia Zellou},
  title={{Differences in Gradient Emotion Perception: Human vs. Alexa Voices}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={1818--1822},
  doi={10.21437/Interspeech.2020-1938}
}