Speech Prosody 2010

Chicago, IL, USA
May 10-14, 2010

Prediction and Realisation of Conversational Characteristics by Utilising Spontaneous Speech for Unit Selection

Sebastian Andersson (1), Kallirroi Georgila (2), David Traum (2), Matthew Aylett (1,3), Robert A. J. Clark (1)

(1) The Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK
(2) Institute for Creative Technologies, University of Southern California, Los Angeles, USA
(3) CereProc Ltd, Edinburgh, UK

Unit selection speech synthesis has reached high levels of naturalness and intelligibility for neutral, read-aloud speech. However, synthetic speech generated from neutral read-aloud data lacks the attitude, intention and spontaneity associated with everyday conversation. Because unit selection is heavily data dependent, simulating human conversational speech, or creating synthetic voices for believable virtual characters, requires speech data with examples of how people talk rather than how they read. In this paper, we include carefully selected utterances from spontaneous conversational speech in a unit selection voice. Using this voice, and by automatically predicting the type and placement of lexical fillers and filled pauses, we can synthesise utterances with conversational characteristics. A perceptual listening test showed that it is possible to make synthetic speech sound more conversational without degrading naturalness.
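The abstract does not say how the type and placement of fillers are predicted. The sketch below is only a rough illustration and makes several assumptions. It uses a simple frequency model trained on transcripts in which fillers are marked as `<um>` or `<you_know>`, a hypothetical annotation convention. For each word, the model counts which filler, if any, follows it. At prediction time it inserts the most frequent filler after a word when that filler's relative frequency reaches a threshold. The function names, the annotation format, the threshold and the toy transcripts are all invented for this illustration. This is not the authors' model. The output text would still need a voice that contains suitable spontaneous-speech units to be realised.

```python
from collections import Counter, defaultdict

def train_filler_model(transcripts):
    """Count, for each preceding word, which filler (or None) follows it.

    Assumes a hypothetical annotation in which fillers appear as single
    tokens in angle brackets, e.g. "<um>" or "<you_know>".
    """
    counts = defaultdict(Counter)
    for line in transcripts:
        prev, filled = "<s>", False  # "<s>" marks the utterance start
        for tok in line.lower().split():
            if tok.startswith("<") and tok.endswith(">"):
                if not filled:  # record only the first filler at a boundary
                    counts[prev][tok[1:-1].replace("_", " ")] += 1
                    filled = True
            else:
                if not filled:  # a word followed prev with no filler between
                    counts[prev][None] += 1
                prev, filled = tok, False
    return counts

def insert_fillers(sentence, counts, threshold=0.5):
    """Insert the most frequent filler after a word if its share >= threshold."""
    out, prev = [], "<s>"
    for tok in sentence.split():
        dist = counts.get(prev.lower())
        if dist:
            total = sum(dist.values())
            filler, n = max(((f, c) for f, c in dist.items() if f is not None),
                            key=lambda fc: fc[1], default=(None, 0))
            if filler is not None and n / total >= threshold:
                out.append(filler)
        out.append(tok)
        prev = tok
    return " ".join(out)

# Invented toy transcripts, for illustration only (not data from the paper).
transcripts = [
    "<um> i think <you_know> it works",
    "<um> i guess so",
    "well i think so",
]
model = train_filler_model(transcripts)
print(insert_fillers("I think it is fine", model))  # um I think you know it is fine
```

A raw-frequency threshold is used only because it is easy to follow. A real system would more likely use a trained sequence classifier with richer context features.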

Index Terms: speech synthesis, unit selection, conversation, spontaneous speech, lexical fillers, filled pauses


Bibliographic reference.  Andersson, Sebastian / Georgila, Kallirroi / Traum, David / Aylett, Matthew / Clark, Robert A. J. (2010): "Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection", In SP-2010, paper 116.