13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Psychoacoustic Segment Scoring for Multi-Form Speech Synthesis

Alexander Sorin (1), Slava Shechtman (1), Vincent Pollet (2)

(1) Speech Technologies, IBM Haifa Research Lab, Haifa, Israel
(2) Text-To-Speech Research, Nuance Communications, Merelbeke, Belgium

In multi-form segment synthesis, output speech is constructed by splicing waveform segments with statistically modeled and regenerated parametric speech segments. The fraction of model-derived segments is called the model-template ratio. The motivation of this work is to further increase the flexibility of multi-form synthesis while maintaining high speech quality at high model-template ratios. An approach is presented in which the representation type of a segment is selected per acoustic leaf. We introduce a novel method for leaf representation selection based on a psychoacoustic segment stationarity score. Additionally, we present refinements to multi-form segment concatenation for synthesis at high model-template ratios, including boundary-constrained statistical parametric synthesis and time-domain alignment based on multi-peak analysis of the cross-correlation.
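
As a rough illustration of cross-correlation-based time-domain alignment, the sketch below lists several candidate lags between two segments rather than only the single best one. It is not the authors' algorithm: the abstract does not specify how the multi-peak analysis works, so the function name xcorr_peaks, the parameters max_lag and top_k, and the peak-picking rule are all assumptions made for this example.

```python
import math

def xcorr_peaks(a, b, max_lag, top_k=3):
    """Return up to top_k (lag, score) local maxima of the normalized
    cross-correlation between sample lists a and b.

    Hypothetical helper for illustration only; lag is defined so that
    a[i] is compared with b[i + lag].
    """
    def score(lag):
        # Accumulate over the samples where both signals overlap.
        num = ea = eb = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                num += x * b[j]
                ea += x * x
                eb += b[j] * b[j]
        den = math.sqrt(ea * eb)
        return num / den if den > 0 else 0.0

    lags = list(range(-max_lag, max_lag + 1))
    scores = [score(lag) for lag in lags]

    # Keep interior local maxima as alignment candidates (assumed rule).
    peaks = []
    for k in range(1, len(lags) - 1):
        if scores[k] > scores[k - 1] and scores[k] >= scores[k + 1]:
            peaks.append((lags[k], scores[k]))

    # Strongest candidates first.
    peaks.sort(key=lambda p: p[1], reverse=True)
    return peaks[:top_k]
```

For a periodic signal such as voiced speech, several lags roughly one pitch period apart can score almost equally well, which is one plausible reason to examine multiple peaks instead of taking only the global maximum. How the final lag is chosen among the candidates is left open here.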

Index Terms: speech synthesis, multi-form segments, speech stationarity, psychoacoustic segment scoring, statistical parametric synthesis, segment concatenation


Bibliographic reference.  Sorin, Alexander / Shechtman, Slava / Pollet, Vincent (2012): "Psychoacoustic segment scoring for multi-form speech synthesis", In INTERSPEECH-2012, 2214-2217.