First International Conference on Spoken Language Processing (ICSLP 90)

Kobe, Japan
November 18-22, 1990

Duration, Pitch and Diphones in the CSTR TTS System

W. Nick Campbell, Stephen D. Isard, Alex I. C. Monaghan, J. Verhoeven

Edinburgh University, Centre for Speech Technology Research, Edinburgh, UK

This paper describes the prosodic processing and waveform generation components of the text-to-speech system being developed at Edinburgh University's Centre for Speech Technology Research. Intonation is specified as a sequence of minimal descriptors whose locations are given in terms of syntactically determined prosodic domains. A pitch contour is computed by converting the descriptors into a sequence of abstract targets whose absolute values depend on a specific speaker model. Duration is determined first at the level of the syllable by a neural network, then accommodated at the segment level according to the distributions observed in a phonetically balanced database. The output waveform is generated by LPC resynthesis of diphone units. Three methods of diphone segmentation are discussed.
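The step of accommodating a predicted syllable duration at the segment level can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so the following is an assumption: each segment's log duration is treated as normally distributed (mean and standard deviation taken from a database), and every segment in a syllable is stretched or compressed by the same number of standard deviations `k`, chosen so that the segment durations sum to the syllable duration. The function name `fit_segment_durations` and the example statistics are hypothetical, not values from the paper.

```python
import math


def fit_segment_durations(syllable_ms, segment_stats, tol=1e-6, max_iter=200):
    """Hypothetical sketch: share one z-score k across all segments of a syllable.

    segment_stats: list of (mean_log_ms, sd_log_ms) pairs, one per segment,
    assumed to describe log-duration distributions observed in a database.
    Returns (k, durations_ms) such that sum(durations_ms) ~= syllable_ms.
    """

    def total(k):
        # Duration of each segment at z-score k in the log domain.
        return sum(math.exp(mu + k * sd) for mu, sd in segment_stats)

    lo, hi = -10.0, 10.0
    if not (total(lo) <= syllable_ms <= total(hi)):
        raise ValueError("syllable duration outside the range the segments can reach")

    # total(k) increases monotonically with k (all sd > 0), so bisection converges.
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if total(mid) < syllable_ms:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break

    k = (lo + hi) / 2.0
    return k, [math.exp(mu + k * sd) for mu, sd in segment_stats]


# Illustrative (invented) statistics for a three-segment syllable.
stats = [(math.log(70.0), 0.30), (math.log(110.0), 0.35), (math.log(80.0), 0.30)]
k, durations = fit_segment_durations(300.0, stats)
print(round(k, 3), [round(d, 1) for d in durations])
```

Because the adjustment is made in standard-deviation units, segments that vary more in the database absorb more of the lengthening or shortening than segments whose durations are stable.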


Bibliographic reference.  Campbell, W. Nick / Isard, Stephen D. / Monaghan, Alex I. C. / Verhoeven, J. (1990): "Duration, pitch and diphones in the CSTR TTS system", In ICSLP-1990, 825-828.