Second ESCA/IEEE Workshop on Speech Synthesis
September 12-15, 1994
One of the enduring problems in achieving natural-sounding synthetic
speech is that of getting the rhythm right. Usually this problem is
construed as the search for appropriate algorithms for altering durations
of segments under various contextual conditions (eg initially versus finally
in word or phrase, in stressed versus unstressed syllables). Recently,
Campbell and Isard (1991) have suggested that a more effective model is
one in which the syllable is taken as the distinguished timing unit
and segmental durations accommodated secondarily to syllable durations.
We propose here that there is no distinguished timing unit. While other
synthesis systems use phonemes, diphones or other linearly arranged
phone-sized units and employ 'hidden structure', YorkTalk uses explicit
tree-like phonological representations.
We will compare the temporal characteristics of the output of the YorkTalk system with Klattalk (Klatt, ms) on the one hand and the naturalistic observations of Fowler (1981) on the other. We will show that it is possible to produce similar, natural-sounding temporal relations by employing linguistic structures which are given a compositional parametric and temporal interpretation (Local, 1992; Ogden, 1992).
YorkTalk's metrical and phonotactic parsers parse input into structures consisting of feet, syllables and syllable constituents. In these structures, the rime is the head of the syllable, the nucleus is the head of the rime and the strong syllable is the head of the foot (cf Coleman 1992). Every node in the graph is given a head-first temporal and parametric phonetic interpretation. A co-production model of coarticulation (cf Fowler 1980) is implemented in YorkTalk by overlaying parameters. Since the nucleus is the head of the syllable, the nucleus and syllable are coextensive. Because the onset and coda are fitted within the temporal space of the nucleus, they inherit the properties of the whole syllable. Where structures permit, constituents are shared between syllables (ambisyllabicity). The temporal interpretation of ambisyllabicity is the temporal and parametric overlaying of one syllable on another (Local, 1992; Ogden, 1992).
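The temporal side of this interpretation can be illustrated with a small sketch. It is not YorkTalk's implementation: the class names, the millisecond durations and the overlap value are all assumptions invented for the example, and the parametric overlaying is not modelled. The sketch only shows a nucleus that is coextensive with its syllable, an onset and coda fitted inside that span, and an ambisyllabic consonant obtained by overlaying one syllable on another.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) in milliseconds


@dataclass
class Constituent:
    label: str
    duration: int  # hypothetical duration in ms


@dataclass
class Syllable:
    nucleus: Constituent
    onset: Optional[Constituent] = None
    coda: Optional[Constituent] = None

    def intervals(self, start: int = 0) -> Dict[str, Span]:
        """Head-first interpretation: the nucleus fixes the syllable span,
        then the onset and coda are fitted inside that span."""
        end = start + self.nucleus.duration
        spans: Dict[str, Span] = {"syllable": (start, end), "nucleus": (start, end)}
        if self.onset is not None:
            spans["onset"] = (start, start + self.onset.duration)
        if self.coda is not None:
            spans["coda"] = (end - self.coda.duration, end)
        return spans


def overlay_syllables(syllables: List[Syllable], overlap: int) -> List[Dict[str, Span]]:
    """Lay syllables out in sequence, starting each one `overlap` ms
    before the previous syllable ends (assumed model of ambisyllabicity)."""
    result: List[Dict[str, Span]] = []
    t = 0
    for i, syl in enumerate(syllables):
        if i > 0:
            t -= overlap
        spans = syl.intervals(t)
        result.append(spans)
        t = spans["syllable"][1]
    return result


# Hypothetical two-syllable foot in which /t/ is shared between syllables.
first = Syllable(nucleus=Constituent("i", 300),
                 onset=Constituent("p", 80),
                 coda=Constituent("t", 90))
second = Syllable(nucleus=Constituent("i", 250),
                  onset=Constituent("t", 90))

spans = first.intervals()
foot = overlay_syllables([first, second], overlap=90)

for name, syl_spans in zip(("strong", "weak"), foot):
    print(name, syl_spans)
```

With the overlap set to the duration of the shared consonant, the coda of the first syllable and the onset of the second occupy the same interval, which is one simple way to picture a constituent belonging to both syllables at once.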
Bibliographic reference. Local, John / Ogden, Richard (1994): "A model of timing for non-segmental phonological structure", In SSW2-1994, 236-239.