Second ESCA/IEEE Workshop on Speech Synthesis
September 12-15, 1994
The past several years have seen fundamental changes in the methods of speech communication research and in the approaches to speech technology development. We can now test theories on a scale that was impractical a few years ago. Where theory fails, we can now contemplate a purely empirical approach to describing speech phenomena. This change has been driven in part by several factors related to computational technology, including: (1) enhanced communication across disciplines (e.g., engineering and linguistics); (2) the availability of large standard speech databases; (3) mathematically and computationally tractable models of speech acoustic realizations (e.g., hidden Markov models); (4) more powerful computers; (5) standardized, portable speech R&D tools; and (6) demands from potential applications.
The practical manifestations of this change are emerging in the form of new speech products and greatly improved performance of laboratory speech recognition systems. Speech synthesis technology is also beginning to reap the benefits of these changes. For example, striking improvements have been made in text-to-speech (TTS) systems as a result of empirical studies of natural productions at the prosodic and segmental levels. Large-scale text analyses have led to more robust methods for part-of-speech determination, homograph disambiguation, and stressing of complex nominals. We can anticipate that this trend will continue as new tools enable new methods.

Several technical areas benefit from the availability of large time-aligned, phonetically-labeled databases. Basic speech and language studies concerned with the timing of events, dialectal variation, coarticulation, and segmental-prosodic interactions usually require some annotation of the speech signal at the segmental level. Development of convincing models of segment duration variation as a function of context is important to the implementation of high-quality TTS systems. Acoustic models used as the basis for speech recognizers can be trained more quickly if good initial estimates of the phone and word boundaries can be obtained. Synchronization of cross-modal stimuli with speech is of interest both to those studying perception and to the entertainment industry, where realistic facial animations can be achieved through such synchronization.
The idea of automatic text-speech alignment is an old one, and several implementation approaches have been taken with varying degrees of success. Space does not permit even a cursory review of the extensive work in this area. While many of these systems achieve very good performance, they are often inconvenient to use and/or are not easily accessible to many who could benefit from them. The work described in this paper is a direct result of interactions at the Third Prosody Workshop held at Ohio State University in June 1993, where there was some discussion about ways to automate parts of the ToBI transcription procedure. It became clear that considerable progress along these lines could be made if a text-speech aligner could be provided to more workers in the field.
The major contributions offered by the aligner described here are: (1) Accessibility: The system is readily available and can run on most UNIX workstations. (2) Ease of use: The requirements for human intervention in the process have been reduced to a minimum, and convenient graphical interfaces are provided for all operator tasks. (3) Good alignment accuracy is achieved in a speaker-independent mode. (4) Phone sequences and boundaries are determined automatically from the word sequence and the speech signal. (5) Fast alignment: It runs in near real time on common workstations. (6) Evaluation was performed on a standard database.
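To make the core idea of Markov-model alignment concrete, the sketch below shows a toy forced alignment by Viterbi decoding over a strictly left-to-right phone sequence. This is not the authors' system: it assumes per-frame phone log-likelihoods have already been produced by some acoustic model (here passed in as a hypothetical `log_likes` array, one state per phone), and it omits the pronunciation dictionary, training, and all real-world detail.

```python
import numpy as np

def forced_align(log_likes):
    """Toy Viterbi forced alignment.

    log_likes: (T, P) array where log_likes[t, p] is the log-likelihood
    of frame t under phone p.  The phones must be traversed in order
    0..P-1 (left-to-right topology, self-loops allowed, no skips).
    Returns a length-T list assigning each frame to a phone index;
    phone boundaries fall where the assigned index changes.
    """
    T, P = log_likes.shape
    NEG = -np.inf
    delta = np.full((T, P), NEG)        # best path score ending at (t, p)
    psi = np.zeros((T, P), dtype=int)   # backpointer: phone at frame t-1
    delta[0, 0] = log_likes[0, 0]       # must start in the first phone
    for t in range(1, T):
        for p in range(P):
            stay = delta[t - 1, p]                      # self-loop
            move = delta[t - 1, p - 1] if p > 0 else NEG  # advance one phone
            if move > stay:
                delta[t, p] = move + log_likes[t, p]
                psi[t, p] = p - 1
            else:
                delta[t, p] = stay + log_likes[t, p]
                psi[t, p] = p
    # Backtrace from the final phone at the last frame.
    path = [P - 1]
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]
```

For example, with five frames whose likelihoods favor phone 0 for the first three frames and phone 1 thereafter, the decoder places the boundary between frames 2 and 3, yielding the assignment `[0, 0, 0, 1, 1]`. A real aligner expands the word sequence into phones via the dictionary and uses multi-state context-dependent HMMs, but the dynamic-programming core is the same.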
Bibliographic reference. Talkin, David / Wightman, Colin W. (1994): "The aligner: text to speech alignment using Markov models and a pronunciation dictionary", In SSW2-1994, 89-92.