Third ESCA/COCOSDA Workshop on Speech Synthesis

November 26-29, 1998
Jenolan Caves House, Blue Mountains, NSW, Australia

Trying to Mimic Human Segmentation of Speech Using HMM and Fuzzy Logic Post-correction Rules

D. Torre Toledano, M. A. Rodríguez Crespo, J. G. Escalada Sardina

Speech Technology Group, Telefónica Investigación y Desarrollo, Madrid, Spain

Human segmentation and labelling of speech can be seen as a two-step process. In the first step, humans listen to a speech signal, recognize the word and phoneme sequence, and roughly determine the position of each phonetic boundary. In the second step, they examine several speech signal features (waveform, energy, spectrogram, etc.) and place each phonetic boundary time mark where these features best satisfy a set of conditions specific to that kind of phonetic boundary. This paper presents an automatic two-stage system for phonetic segmentation and labelling of speech that tries to mimic this two-step process. The first stage is a context-dependent phonetic HMM recognizer that yields the recognized phoneme sequence and a set of rough phonetic boundary time marks. The second stage extracts several speech signal features intended to be the counterpart of those examined by humans and uses them to refine each rough time mark obtained in the first stage: each time mark is moved to a nearby position where the degree of truth of a set of fuzzy logic conditions (specific to that kind of phonetic boundary) is maximal. These fuzzy logic conditions are intended to be the counterpart of the conditions tested by humans.
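The second-stage refinement can be illustrated with a minimal Python sketch. It is not the system described above: the abstract does not give the actual features, rules, or search range, so everything here is an assumption made for illustration. The sketch uses a single per-frame energy feature, one hypothetical rule (`silence_to_vowel`) with arbitrary thresholds, the minimum as fuzzy AND, and a fixed search radius in frames. It only shows the general idea of moving a rough time mark to the nearby frame where the degree of truth of a fuzzy condition is highest.

```python
from typing import Callable, Sequence


def ramp(x: float, lo: float, hi: float) -> float:
    """Fuzzy membership that rises linearly from 0 at `lo` to 1 at `hi`."""
    if hi == lo:
        return 1.0 if x >= hi else 0.0
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def fuzzy_and(*degrees: float) -> float:
    """Fuzzy conjunction, taken here as the minimum of the degrees of truth."""
    return min(degrees)


def silence_to_vowel(energy: Sequence[float], i: int) -> float:
    """Hypothetical rule: energy is low just before frame i AND high at frame i.

    The thresholds 0.1 and 0.5 are arbitrary illustrative values.
    """
    if i == 0:
        return 0.0  # no preceding frame to compare against
    low_before = 1.0 - ramp(energy[i - 1], 0.1, 0.5)
    high_at = ramp(energy[i], 0.1, 0.5)
    return fuzzy_and(low_before, high_at)


def refine_boundary(
    energy: Sequence[float],
    rough: int,
    rule: Callable[[Sequence[float], int], float],
    radius: int = 3,
) -> int:
    """Return the frame within `radius` of `rough` where `rule` is most true.

    Ties are broken in favour of the frame closest to the rough mark.
    """
    lo = max(0, rough - radius)
    hi = min(len(energy) - 1, rough + radius)
    return max(range(lo, hi + 1), key=lambda i: (rule(energy, i), -abs(i - rough)))


if __name__ == "__main__":
    # Made-up normalized frame energies with a sharp rise at frame 4.
    energy = [0.0, 0.05, 0.05, 0.1, 0.6, 0.7, 0.8, 0.8]
    # Suppose the first stage placed the rough mark at frame 2.
    print(refine_boundary(energy, rough=2, rule=silence_to_vowel))
```

In this toy example the rule is fully true only at frame 4, so the rough mark at frame 2 is moved there. A real system would presumably combine several features and use a different rule set for each kind of phonetic boundary, as the abstract indicates.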

