September 22-25, 1997
In this paper, we propose and investigate a new approach to using multiple time scale information in automatic speech recognition (ASR) systems. In this framework, we use a particular HMM formalism able to process different input streams and to recombine them at some temporal anchor points. While the phonological level of recombination has to be defined a priori, the optimal temporal anchor points are obtained automatically during recognition. In the current approach, these parallel cooperative HMMs focus on different dynamic properties of the speech signal, defined on different time scales. The speech signal is thus described in terms of several information streams, each stream resulting from a particular way of analyzing the speech signal. More specifically, in the current work, models aimed at capturing the syllable-level temporal structure are used in parallel with classical phoneme-based models. Tests on different continuous speech databases show significant performance improvements, motivating further research on efficiently incorporating large time span information, on the order of 200 ms, into our standard 10 ms, phone-based ASR systems.
Bibliographic reference. Dupont, Stéphane / Bourlard, Hervé (1997): "Using multiple time scales in a multi-stream speech recognition system", In EUROSPEECH-1997, 3-6.