5th European Conference on Speech Communication and Technology

Rhodes, Greece
September 22-25, 1997

Automatic Transcription of General Audio Data: Effect of Environment Segmentation on Phonetic Recognition

Michelle S. Spina, Victor W. Zue

Spoken Language Systems Group, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

The task of automatically transcribing general audio data is very different from the tasks usually confronted by current automatic speech recognition systems. The general goal of our work is to determine the optimal training strategy for recognizing such data. Specifically, we have studied the effects of different speaking environments on a phonetic recognition task using data collected from a radio news program. We found that if a single recognizer is to be used, it is more effective to use a smaller amount of homogeneous, clean data for training. This approach yielded a decrease in phonetic recognition error rate of over 26% compared with a system trained on an equivalent amount of data that contained a variety of speaking environments. We found that additional gains can be made with a multiple-recognizer system trained with environment-specific data. Overall, this approach yielded a decrease in error rate of nearly 2%, with the error rates of some individual speaking environments decreasing by over 7%.

Bibliographic reference. Spina, Michelle S. / Zue, Victor W. (1997): "Automatic transcription of general audio data: effect of environment segmentation on phonetic recognition", In EUROSPEECH-1997, 1547-1550.