First International Conference on Spoken Language Processing (ICSLP 90)

Kobe, Japan
November 18-22, 1990

Collection and Analysis of Spontaneous and Read Corpora for Spoken Language System Development

Michal Soclof, Victor W. Zue

Spoken Language Systems Group, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA

As part of our effort in developing the MIT VOYAGER system, we recently collected speech data from 50 male and 50 female subjects under a simulation mode, in which spoken sentences were typed into the computer for automatic natural language processing and response generation. Since a computer log of the spoken dialogue was maintained, we were able to ask the subjects to provide read versions of the sentences as well. Thus the corpus includes both a read version and a spontaneous version of (approximately) the same sentence, modulo false starts and filled pauses in the spontaneous version. These corpora enable us to make a direct comparison between spontaneous and read speech. In total, we collected nearly 5000 spontaneous sentences and an equal number of read sentences. All sentences were digitized and orthographically transcribed. In addition, time-aligned phonetic transcriptions were obtained for about 35% of the data. This paper documents the data collection process and provides some linguistic and acoustic analyses of the collected data.

