First International Conference on Spoken Language Processing (ICSLP 90)
As part of our effort in developing the MIT VOYAGER system, we recently collected speech data from 50 male and 50 female subjects under a simulation mode, in which spoken sentences were typed into the computer for automatic natural language processing and response generation. Since a computer log of the spoken dialogue was maintained, we were able to ask the subjects to provide read versions of the sentences as well. Thus the corpus includes both a read version and a spontaneous version of (approximately) the same sentence, modulo false starts and filled pauses in the spontaneous version. These corpora enable us to make a direct comparison between spontaneous and read speech. All in all, we were able to collect nearly 5000 spontaneous sentences, with an equal number of read sentences. All sentences were digitized and orthographically transcribed. In addition, time-aligned phonetic transcriptions were obtained for about 35% of the data. This paper documents the data collection process and provides some linguistic and acoustic analyses of the collected data.
Bibliographic reference. Soclof, Michal / Zue, Victor W. (1990): "Collection and analysis of spontaneous and read corpora for spoken language system development", in ICSLP-1990, 1105-1108.