5th European Conference on Speech Communication and Technology

Rhodes, Greece
September 22-25, 1997

The Design of a Large Vocabulary Speech Corpus for Portuguese

Joao P. Neto, Ciro A. Martins, Hugo Meinedo, Luis B. Almeida

INESC - IST R. Alves Redol, Lisboa, Portugal

Recent years have seen great progress in large-vocabulary, speaker-independent continuous speech recognition systems, along with some research on multilingual aspects. To extend that progress to European Portuguese, we decided to design and collect a large database of continuous speech based on a large amount of text. Our aim in developing this new Portuguese database was to create a corpus equivalent in size to WSJ0. We selected the database texts from the PUBLICO newspaper, which offers broad coverage of topics and a variety of writing styles. The speakers were recruited from a large engineering school, assuring a large variability of speakers. The recordings are in progress at the time of writing, and we expect to release the database on CD in September 1997.


Bibliographic reference. Neto, Joao P. / Martins, Ciro A. / Meinedo, Hugo / Almeida, Luis B. (1997): "The design of a large vocabulary speech corpus for Portuguese", in EUROSPEECH-1997, 1707-1710.