First International Conference on Spoken Language Processing (ICSLP 90)

Kobe, Japan
November 18-22, 1990

Design Considerations and Text Selection for BREF, a Large French Read-Speech Corpus

Jean-Luc Gauvain, Lori F. Lamel, Maxine Eskenazi

LIMSI-CNRS, BP 133, Orsay, France

BREF, a large read-speech corpus in French, has been designed with several aims: to provide enough speech data to develop dictation machines, to provide data for the evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a corpus of continuous speech for studying phonological variations. This paper presents some of the design considerations of BREF, focusing on the text analysis and the selection of text materials. The texts to be read were selected from 5 million words of the French newspaper Le Monde. In total, 11,000 texts were selected, with an emphasis on maximizing the number of distinct triphones. Separate text materials were selected for the training and test corpora. The goal is to obtain about 10,000 words (approximately 60-70 minutes) of speech from each of 100 speakers of different French dialects.
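The abstract does not say how the emphasis on distinct triphones was put into practice. As an illustration only, one common way to approach such a selection is a greedy coverage heuristic: repeatedly pick the text that adds the most triphones not yet covered. The sketch below assumes each text is already available as a list of phone symbols. The function names, the phone symbols, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the distinct triphones (runs of 3 consecutive phones) in a text."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def greedy_select(texts: Dict[str, List[str]], n_select: int) -> List[str]:
    """Greedily pick up to n_select text ids, each adding the most new triphones.

    Assumption: this greedy heuristic is a stand-in for illustration,
    not the selection procedure used for BREF.
    """
    covered: Set[Triphone] = set()
    chosen: List[str] = []
    remaining = {tid: triphones(p) for tid, p in texts.items()}
    while remaining and len(chosen) < n_select:
        # Sorting the ids makes tie-breaking deterministic.
        best_id = max(sorted(remaining), key=lambda tid: len(remaining[tid] - covered))
        gain = remaining[best_id] - covered
        if not gain:
            break  # no remaining text adds a new triphone
        chosen.append(best_id)
        covered |= gain
        del remaining[best_id]
    return chosen


# Hypothetical toy data: text ids mapped to made-up phone sequences.
example_texts = {
    "t1": ["b", "o~", "Z", "u", "R"],
    "t2": ["b", "o~", "Z"],
    "t3": ["s", "a", "l", "y"],
}
selected = greedy_select(example_texts, 3)
```

In this toy case `t2` is never selected, because its only triphone is already covered by `t1`. A real selection would also have to respect the separation between training and test materials mentioned in the abstract.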

Bibliographic reference. Gauvain, Jean-Luc / Lamel, Lori F. / Eskenazi, Maxine (1990): "Design considerations and text selection for BREF, a large French read-speech corpus", in ICSLP-1990, 1097-1100.