Third International Conference on Spoken Language Processing (ICSLP 94)

Yokohama, Japan
September 18-22, 1994

Automating the Design of Compact Linguistic Corpora

Rob Kassel

Spoken Language Systems Group, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

In this paper we address two aspects of linguistic corpus construction. First, we examine how to select the units that the design should cover. Rather than enumerating a set of fixed-length units, we derive variable-length units based on a measure of cohesiveness. Next, we consider how to select material that covers these, or other, units efficiently. Our scoring procedure takes frequency distributions into account to improve the compactness of the result. The proposed techniques have been successfully applied to the design of a handwriting corpus at MIT and a speech corpus elsewhere.
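
The abstract does not spell out the scoring procedure, so the following is only a rough sketch of what frequency-aware selection of covering material might look like, not the paper's method. It assumes a greedy cover in which each candidate is scored by the summed inverse frequency of the units it would newly cover, so that material containing rare units is chosen early. The names `bigrams` and `select_compact` are hypothetical, and character bigrams stand in for the variable-length units the paper derives.

```python
from collections import Counter


def bigrams(text):
    """Hypothetical unit extractor: character bigrams inside each word.

    A fixed-length stand-in for the paper's variable-length units.
    """
    return {w[i:i + 2] for w in text.lower().split() for i in range(len(w) - 1)}


def select_compact(candidates, units_of=bigrams):
    """Greedily pick candidates until every unit in the pool is covered.

    Assumed scoring (not taken from the paper): the sum of 1/frequency
    over the units a candidate would newly cover.
    """
    unit_sets = [units_of(c) for c in candidates]
    # How many candidates contain each unit.
    freq = Counter(u for units in unit_sets for u in units)
    uncovered = set(freq)
    chosen = []

    def score(i):
        return sum(1.0 / freq[u] for u in unit_sets[i] & uncovered)

    while uncovered:
        best = max(range(len(candidates)), key=score)
        chosen.append(candidates[best])
        uncovered -= unit_sets[best]
    return chosen


if __name__ == "__main__":
    pool = ["the cat", "the hat", "cat hat", "zebra"]
    print(select_compact(pool))
```

Weighting by inverse frequency is one simple way to let a frequency distribution influence compactness: units that occur in few candidates constrain the choice most, so covering them first tends to leave the common units to be picked up incidentally.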


Bibliographic reference. Kassel, Rob (1994): "Automating the design of compact linguistic corpora", in ICSLP-1994, 1827-1830.