Third International Conference on Spoken Language Processing (ICSLP 94)
In this paper we address two aspects of linguistic corpus construction. First, we examine the process of selecting the units to be covered in our design. Rather than enumerating a set of fixed-length units, we derive variable-length units based on a measure of cohesiveness. Next, we consider how to select material that efficiently covers these, or other, units. Our scoring procedure takes frequency distributions into account to make the resulting corpus more compact. The proposed techniques have been successfully applied to the design of a handwriting corpus at MIT and a speech corpus elsewhere.
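The material-selection step can be illustrated with a minimal sketch of a frequency-aware greedy cover. This is not the paper's scoring procedure, which the abstract does not specify. The sketch assumes that units are character bigrams (a fixed-length stand-in for the variable-length units described above), that rare units are weighted by inverse frequency, and that scores are normalised by candidate length. The names `bigrams` and `select_material` are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Hypothetical unit extractor: the set of adjacent character pairs."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def select_material(candidates, extract=bigrams):
    """Greedily choose candidates until every observed unit is covered.

    Assumption (not from the paper): a candidate's score is the sum of
    inverse frequencies of the still-uncovered units it contains,
    divided by its length, so short items carrying rare units win first.
    """
    unit_sets = {c: extract(c) for c in candidates}
    # Frequency of each unit = number of candidates that contain it.
    freq = Counter(u for units in unit_sets.values() for u in units)
    uncovered = set(freq)
    chosen = []

    def score(c):
        gain = sum(1.0 / freq[u] for u in unit_sets[c] & uncovered)
        return gain / max(len(c), 1)

    while uncovered:
        # Every uncovered unit occurs in some candidate, so the best
        # candidate always covers at least one new unit.
        best = max(candidates, key=score)
        chosen.append(best)
        uncovered -= unit_sets[best]
    return chosen


if __name__ == "__main__":
    print(select_material(["abc", "bcd", "abcd", "xyz"]))
```

Weighting by inverse frequency reflects one plausible reading of "takes frequency distributions into account": common units tend to be covered incidentally, so the selection effort is spent on rare ones.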
Bibliographic reference. Kassel, Rob (1994): "Automating the design of compact linguistic corpora", In ICSLP-1994, 1827-1830.