13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Rethinking the Corpus: Moving towards Dynamic Linguistic Resources

Andrew Rosenberg

Department of Computer Science, Queens College (CUNY), Flushing, NY, USA

The corpus is an invaluable resource in Spoken and Natural Language Processing. Consistent data sets have allowed for empirical evaluation of competing algorithms. The sharing of high-quality annotated linguistic data has enabled participation and experimentation by a wide range of researchers. However, although these annotations are dubbed "gold-standard", many corpora contain labeling errors and idiosyncrasies. The current view of the corpus as a static resource makes correction of errors and other modifications prohibitively difficult. In this paper, a perspective of the corpus as dynamically changing is advanced. Version control software can provide a mechanism to facilitate this. We highlight the problems of the static view of the corpus through case studies of the Penn Treebank, Switchboard, Hub-4 and the Boston University Radio News Corpus.

Index Terms: Linguistic Resources, Opinion paper


Bibliographic reference. Rosenberg, Andrew (2012): "Rethinking the corpus: moving towards dynamic linguistic resources", in INTERSPEECH-2012, 1392-1395.