4th International Conference on Spoken Language Processing
Philadelphia, PA, USA
This paper describes an approach for predicting both the vocabulary size and the resulting out-of-vocabulary rate (OOV-rate) for a hypothetical extension of an existing text corpus. The original corpus is split into two sub-corpora, and the vocabulary size and OOV-rate are determined for that particular split. These values are averaged over all combinations of sub-corpora, and the averages can be approximated by analytic functions. The fitted functions make it easy to predict the vocabulary size and the OOV-rate of the extended corpus. The predictions have a relative error below 4.6%.
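The split-and-average procedure summarized above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how the sub-corpora are formed or which analytic functions are used, so the equal-sized contiguous parts, the choice to measure the OOV-rate on the held-out parts, and the power-law form `y = a * x**b` are all assumptions. The function names `split_into_parts`, `average_vocab_and_oov`, and `fit_power_law` are hypothetical.

```python
from itertools import combinations
from math import exp, log
from statistics import mean


def split_into_parts(tokens, num_parts):
    """Cut the token list into num_parts contiguous parts of equal length.

    Assumption: equal-sized contiguous parts; leftover tokens are dropped.
    """
    part_len = len(tokens) // num_parts
    return [tokens[i * part_len:(i + 1) * part_len] for i in range(num_parts)]


def average_vocab_and_oov(parts):
    """Average vocabulary size and OOV-rate over all combinations of parts.

    For each k, every choice of k parts acts as the "known" sub-corpus and
    the remaining parts as the held-out sub-corpus (assumed setup).
    Returns a list of (known_tokens, mean_vocab_size, mean_oov_rate).
    """
    results = []
    indices = range(len(parts))
    for k in range(1, len(parts)):
        vocab_sizes, oov_rates = [], []
        for known_idx in combinations(indices, k):
            vocab = {w for i in known_idx for w in parts[i]}
            held_out = [w for i in indices if i not in known_idx
                        for w in parts[i]]
            unseen = sum(1 for w in held_out if w not in vocab)
            vocab_sizes.append(len(vocab))
            oov_rates.append(unseen / len(held_out))
        known_tokens = k * len(parts[0])
        results.append((known_tokens, mean(vocab_sizes), mean(oov_rates)))
    return results


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space.

    Assumption: a power law stands in for the paper's unspecified analytic
    functions. All xs and ys must be strictly positive.
    """
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    mx, my = mean(lx), mean(ly)
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = exp(my - b * mx)
    return a, b
```

Under these assumptions, a prediction for a larger corpus of `n` tokens would be obtained by fitting the averaged vocabulary sizes with `fit_power_law` and evaluating `a * n**b`. The averaged OOV-rates can be fitted the same way as long as none of them is zero.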
Bibliographic reference. Müller, Johannes / Stahl, Holger / Lang, Manfred (1996): "Predicting the out-of-vocabulary rate and the required vocabulary size for speech processing applications", In ICSLP-1996, 1922-1925.