EUROSPEECH 2001 Scandinavia
7th European Conference on Speech Communication and Technology

Aalborg, Denmark
September 3-7, 2001


Word Unit Based Multilingual Comparative Analysis of Text Corpora

Géza Németh, Csaba Zainkó

Budapest University of Technology and Economics, Hungary

A parallel study of three linguistically different languages - Hungarian, German and English - using text corpora of similar size makes it possible to explore both similarities and differences. Corpora drawn from publicly available Internet sources were used. Besides traditional corpus coverage, word length and occurrence statistics, some new features related to prosodic boundaries (sentence-initial and sentence-final positions, positions preceding and following a comma) were also computed. Among other results, it was found that the coverage of the corpora by the most frequent words follows a parallel logarithmic rule for all three languages in the 40-85% coverage range. The coverage functions for English and German are much closer to each other than to that for Hungarian. The results can be applied in such diverse domains as predictive text input, word hyphenation, language modeling in speech recognition, and corpus-based speech synthesis.

Keywords: text corpora, corpus analysis, multilinguality, word length, sentence length, unit based analysis, language modeling, corpus-based speech synthesis
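
As an illustration of the coverage statistic mentioned in the abstract, the following minimal Python sketch computes the share of a corpus covered by its k most frequent word forms. It is not the authors' implementation: the function names (tokenize, coverage_curve, words_needed) and the tokenization rule (lowercased runs of letters) are assumptions made only for this example, and the paper's own definition of a word unit may differ.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed word definition: lowercased runs of Unicode letters.
    # The paper's own word-unit definition may differ.
    return re.findall(r"[^\W\d_]+", text.lower())


def coverage_curve(tokens):
    """Return a list whose element k-1 is the fraction of all tokens
    covered by the k most frequent word forms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    curve = []
    running = 0
    for _, n in counts.most_common():
        running += n
        curve.append(running / total)
    return curve


def words_needed(curve, target):
    """Smallest number of most-frequent word forms whose cumulative
    coverage reaches `target` (a fraction in 0..1), or None."""
    for k, covered in enumerate(curve, start=1):
        if covered >= target:
            return k
    return None


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    curve = coverage_curve(tokenize(sample))
    print(curve)
    print(words_needed(curve, 0.6))
```

Plotting such a curve against the logarithm of k for each language is one way to inspect the 40-85% coverage range the abstract refers to.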

Bibliographic reference.  Németh, Géza / Zainkó, Csaba (2001): "Word unit based multilingual comparative analysis of text corpora", In EUROSPEECH-2001, 2035-2038.