Contemporary Polish Language Model (Version 2) Using Big Data and Sub-Word Approach

Krzysztof Wołk


Language and vocabulary continue to evolve in the era of big data, which makes language modelling an important language processing task, and one that benefits from the large volumes of text that web-based corpora provide in many languages. In this paper, we present a set of 6-gram language models of contemporary Polish trained on big data, using the Common Crawl corpus (a compilation of over 3.25 billion webpages) and other resources. The raw corpora and the trained models are provided in several variants: POS-tagged, tagged with grammatical groups, divided into sub-word units, and combinations of these. We also present an updated dictionary of contemporary Polish. The big-data language models were trained with the KenLM toolkit and are distributed in ARPA format; pre-trained vector models are provided as well. We evaluated the models by their perplexity and by the BLEU scores of machine translation (MT) systems that use them. In these experiments, our model outperformed both Google’s WEB1T n-gram counts and the first version of our model, yielding better perplexity and better machine translation quality. The models can be applied in a range of natural language processing tasks and in several interdisciplinary scientific fields.
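For readers unfamiliar with the perplexity measure mentioned above, the following is a minimal sketch of how it can be computed from per-token log probabilities. It assumes base-10 logarithms, which is the convention of the ARPA format; the function name `perplexity` and the toy input are illustrative assumptions and are not taken from the paper or from any toolkit.

```python
import math


def perplexity(log10_probs):
    """Perplexity of a token sequence from per-token log10 probabilities.

    Hypothetical helper for illustration: PPL = 10 ** (-(1/N) * sum(log10 p_i)).
    """
    if not log10_probs:
        raise ValueError("at least one token log-probability is required")
    average_log10 = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-average_log10)


# Toy input: four tokens, each assigned probability 0.25 by some model.
toy_scores = [math.log10(0.25)] * 4
toy_ppl = perplexity(toy_scores)
```

A lower value means the model assigns higher probability to the evaluated text, which is the sense in which one language model improves on another in perplexity.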


DOI: 10.21437/Interspeech.2020-1207

Cite as: Wołk, K. (2020) Contemporary Polish Language Model (Version 2) Using Big Data and Sub-Word Approach. Proc. Interspeech 2020, 4931-4935, DOI: 10.21437/Interspeech.2020-1207.


@inproceedings{Wołk2020,
  author={Krzysztof Wołk},
  title={{Contemporary Polish Language Model (Version 2) Using Big Data and Sub-Word Approach}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4931--4935},
  doi={10.21437/Interspeech.2020-1207},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1207}
}