MLS: A Large-Scale Multilingual Dataset for Speech Research

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert


This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and covers 8 languages, including about 32K hours of English and a total of 4.5K hours for the other languages. We provide baseline Automatic Speech Recognition (ASR) models and Language Models (LMs) for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available to anyone at http://www.openslr.org.


DOI: 10.21437/Interspeech.2020-2826

Cite as: Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., Collobert, R. (2020) MLS: A Large-Scale Multilingual Dataset for Speech Research. Proc. Interspeech 2020, 2757-2761, DOI: 10.21437/Interspeech.2020-2826.


@inproceedings{Pratap2020,
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  title={{MLS: A Large-Scale Multilingual Dataset for Speech Research}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2757--2761},
  doi={10.21437/Interspeech.2020-2826},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2826}
}