MLS: A Large-Scale Multilingual Dataset for Speech Research

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and covers 8 languages, including about 32K hours of English and a total of 4.5K hours for the other languages. We provide baseline Automatic Speech Recognition (ASR) models and Language Models (LMs) for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available to anyone at

DOI: 10.21437/Interspeech.2020-2826

Cite as: Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., Collobert, R. (2020) MLS: A Large-Scale Multilingual Dataset for Speech Research. Proc. Interspeech 2020, 2757-2761, DOI: 10.21437/Interspeech.2020-2826.

  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  title={{MLS: A Large-Scale Multilingual Dataset for Speech Research}},
  booktitle={Proc. Interspeech 2020},