Advancing Sequence-to-Sequence Based Speech Recognition

Zoltán Tüske, Kartik Audhkhasi, George Saon

This paper presents our effort to improve state-of-the-art speech recognition results using attention-based neural network approaches. Our experiments focus on LibriSpeech, a well-known, publicly available, large speech corpus, but the methods are applicable to other tasks. After systematically applying standard techniques (sophisticated data augmentation, various dropout schemes, scheduled sampling, and warm restarts) and optimizing the search configuration, our model achieves 4.0% and 11.7% word error rate (WER) on the test-clean and test-other sets, without any external language model. A powerful recurrent language model reduces the error rate further to 2.7% and 8.2%. Thus, we not only report the lowest sequence-to-sequence model based numbers on this task to date, but our single system even challenges the best result known in the literature, namely a hybrid model together with recurrent language model rescoring. A simple ROVER combination of several of our attention-based systems achieves 2.5% and 7.3% WER on the test-clean and test-other sets.
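For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the authors' scoring tool; the function names `word_error_rate` and `relative_reduction` are hypothetical, and the sketch assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Relative WER reduction, as a fraction of the baseline WER."""
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Figures from the abstract: without vs. with the recurrent language model.
    print(relative_reduction(4.0, 2.7))    # test-clean
    print(relative_reduction(11.7, 8.2))   # test-other
```

Applied to the figures above, the recurrent language model corresponds to a relative WER reduction of roughly 32% on test-clean and roughly 30% on test-other.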

DOI: 10.21437/Interspeech.2019-3018

Cite as: Tüske, Z., Audhkhasi, K., Saon, G. (2019) Advancing Sequence-to-Sequence Based Speech Recognition. Proc. Interspeech 2019, 3780-3784, DOI: 10.21437/Interspeech.2019-3018.

  author={Zoltán Tüske and Kartik Audhkhasi and George Saon},
  title={{Advancing Sequence-to-Sequence Based Speech Recognition}},
  booktitle={Proc. Interspeech 2019},