Single Headed Attention Based Sequence-to-Sequence Model for State-of-the-Art Results on Switchboard

Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury


It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single-headed attention, LSTM-based model. Using a cross-utterance language model, our single-pass speaker-independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5’00, without a pronunciation lexicon. While careful regularization and data augmentation are crucial in achieving this level of performance, experiments on Switchboard-2000 show that nothing is more useful than more data. Overall, the combination of various regularizations and a simple but fairly large model results in a new state of the art, 4.8% and 8.3% WER on the Switchboard and CallHome sets, using SWB-2000 without any external data resources.
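To make the architecture class concrete, the following is a minimal sketch, not the authors' implementation, of a single-headed attention, LSTM-based encoder-decoder over acoustic features, assuming PyTorch. The class names, layer sizes, additive (Bahdanau-style) scoring function, and the LSTMCell decoder are illustrative assumptions rather than details confirmed by the paper.

import torch
import torch.nn as nn


class SingleHeadAttention(nn.Module):
    """Additive (Bahdanau-style) attention with a single head (illustrative)."""

    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_out, dec_state):
        # enc_out: (B, T, enc_dim), dec_state: (B, dec_dim)
        scores = self.v(torch.tanh(self.w_enc(enc_out) + self.w_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores.squeeze(-1), dim=-1)            # (B, T)
        context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)  # (B, enc_dim)
        return context, weights


class Seq2SeqASR(nn.Module):
    """Bidirectional-LSTM encoder, single-headed attention, LSTM decoder (sketch)."""

    def __init__(self, feat_dim=80, vocab_size=600, enc_dim=512, dec_dim=512, attn_dim=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, enc_dim // 2, num_layers=4,
                               bidirectional=True, batch_first=True)
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.decoder = nn.LSTMCell(dec_dim + enc_dim, dec_dim)
        self.attention = SingleHeadAttention(enc_dim, dec_dim, attn_dim)
        self.output = nn.Linear(dec_dim + enc_dim, vocab_size)

    def forward(self, feats, targets):
        # feats: (B, T, feat_dim) acoustic features; targets: (B, U) token ids
        # (teacher forcing during training).
        enc_out, _ = self.encoder(feats)                       # (B, T, enc_dim)
        B = feats.size(0)
        h = feats.new_zeros(B, self.decoder.hidden_size)
        c = feats.new_zeros(B, self.decoder.hidden_size)
        context = enc_out.new_zeros(B, enc_out.size(-1))
        logits = []
        for u in range(targets.size(1)):
            emb = self.embed(targets[:, u])                    # (B, dec_dim)
            h, c = self.decoder(torch.cat([emb, context], dim=-1), (h, c))
            context, _ = self.attention(enc_out, h)
            logits.append(self.output(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)                      # (B, U, vocab_size)

In the paper, the reported results come from combining such a seq2seq backbone with careful regularization, data augmentation, and a cross-utterance language model; the sketch above only illustrates the single-headed attention encoder-decoder structure itself.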


DOI: 10.21437/Interspeech.2020-1488

Cite as: Tüske, Z., Saon, G., Audhkhasi, K., Kingsbury, B. (2020) Single Headed Attention Based Sequence-to-Sequence Model for State-of-the-Art Results on Switchboard. Proc. Interspeech 2020, 551-555, DOI: 10.21437/Interspeech.2020-1488.


@inproceedings{Tüske2020,
  author={Zoltán Tüske and George Saon and Kartik Audhkhasi and Brian Kingsbury},
  title={{Single Headed Attention Based Sequence-to-Sequence Model for State-of-the-Art Results on Switchboard}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={551--555},
  doi={10.21437/Interspeech.2020-1488},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1488}
}