Deep Learning-Based Telephony Speech Recognition in the Wild

Kyu J. Han, Seongjun Hahm, Byung-Hak Kim, Jungsuk Kim, Ian Lane


In this paper, we explore the effectiveness of a variety of deep learning-based acoustic models for conversational telephony speech, specifically TDNN, bLSTM, and CNN-bLSTM models. We evaluated these models both on research test sets, such as Switchboard and CallHome, and on recordings from a real-world call-center application. Our best single system, consisting of one CNN-bLSTM acoustic model, obtained a WER of 5.7% on the Switchboard test set, and combining it with other models reduced the WER to 5.3%. On the CallHome test set, a WER of 10.1% was achieved with model combination. On test data collected from real-world call centers, the WER was significantly higher at 15.0%, even after model adaptation with application-specific data. We performed an error analysis on the real-world data and highlight the areas where speech recognition still faces challenges.
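For readers unfamiliar with the metric, the WER figures above count word-level substitutions, deletions, and insertions against a reference transcript. A minimal sketch of the computation (our own illustration via dynamic-programming edit distance, not the authors' scoring pipeline, which would typically use a standard tool such as NIST sclite):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over word sequences, built with a DP table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

A WER of 5.7%, as reported on Switchboard, thus means roughly 5.7 word errors per 100 reference words after optimal alignment.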


DOI: 10.21437/Interspeech.2017-1695

Cite as: Han, K.J., Hahm, S., Kim, B.-H., Kim, J., Lane, I. (2017) Deep Learning-Based Telephony Speech Recognition in the Wild. Proc. Interspeech 2017, 1323-1327, DOI: 10.21437/Interspeech.2017-1695.


@inproceedings{Han2017,
  author={Kyu J. Han and Seongjun Hahm and Byung-Hak Kim and Jungsuk Kim and Ian Lane},
  title={Deep Learning-Based Telephony Speech Recognition in the Wild},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1323--1327},
  doi={10.21437/Interspeech.2017-1695},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1695}
}