Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition

Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Hejung Yang, Abhinav Garg, Sachin Singh, Jiyeon Kim, Mehul Kumar, Sichen Jin, Shatrughan Singh, Chanwoo Kim

In this paper, we propose utterance invariant training (UIT), designed specifically to improve the performance of a two-pass end-to-end hybrid ASR system. The proposed hybrid ASR solution uses a shared encoder with a monotonic chunkwise attention (MoChA) decoder for streaming, and a low-latency bidirectional full-attention (BFA) decoder to improve overall ASR accuracy. A sequence summary network (SSN) based utterance invariant training scheme is modified to suit the two-pass model architecture. The input feature stream, self-conditioned by scaling and shifting with its own sequence summary, is used as concatenative conditioning on the bidirectional encoder layers that sit on top of the shared encoder. In effect, the proposed utterance invariant training combines three types of conditioning: concatenative, multiplicative, and additive. Experimental results show word error rate reductions of up to 7% relative on Librispeech, and 10–15% on a large-scale Korean end-to-end two-pass hybrid ASR model.
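
The conditioning described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the layer types or sizes, so the tanh projection with mean pooling used as the sequence summary, the linear maps that produce the scale and shift vectors, the dimensions, and all function and parameter names (`sequence_summary`, `self_condition`, `concat_condition`, `w_sum`, and so on) are assumptions made for illustration. The sketch also assumes the input features and the shared-encoder output have the same number of frames.

```python
import numpy as np

def sequence_summary(x, w, b):
    """Summarize an utterance: per-frame tanh projection, then mean over time.
    x: (T, D) features -> (H,) summary vector. (Assumed form of the summary.)"""
    return np.tanh(x @ w + b).mean(axis=0)

def self_condition(x, params):
    """Scale and shift every frame of x using vectors derived from x's own summary."""
    s = sequence_summary(x, params["w_sum"], params["b_sum"])
    scale = s @ params["w_scale"]  # (D,) multiplicative conditioning
    shift = s @ params["w_shift"]  # (D,) additive conditioning
    return x * scale + shift       # broadcast over the time axis

def concat_condition(enc_out, x_cond):
    """Concatenative conditioning: append the self-conditioned features to the
    shared-encoder output, frame by frame."""
    return np.concatenate([enc_out, x_cond], axis=-1)

# Hypothetical sizes: T frames, D feature dims, H summary dims, E encoder dims.
rng = np.random.default_rng(0)
T, D, H, E = 50, 40, 16, 32
x = rng.standard_normal((T, D))        # input feature stream
enc_out = rng.standard_normal((T, E))  # stand-in for the shared-encoder output
params = {
    "w_sum": rng.standard_normal((D, H)) * 0.1,
    "b_sum": np.zeros(H),
    "w_scale": rng.standard_normal((H, D)) * 0.1,
    "w_shift": rng.standard_normal((H, D)) * 0.1,
}

x_cond = self_condition(x, params)
bfa_in = concat_condition(enc_out, x_cond)  # input to the bidirectional layers
```

In this sketch, `bfa_in` has `E + D` dimensions per frame, so the three conditioning types named in the abstract appear as the elementwise product with `scale`, the addition of `shift`, and the final concatenation.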

 DOI: 10.21437/Interspeech.2020-3230

Cite as: Gowda, D., Kumar, A., Kim, K., Yang, H., Garg, A., Singh, S., Kim, J., Kumar, M., Jin, S., Singh, S., Kim, C. (2020) Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition. Proc. Interspeech 2020, 2827-2831, DOI: 10.21437/Interspeech.2020-3230.

  author={Dhananjaya Gowda and Ankur Kumar and Kwangyoun Kim and Hejung Yang and Abhinav Garg and Sachin Singh and Jiyeon Kim and Mehul Kumar and Sichen Jin and Shatrughan Singh and Chanwoo Kim},
  title={{Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition}},
  booktitle={Proc. Interspeech 2020},