1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM

Kshitiz Kumar, Chaojun Liu, Yifan Gong, Jian Wu

In this work we develop a simple, efficient, and compact automatic speech recognition (ASR) model based purely on a 1-dimensional row-convolution (RC) operation. We refer to the proposed model as the 1-dim row-convolution LSTM (RC-LSTM), in which a 1-dim RC operation embeds limited future information into standard unidirectional LSTMs (UniLSTMs). We target fast streaming ASR solutions and establish ASR accuracy parity with the latency-controlled bidirectional LSTM (LC-BLSTM). We apply future information at both the ASR feature and hidden-layer stages. We study connections with related techniques, analyze tradeoffs, and recommend a uniform future lookahead for all hidden layers. We argue that our architecture implicitly factorizes training into orthogonal time and “frequency” dimensions for effective learning on large-scale tasks. We conduct a series of experiments at medium scale, with a 6k-hour English corpus, and at large scale, with 60k hours of training data. We demonstrate our findings across unified ASR tasks. Compared to the UniLSTM model, RC-LSTM achieved a 16% relative reduction in word error rate (WER). RC-LSTM also achieved accuracy parity with LC-BLSTM on large-scale tasks at significantly lower latency and computational cost.
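
To make the idea of embedding limited future information concrete, here is a minimal NumPy sketch of a generic 1-D row convolution over a sequence of hidden states. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact layer equations, so the formulation used here (each output frame is a per-dimension weighted sum of the current frame and a fixed number of future frames, with zero padding at the end of the sequence) is the commonly used form of row convolution. The names `row_convolution`, `h`, and `w` are hypothetical.

```python
import numpy as np


def row_convolution(h, w):
    """Generic 1-D row convolution with a fixed future lookahead (a sketch).

    h: array of shape (T, D), e.g. hidden states of a unidirectional LSTM.
    w: array of shape (tau + 1, D), one weight per lookahead step and per
       feature dimension (assumed shape, not taken from the paper).

    Returns an array of shape (T, D) where
        out[t] = sum_{j=0..tau} w[j] * h[t + j]
    and frames beyond the end of the sequence are treated as zeros.
    """
    h = np.asarray(h, dtype=float)
    w = np.asarray(w, dtype=float)
    num_frames, dim = h.shape
    taps = w.shape[0]  # tau + 1: current frame plus tau future frames

    # Zero-pad the future side so every frame has tau successors.
    padded = np.concatenate([h, np.zeros((taps - 1, dim))], axis=0)

    out = np.zeros((num_frames, dim))
    for j in range(taps):
        # w[j] broadcasts across time; each feature dimension is
        # filtered independently along the time axis only.
        out += w[j] * padded[j:j + num_frames]
    return out
```

Because each feature dimension is filtered only along time, the lookahead adds `tau` frames of latency and `(tau + 1) * D` weights per layer in this sketch, which is the kind of tradeoff the abstract refers to when it recommends a uniform lookahead across hidden layers.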

DOI: 10.21437/Interspeech.2020-2894

Cite as: Kumar, K., Liu, C., Gong, Y., Wu, J. (2020) 1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM. Proc. Interspeech 2020, 2107-2111, DOI: 10.21437/Interspeech.2020-2894.

  author={Kshitiz Kumar and Chaojun Liu and Yifan Gong and Jian Wu},
  title={{1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM}},
  booktitle={Proc. Interspeech 2020},