Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition

Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas


In this work, we propose a novel and efficient minimum word error rate (MWER) training method for RNN-Transducer (RNN-T). Previous work on this topic performs on-the-fly, limited-size beam-search decoding and generates alignment scores for the expected edit-distance computation. In contrast, our method re-calculates and sums the scores of all possible alignments for each hypothesis in the N-best list. The hypothesis probability scores and back-propagated gradients are calculated efficiently using the forward-backward algorithm. Moreover, the proposed method decouples the decoding and training processes, so we can perform offline parallel decoding and MWER training on each subset iteratively. Experimental results show that this semi-on-the-fly method is 6 times faster than the on-the-fly method and yields a similar WER improvement (3.6%) over a baseline RNN-T model. The proposed MWER training also effectively reduces the high deletion errors (9.2% WER reduction) introduced by RNN-T models when EOS is added for the end-pointer. Further improvement can be achieved by using a proposed RNN-T rescoring method to re-rank hypotheses and an external RNN-LM to perform additional rescoring. The best system achieves a 5% relative improvement on an English test set of real far-field recordings and an 11.6% WER reduction on music-domain utterances.
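The two computations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard RNN-T forward recursion for summing over alignments, and it assumes the common MWER formulation in which hypothesis probabilities are renormalized over the N-best list before taking the expected number of word errors. The function names `rnnt_log_prob` and `mwer_loss`, and the input layout of `log_blank` and `log_label`, are hypothetical and chosen only for illustration.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def rnnt_log_prob(log_blank, log_label):
    """Log-probability of one hypothesis, summed over all alignments.

    Assumed (hypothetical) input layout for T frames and U labels:
      log_blank[t][u]: log-prob of blank at frame t after u labels, T x (U+1)
      log_label[t][u]: log-prob of label u+1 at frame t after u labels, T x U
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            terms = []
            if t > 0:  # arrive by emitting blank at the previous frame
                terms.append(alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0:  # arrive by emitting the u-th label at this frame
                terms.append(alpha[t][u - 1] + log_label[t][u - 1])
            alpha[t][u] = logsumexp(terms)
    # A final blank at the last frame terminates every alignment.
    return alpha[T - 1][U] + log_blank[T - 1][U]


def mwer_loss(hyp_log_probs, word_errors):
    """Expected word errors under the N-best posterior.

    Assumption: probabilities are renormalized over the N-best list.
    """
    log_z = logsumexp(hyp_log_probs)
    posteriors = [math.exp(lp - log_z) for lp in hyp_log_probs]
    return sum(p * e for p, e in zip(posteriors, word_errors))


if __name__ == "__main__":
    # Toy case: 2 frames, 1 label, every emission has probability 0.5.
    h = math.log(0.5)
    lp = rnnt_log_prob([[h, h], [h, h]], [[h], [h]])
    print(math.exp(lp))
    # Toy 2-best list with 0 and 2 word errors.
    print(mwer_loss([math.log(0.6), math.log(0.2)], [0, 2]))
```

In an actual training setup the forward recursion would be written in an automatic-differentiation framework, or paired with the backward pass, so that gradients of the loss flow back to the RNN-T outputs; this sketch only computes the scalar values.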


DOI: 10.21437/Interspeech.2020-1557

Cite as: Guo, J., Tiwari, G., Droppo, J., Segbroeck, M.V., Huang, C., Stolcke, A., Maas, R. (2020) Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition. Proc. Interspeech 2020, 2807-2811, DOI: 10.21437/Interspeech.2020-1557.


@inproceedings{Guo2020,
  author={Jinxi Guo and Gautam Tiwari and Jasha Droppo and Maarten Van Segbroeck and Che-Wei Huang and Andreas Stolcke and Roland Maas},
  title={{Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2807--2811},
  doi={10.21437/Interspeech.2020-1557},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1557}
}