Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition

Gakuto Kurata, George Saon


End-to-end training of recurrent neural network transducers (RNN-Ts) does not require frame-level alignments between audio and output symbols. As a result, the posterior lattices defined by the predictive distributions of different RNN-Ts trained on the same data can differ significantly, which poses a new set of challenges for knowledge distillation between such models. These discrepancies are especially prominent between the posterior lattices of an offline model and a streaming model, as expected from the fact that a streaming RNN-T emits symbols later than an offline RNN-T. We propose a method to train an RNN-T so that the posterior peaks at each node of its posterior lattice are aligned with those of a pretrained model for the same utterance. With this method, we can train an offline RNN-T that serves as a good teacher for a student streaming RNN-T. Experimental results on the standard Switchboard conversational telephone speech corpus demonstrate accuracy improvements for a streaming unidirectional RNN-T through knowledge distillation from an offline bidirectional counterpart.
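To make the distillation setting concrete, the sketch below shows a generic lattice-level distillation loss for RNN-Ts: a KL divergence between the teacher's and student's per-node posteriors over the [T, U, V] lattice (acoustic frames × label positions × vocabulary including blank). This is a minimal illustration of the kind of objective the abstract refers to, not the authors' implementation; the function name, tensor layout, and temperature parameter are assumptions, and the paper's peak-alignment training of the teacher is a separate step not shown here.

```python
import torch
import torch.nn.functional as F


def rnnt_lattice_distillation_loss(student_logits: torch.Tensor,
                                   teacher_logits: torch.Tensor,
                                   temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical KL-divergence distillation over the RNN-T posterior lattice.

    Both tensors have shape [B, T, U, V]: batch, acoustic frames, label
    positions, and vocabulary (including the blank symbol). Each (t, u)
    node of the lattice holds a categorical distribution over the next
    emission; the loss matches the student's distributions to the
    teacher's at every node.
    """
    # Temperature-softened teacher posteriors serve as soft targets.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), summed over all lattice nodes and
    # averaged over the batch; the T^2 factor is the usual scaling
    # used with temperature-based distillation.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2
```

In practice such a term would be combined with the standard RNN-T transducer loss on the ground-truth transcript; the weighting between the two is a tuning choice not specified here.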


DOI: 10.21437/Interspeech.2020-2442

Cite as: Kurata, G., Saon, G. (2020) Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition. Proc. Interspeech 2020, 2117-2121, DOI: 10.21437/Interspeech.2020-2442.


@inproceedings{Kurata2020,
  author={Gakuto Kurata and George Saon},
  title={{Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2117--2121},
  doi={10.21437/Interspeech.2020-2442},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2442}
}