Sequence Student-Teacher Training of Deep Neural Networks

Jeremy H.M. Wong, Mark J.F. Gales

The performance of automatic speech recognition can often be significantly improved by combining multiple systems together. Though beneficial, ensemble methods can be computationally expensive, often requiring multiple decoding runs. An alternative approach, appropriate for deep learning schemes, is to adopt student-teacher training. Here, a student model is trained to reproduce the outputs of a teacher model, or ensemble of teachers. The standard approach is to train the student model on the frame posterior outputs of the teacher. This paper examines the interaction between student-teacher training schemes and sequence training criteria, which have been shown to yield significant performance gains over frame-level criteria. There are several possible options for integrating sequence training, including training of the ensemble and further training of the student. This paper also proposes an extension to the student-teacher framework, where the student is trained to emulate the hypothesis posterior distribution of the teacher, or ensemble of teachers. This sequence student-teacher training approach allows the benefit of student-teacher training to be directly combined with sequence training schemes. These approaches are evaluated on two speech recognition tasks: a Wall Street Journal-based task and a low-resource Tok Pisin conversational telephone speech task from the IARPA Babel programme.
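The standard frame-level student-teacher criterion mentioned above trains the student to match the teacher's per-frame state posteriors, typically by minimising the KL divergence between the two distributions at each frame (equivalently, the cross-entropy, since the teacher-entropy term is constant with respect to the student). A minimal NumPy sketch of this frame-level loss is shown below; the function names and the averaging of an ensemble's posteriors are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frame_st_loss(teacher_post, student_logits):
    """Frame-level student-teacher loss: cross-entropy between the
    teacher's frame posteriors and the student's, averaged over frames.
    Minimising this is equivalent to minimising KL(teacher || student).

    teacher_post:   (frames, classes) teacher posterior probabilities
    student_logits: (frames, classes) student pre-softmax outputs
    """
    log_q = np.log(softmax(student_logits))
    return -np.mean(np.sum(teacher_post * log_q, axis=-1))

def ensemble_posterior(teacher_logit_list):
    # Illustrative assumption: combine an ensemble of teachers by
    # averaging their frame posteriors into a single target distribution.
    return np.mean([softmax(l) for l in teacher_logit_list], axis=0)
```

The loss reaches its minimum (the teacher's entropy) exactly when the student's posteriors equal the teacher's, which is why the student can approach the ensemble's behaviour with a single decoding run.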

DOI: 10.21437/Interspeech.2016-911

Cite as:

Wong, J.H.M., Gales, M.J.F. (2016) Sequence Student-Teacher Training of Deep Neural Networks. Proc. Interspeech 2016, 2761-2765.

@inproceedings{wong16_interspeech,
  author={Jeremy H.M. Wong and Mark J.F. Gales},
  title={Sequence Student-Teacher Training of Deep Neural Networks},
  booktitle={Interspeech 2016},
  year={2016},
  pages={2761--2765},
  doi={10.21437/Interspeech.2016-911}
}