Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition

Yevgen Chebotar, Austin Waters

Speech recognition systems that combine multiple types of acoustic models have been shown to outperform single-model systems. However, such systems can be complex to implement and too resource-intensive to use in production. This paper describes how to use knowledge distillation to combine acoustic models in a way that has the best of many worlds: It improves recognition accuracy significantly, can be implemented with standard training tools, and requires no additional complexity during recognition. First, we identify a simple but particularly strong type of ensemble: a late combination of recurrent neural networks with different architectures and training objectives. To harness such an ensemble, we use a variant of standard cross-entropy training to distill it into a single model and then discriminatively fine-tune the result. An evaluation on 2,000-hour large vocabulary tasks in 5 languages shows that the distilled models provide up to 8.9% relative WER improvement over conventionally-trained baselines with an identical number of parameters.
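The distillation recipe the abstract describes — training a single student against the frame posteriors of a late-combined ensemble via cross-entropy with soft targets — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the plain posterior averaging are assumptions for the example.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis (the senone/state dimension).
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_posteriors(list_of_model_logits):
    # "Late combination": average the per-frame posteriors of several models.
    return np.mean([softmax(l) for l in list_of_model_logits], axis=0)

def distillation_loss(student_logits, soft_targets):
    # Cross-entropy of the student's predictions against the ensemble's
    # soft targets, averaged over frames. Standard CE training tools can
    # minimize this once the targets are swapped from one-hot to soft.
    p_student = softmax(student_logits)
    return -np.mean(np.sum(soft_targets * np.log(p_student + 1e-12), axis=-1))
```

A student whose logits agree with the ensemble incurs a lower loss than one that disagrees, which is what drives the student toward the ensemble's behavior before any discriminative fine-tuning.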

DOI: 10.21437/Interspeech.2016-1190

Cite as


Chebotar, Y., Waters, A. (2016) Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition. Proc. Interspeech 2016, 3439-3443.

@inproceedings{chebotar2016distilling,
  author={Yevgen Chebotar and Austin Waters},
  title={Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition},
  booktitle={Interspeech 2016},
  year={2016},
  pages={3439--3443},
  doi={10.21437/Interspeech.2016-1190}
}