Robust Speech Recognition Using Generalized Distillation Framework

Konstantin Markov, Tomoko Matsui

In this paper, we propose a noise-robust speech recognition system built using the generalized distillation framework. It is assumed that during training, in addition to the training data, some kind of “privileged” information is available and can be used to guide the training process. This makes it possible to obtain a system which at test time outperforms those built on regular training data alone. In the case of the noisy speech recognition task, the privileged information is obtained from a model, called the “teacher”, trained on clean speech only. The regular model, called the “student”, is trained on noisy utterances and uses the teacher’s outputs for the corresponding clean utterances. This framework therefore requires parallel clean/noisy speech data. We experimented on the Aurora2 database, which provides such data. Our system uses a hybrid DNN-HMM acoustic model in which neural networks provide HMM state probabilities during decoding. The teacher DNN is trained on the clean data, while the student DNN is trained using multi-condition (various SNRs) data. The student DNN loss function combines the targets obtained from forced alignment of the training data and the outputs of the teacher DNN when fed with the corresponding clean features. Experimental results clearly show that the distillation framework is effective and achieves a significant reduction in the word error rate.
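The student loss described above can be sketched as a weighted combination of the cross-entropy against the forced-alignment hard targets and the cross-entropy against the teacher's soft outputs. The sketch below is illustrative only: the weight `lam` (an imitation parameter) and the temperature-scaled softmax are standard choices in the generalized distillation literature and are assumptions here, not the paper's exact formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, hard_label, teacher_probs, lam=0.5):
    """Combined student loss (illustrative):
    lam * CE(hard alignment target) + (1 - lam) * CE(teacher soft targets).
    `lam` is a hypothetical imitation weight, not taken from the paper."""
    p = softmax(student_logits)
    # Cross-entropy against the one-hot forced-alignment target.
    hard_ce = -math.log(p[hard_label])
    # Cross-entropy against the teacher's posterior distribution
    # computed on the corresponding clean utterance.
    soft_ce = -sum(q * math.log(pi) for q, pi in zip(teacher_probs, p))
    return lam * hard_ce + (1.0 - lam) * soft_ce
```

With `lam = 1.0` this reduces to ordinary supervised training on the noisy data; lowering `lam` shifts weight toward imitating the teacher's clean-speech posteriors.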

DOI: 10.21437/Interspeech.2016-852

Cite as

Markov, K., Matsui, T. (2016) Robust Speech Recognition Using Generalized Distillation Framework. Proc. Interspeech 2016, 2364-2368.

@inproceedings{markov16_interspeech,
  author={Konstantin Markov and Tomoko Matsui},
  title={Robust Speech Recognition Using Generalized Distillation Framework},
  booktitle={Interspeech 2016},
  year={2016},
  pages={2364--2368},
  doi={10.21437/Interspeech.2016-852}
}