13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Discriminative Feature-space Transforms Using Deep Neural Networks

George Saon, Brian Kingsbury

IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

We present a deep neural network (DNN) architecture that learns time-dependent offsets to acoustic feature vectors according to a discriminative objective function such as maximum mutual information (MMI) between the reference words and the transformed acoustic observation sequence. A key ingredient in this technique is a greedy layer-wise pretraining of the network based on minimum squared error between the DNN outputs and the offsets provided by a linear feature-space MMI (FMMI) transform. Next, the weights of the pretrained network are updated with stochastic gradient ascent by backpropagating the MMI gradient through the DNN layers. Experiments on a 50-hour English broadcast news transcription task show a 4% relative improvement using a 6-layer DNN transform over a state-of-the-art speaker-adapted system with FMMI and model-space discriminative training.

Index Terms: speech recognition, deep neural networks
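
The pretraining stage described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the layer sizes, tanh nonlinearity, learning rate, full-batch updates, and the particular greedy scheme (add one hidden layer at a time under a fresh output layer, then retrain on squared error) are all assumptions chosen for illustration, and the "FMMI offsets" are synthetic stand-ins produced by a random linear map. The subsequent MMI fine-tuning step is omitted because it needs a full recognizer and lattices.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases (an illustrative choice).
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, hidden, out):
    # Returns the activations of every layer and the predicted offsets.
    acts = [x]
    for W, b in hidden:
        acts.append(np.tanh(acts[-1] @ W + b))
    W_o, b_o = out
    return acts, acts[-1] @ W_o + b_o

def mse_step(x, target, hidden, out, lr):
    # One full-batch gradient-descent step on the squared error
    # between predicted offsets and target offsets.
    acts, pred = forward(x, hidden, out)
    err = pred - target
    loss = float(np.mean(np.sum(err ** 2, axis=1)))
    grad = 2.0 * err / len(x)            # d loss / d pred
    W_o, b_o = out
    new_out = (W_o - lr * acts[-1].T @ grad, b_o - lr * grad.sum(axis=0))
    delta = grad @ W_o.T                 # gradient w.r.t. last hidden activation
    new_hidden = list(hidden)
    for i in reversed(range(len(hidden))):
        W, b = hidden[i]
        dz = delta * (1.0 - acts[i + 1] ** 2)   # tanh derivative
        new_hidden[i] = (W - lr * acts[i].T @ dz, b - lr * dz.sum(axis=0))
        delta = dz @ W.T
    return new_hidden, new_out, loss

def pretrain(x, target, n_hidden_layers, width, steps, lr, seed=0):
    # Greedy layer-wise growth (assumed scheme): append a hidden layer,
    # attach a fresh output layer, train the whole stack on squared error.
    rng = np.random.default_rng(seed)
    hidden, losses = [], []
    for _ in range(n_hidden_layers):
        n_in = x.shape[1] if not hidden else width
        hidden.append(init_layer(n_in, width, rng))
        out = init_layer(width, target.shape[1], rng)
        for _ in range(steps):
            hidden, out, loss = mse_step(x, target, hidden, out, lr)
        losses.append(loss)
    return hidden, out, losses

def transform(x, hidden, out):
    # Transformed features: input frame plus the DNN-predicted offset.
    return x + forward(x, hidden, out)[1]

# Synthetic demo: 200 frames of 4-dimensional features, with target
# offsets from a random linear map standing in for FMMI offsets.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 4))
target = 0.2 * (x @ rng.normal(size=(4, 4)))
hidden, out, losses = pretrain(x, target, n_hidden_layers=2, width=16,
                               steps=400, lr=0.02)
y = transform(x, hidden, out)
```

In this sketch, `losses` holds the squared error at the end of each growth stage, and `y` has the same shape as `x`. In the method the abstract describes, the pretrained weights would then be updated by gradient ascent on the MMI objective instead of squared error.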


Bibliographic reference. Saon, George / Kingsbury, Brian (2012): "Discriminative feature-space transforms using deep neural networks", in INTERSPEECH-2012, 14-17.