13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition

Navdeep Jaitly (1), Patrick Nguyen (2), Andrew Senior (3), Vincent Vanhoucke (2)

(1) Dept. of Computer Science, University of Toronto, Toronto, ON, Canada
(2) Google, Inc., Mountain View, CA, USA
(3) Google, Inc., New York, NY, USA

The use of Deep Belief Networks (DBNs) to pretrain neural networks has recently led to a resurgence of Artificial Neural Network - Hidden Markov Model (ANN/HMM) hybrid systems for Automatic Speech Recognition (ASR). In this paper we report results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any previously reported for DBN-pretrained ANN/HMM systems: 5870 hours of Voice Search and 1400 hours of YouTube data. On the first dataset, the pretrained ANN/HMM system outperforms the best Gaussian Mixture Model - Hidden Markov Model (GMM/HMM) baseline, built with a much larger dataset, by 3.7% absolute WER; on the second dataset, it outperforms the GMM/HMM baseline by 2.9% absolute. Maximum Mutual Information (MMI) fine-tuning and model combination using Segmental Conditional Random Fields (SCARF) give additional gains of 0.1% and 0.4% absolute on the first dataset, and 0.6% and 1.1% absolute on the second.
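The DBN pretraining referred to in the abstract is the standard greedy layer-wise procedure: each layer is trained as a Restricted Boltzmann Machine (RBM) with contrastive divergence (CD-1), and each trained layer's hidden activations become the input to the next. The sketch below is an illustrative minimal implementation of that generic procedure, not the paper's actual training code; all names (`RBM`, `pretrain_stack`), the layer sizes, and the learning rate are assumptions chosen for the example.

```python
import numpy as np


def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class RBM:
    """A minimal binary Restricted Boltzmann Machine (illustrative sketch)."""

    def __init__(self, n_vis, n_hid, rng):
        # Small random weights, zero biases: a common initialization.
        self.W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))
        self.b_vis = np.zeros(n_vis)
        self.b_hid = np.zeros(n_hid)

    def hidden_probs(self, v):
        return _sigmoid(v @ self.W + self.b_hid)

    def visible_probs(self, h):
        return _sigmoid(h @ self.W.T + self.b_vis)

    def cd1_step(self, v0, rng, lr=0.1):
        # One step of contrastive divergence (CD-1) on a data batch v0.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)   # one Gibbs reconstruction
        h1 = self.hidden_probs(v1)
        n = len(v0)
        # Approximate gradient: positive minus negative phase statistics.
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_vis += lr * (v0 - v1).mean(axis=0)
        self.b_hid += lr * (h0 - h1).mean(axis=0)


def pretrain_stack(data, layer_sizes, epochs=5, seed=0):
    """Greedy layer-wise pretraining: train one RBM per layer,
    feeding each layer's hidden probabilities to the next."""
    rng = np.random.default_rng(seed)
    rbms, x = [], data
    for n_hid in layer_sizes:
        rbm = RBM(x.shape[1], n_hid, rng)
        for _ in range(epochs):
            rbm.cd1_step(x, rng)
        rbms.append(rbm)
        x = rbm.hidden_probs(x)  # input for the next layer
    return rbms
```

In a hybrid ANN/HMM system the pretrained weights would then initialize a feed-forward network that is fine-tuned discriminatively (e.g. with cross-entropy and, as in the abstract, MMI) to predict context-dependent HMM states.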

Index Terms: Deep Belief Networks, Acoustic Modeling, Artificial Neural Network, ANN/HMM

Full Paper

Bibliographic reference. Jaitly, Navdeep / Nguyen, Patrick / Senior, Andrew / Vanhoucke, Vincent (2012): "Application of pretrained deep neural networks to large vocabulary speech recognition", in INTERSPEECH-2012, 2578-2581.