13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Context-Dependent MLPs for LVCSR: TANDEM, Hybrid or Both?

Zoltán Tüske, Martin Sundermeyer, Ralf Schlüter, Hermann Ney

Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Germany

Gaussian Mixture Model (GMM) and Multi Layer Perceptron (MLP) based acoustic models are compared on a French large vocabulary continuous speech recognition (LVCSR) task. In addition to optimizing the output layer size of the MLP, the effect of the deep neural network structure is also investigated. Moreover, using different linear transformations (time derivatives, LDA, CMLLR) on conventional MFCC, the study is also extended to MLP based probabilistic and bottle-neck TANDEM features. Results show that using either the hybrid or bottle-neck TANDEM approach leads to similar recognition performance. However, the best performance is achieved when deep MLP acoustic models are trained on concatenated cepstral and context-dependent bottle-neck features. Further experiments reveal the importance of the neighbouring frames in case of MLP based modeling, and that its gain over GMM acoustic models is strongly reduced by more complex features.
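The feature handling mentioned in the abstract (concatenating cepstral and bottle-neck features, then giving the MLP neighbouring frames as input) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `concat_features` and `stack_context`, the toy dimensions, the one-frame context window, and the edge padding by frame repetition are all assumptions chosen for illustration.

```python
def concat_features(cepstral, bottleneck):
    """Frame-wise concatenation of two feature streams of equal length."""
    if len(cepstral) != len(bottleneck):
        raise ValueError("feature streams must have the same number of frames")
    return [c + b for c, b in zip(cepstral, bottleneck)]


def stack_context(frames, left=1, right=1):
    """Splice each frame with its neighbours.

    Assumption: utterance edges are padded by repeating the first/last frame.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to valid frame range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy data (made-up values): 3 frames, 2-dim "MFCC" and 1-dim "bottle-neck".
mfcc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
bn = [[0.1], [0.2], [0.3]]

tandem = concat_features(mfcc, bn)                  # 3 frames x 3 dims
mlp_input = stack_context(tandem, left=1, right=1)  # 3 frames x 9 dims
```

In a real system the feature dimensions and the context window would be much larger; the sketch only shows the order of operations (concatenate the streams per frame, then splice neighbouring frames).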

Index Terms: HMM, GMM, MLP, bottle-neck, hybrid, ASR, TANDEM

Bibliographic reference.  Tüske, Zoltán / Sundermeyer, Martin / Schlüter, Ralf / Ney, Hermann (2012): "Context-dependent MLPs for LVCSR: TANDEM, hybrid or both?", In INTERSPEECH-2012, 18-21.