12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Deep Learning of Speech Features for Improved Phonetic Recognition

Jaehyung Lee, Soo-Young Lee

KAIST, Korea

Recently, a Phone Error Rate (PER) of 23.0% on the TIMIT core test set was reported by applying a Deep Belief Network (DBN) to phonetic recognition [1]. Despite this strong result, the reported design leaves substantial room for improvement. In this letter, we present an improved but simple architecture for phonetic recognition that uses the log-Mel spectrum directly instead of Mel-Frequency Cepstral Coefficients (MFCCs), and combines Deep Learning with conventional Baum-Welch re-estimation for sub-phoneme alignment. Experiments on the TIMIT speech corpus show that the proposed method outperforms most conventional methods, yielding 21.4% PER on the complete TIMIT test set and 22.1% on the core test set.
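
The PER figures quoted above are edit-distance error rates. As a minimal sketch, assuming the standard definition (substitutions + deletions + insertions, divided by the number of reference phones), the computation might look like the following. The function name `phone_error_rate` and the toy phone sequences are hypothetical illustrations, not taken from the paper, and the sketch does not cover scoring details such as phone-set folding, which the abstract does not specify.

```python
def phone_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Hypothetical helper for illustration; both arguments are lists of
    phone labels.
    """
    if not reference:
        raise ValueError("reference must contain at least one phone")

    # prev[j] holds the edit distance between the reference phones seen
    # so far (before the current one) and the first j hypothesis phones.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        curr = [i]  # aligning i reference phones to nothing costs i deletions
        for j, hyp_phone in enumerate(hypothesis, start=1):
            cost = 0 if ref_phone == hyp_phone else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Toy example (made-up sequences): one deletion and one insertion.
ref = ["sil", "dh", "ih", "s", "ih", "z", "sil"]
hyp = ["sil", "dh", "ih", "s", "z", "ah", "sil"]
print(f"PER = {100 * phone_error_rate(ref, hyp):.1f}%")  # PER = 28.6%
```

Because insertions count as errors, a PER computed this way can exceed 100% when the hypothesis is much longer than the reference.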


  1. A.R. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Whistler, BC, Canada, Dec. 2009.


Bibliographic reference.  Lee, Jaehyung / Lee, Soo-Young (2011): "Deep learning of speech features for improved phonetic recognition", In INTERSPEECH-2011, 1249-1252.