Recently, a remarkable result of 23.0% Phone Error Rate (PER) on the TIMIT core test set was reported by applying Deep Belief Networks (DBNs) to phonetic recognition. Despite this good performance, the reported design still leaves substantial room for improvement. In this letter, we present an improved yet simple architecture for phonetic recognition that uses the log-Mel spectrum directly instead of Mel-Frequency Cepstral Coefficients (MFCCs), and combines deep learning with conventional Baum-Welch re-estimation for sub-phoneme alignment. Experiments on the TIMIT speech corpus show that the proposed method outperforms most conventional methods, yielding 21.4% PER on the complete TIMIT test set and 22.1% on the core test set.
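The log-Mel front end that the abstract contrasts with MFCC can be sketched as below. This is a minimal illustration, not the paper's implementation: the frame size (25 ms), hop (10 ms), and filterbank size (40 mel filters) are assumed, typical values for 16 kHz TIMIT audio, and only NumPy is used.

```python
import numpy as np

def hz_to_mel(f):
    # Convert Hz to the mel scale (O'Shaughnessy formula).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=40):
    # Frame the signal (25 ms windows, 10 ms hop at 16 kHz),
    # apply a Hamming window, and take the power spectrum.
    frames = [signal[s:s + n_fft] * np.hamming(n_fft)
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), n=n_fft)) ** 2

    # Triangular mel filterbank spanning 0 Hz to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):                     # rising slope
            fbank[i, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):                     # falling slope
            fbank[i, k] = (hi - k) / max(hi - c, 1)

    # Log compression; the floor avoids log(0). MFCC would apply a
    # DCT after this step, which this front end deliberately skips.
    return np.log(np.maximum(power @ fbank.T, 1e-10))

# Usage: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)  # (98, 40): 98 frames, 40 log-Mel coefficients
```

Skipping the final DCT keeps the correlated filterbank energies, which deep networks can exploit directly.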
Bibliographic reference. Lee, Jaehyung / Lee, Soo-Young (2011): "Deep learning of speech features for improved phonetic recognition", In INTERSPEECH-2011, 1249-1252.