12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Feature Frame Stacking in RNN-Based Tandem ASR Systems - Learned vs. Predefined Context

Martin Wöllmer, Björn Schuller, Gerhard Rigoll

Technische Universität München, Germany

As phoneme recognition is known to profit from techniques that consider contextual information, neural networks applied in Tandem automatic speech recognition (ASR) systems usually employ some form of context modeling. While approaches based on multi-layer perceptrons or recurrent neural networks (RNNs) can model a predefined amount of context by simultaneously processing a stacked sequence of successive feature vectors, bidirectional Long Short-Term Memory (BLSTM) networks were shown to be well-suited for incorporating a self-learned amount of context for phoneme prediction. In this paper, we evaluate combinations of BLSTM modeling and frame stacking to determine the most efficient method for exploiting context in RNN-based Tandem systems. Using the COSINE corpus and our recently introduced multi-stream BLSTM-HMM decoder, we provide empirical evidence for the intuition that BLSTM networks make frame stacking redundant while RNNs profit from predefined feature-level context.
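
To make the notion of "predefined context" concrete, the following is a minimal sketch of feature frame stacking, not the authors' implementation. The function name `stack_frames`, the `left` and `right` window parameters, and the choice to repeat the first and last frame at utterance boundaries are all assumptions made for illustration; the abstract does not specify any of them.

```python
def stack_frames(frames, left=1, right=1):
    """Concatenate each frame with its `left` predecessors and `right` successors.

    `frames` is a list of equal-length feature vectors (lists of numbers).
    Boundary handling is an assumption: indices outside the utterance are
    clamped, so the first and last frames are repeated as padding.
    """
    num_frames = len(frames)
    stacked = []
    for t in range(num_frames):
        window = []
        for offset in range(-left, right + 1):
            # Clamp the index to the valid range [0, num_frames - 1].
            idx = min(max(t + offset, 0), num_frames - 1)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy input: three frames with two features each (hypothetical values).
toy_frames = [[0, 1], [2, 3], [4, 5]]
toy_stacked = stack_frames(toy_frames, left=1, right=1)
```

Each stacked vector has (left + right + 1) times the original dimensionality, so the network input layer sees a fixed window of context chosen in advance, in contrast to the self-learned context of a BLSTM network.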


Bibliographic reference. Wöllmer, Martin / Schuller, Björn / Rigoll, Gerhard (2011): "Feature frame stacking in RNN-based tandem ASR systems - learned vs. predefined context", in INTERSPEECH-2011, 1233-1236.