13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Intrinsic Spectral Analysis for Zero and High Resource Speech Recognition

Aren Jansen, Samuel Thomas, Hynek Hermansky

Human Language Technology Center of Excellence, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA

The constraints of the speech production apparatus imply that our vocalizations are approximately restricted to a low-dimensional manifold embedded in a high-dimensional space. Manifold learning algorithms provide a means to recover the approximate embedding from untranscribed data and enable use of the manifold's intrinsic distance metric to characterize acoustic similarity for downstream automatic speech applications. In this paper, we consider a previously unevaluated nonlinear out-of-sample extension for intrinsic spectral analysis (ISA) and investigate its performance in both unsupervised and supervised tasks. In the zero resource regime, where the lack of transcribed resources forces us to rely solely on the phonetic salience of the acoustic features themselves, ISA provides substantial gains relative to canonical acoustic front-ends. When large amounts of transcribed speech are also available for supervised acoustic model training, we find that the data-driven intrinsic spectrogram matches the performance of, and is complementary to, its signal-processing-derived counterparts.
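To illustrate the general idea of recovering an embedding from unlabeled data and then extending it to unseen points, the following is a minimal sketch of a Laplacian-eigenmaps-style embedding with a Nyström-style out-of-sample extension. It is a generic manifold learning example, not the ISA formulation evaluated in the paper: the function names (`gaussian_affinity`, `fit_embedding`, `extend`), the dense Gaussian affinity, the bandwidth `sigma`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def gaussian_affinity(A, B, sigma):
    """Dense Gaussian affinities between the rows of A and the rows of B."""
    sq_a = np.sum(A ** 2, axis=1)[:, None]
    sq_b = np.sum(B ** 2, axis=1)[None, :]
    # Squared Euclidean distances, clipped at zero to absorb rounding error.
    d2 = np.maximum(sq_a + sq_b - 2.0 * A @ B.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def fit_embedding(X, n_components=2, sigma=1.0):
    """Embed unlabeled points using eigenvectors of the normalized affinity.

    Assumption: a dense graph is used for brevity; practical systems usually
    sparsify it (for example with k nearest neighbours).
    """
    W = gaussian_affinity(X, X, sigma)
    deg = W.sum(axis=1)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    S = W / np.sqrt(np.outer(deg, deg))
    vals, vecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    # Largest eigenvalues of S correspond to the smoothest functions on the
    # graph; the top one is trivial, so it is skipped.
    order = np.argsort(vals)[::-1][1:n_components + 1]
    return {"X": X, "sigma": sigma, "deg": deg,
            "vals": vals[order], "vecs": vecs[:, order]}


def extend(model, X_new):
    """Nystrom-style extension of the embedding to points not seen in fitting."""
    K = gaussian_affinity(X_new, model["X"], model["sigma"])
    deg_new = K.sum(axis=1)
    K_norm = K / np.sqrt(np.outer(deg_new, model["deg"]))
    # Project onto the training eigenvectors and rescale by the eigenvalues.
    return (K_norm @ model["vecs"]) / model["vals"]


# Hypothetical usage on synthetic points near a one-dimensional curve in 3-D.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 3.0, size=120)
X_train = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)
model = fit_embedding(X_train, n_components=2, sigma=0.5)
t_new = rng.uniform(0.0, 3.0, size=5)
X_new = np.stack([np.cos(t_new), np.sin(t_new), 0.1 * t_new], axis=1)
Y_new = extend(model, X_new)  # one 2-D embedding row per new point
```

With this normalization, extending a point that was part of the training set reproduces its fitted coordinates, which is a convenient sanity check for an out-of-sample map of this kind.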

Index Terms: intrinsic spectral analysis, manifold learning, speech recognition, zero resource


Bibliographic reference. Jansen, Aren / Thomas, Samuel / Hermansky, Hynek (2012): "Intrinsic spectral analysis for zero and high resource speech recognition", in INTERSPEECH-2012, 879-882.