13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Data-driven Posterior Features for Low Resource Speech Recognition Applications

Samuel Thomas (1), Sriram Ganapathy (1), Aren Jansen (1,2), Hynek Hermansky (1,2)

(1) Center for Language and Speech Processing; (2) Human Language Technology Center of Excellence;
The Johns Hopkins University, Baltimore, MD, USA

In low resource settings, with very few hours of training data, state-of-the-art speech recognition systems that require large amounts of task-specific training data perform very poorly. We address this issue by building data-driven speech recognition front-ends on significant amounts of task-independent data from different languages and genres, collected in acoustic conditions similar to the data provided in the low resource scenario. We show that features derived from these trained front-ends perform significantly better and can alleviate the effect of reduced task-specific training data in low resource settings. The proposed features provide an absolute improvement of about 12% (18% relative) in a low-resource LVCSR setting with only one hour of training data. We also demonstrate the usefulness of these features for zero-resource speech applications like spoken term discovery, which operate without any transcribed speech to train systems. The proposed features provide significant gains over conventional acoustic features on various information retrieval metrics for this task.
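Posterior features of the kind the abstract describes are commonly obtained by passing acoustic frames through a neural network trained to estimate phoneme class probabilities, and using the resulting posterior vectors as the feature representation. The sketch below illustrates this idea only; the network weights, dimensions, and activation choices are illustrative assumptions, not the paper's actual front-end (which is trained on task-independent multilingual data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 39-dim acoustic frames (e.g. MFCCs with deltas),
# 500 hidden units, 40 phoneme classes -- illustrative values only.
n_in, n_hid, n_out = 39, 500, 40

# Random weights stand in for a front-end MLP trained on large amounts of
# task-independent data, as the abstract describes.
W1 = rng.standard_normal((n_in, n_hid)) * 0.1
W2 = rng.standard_normal((n_hid, n_out)) * 0.1

def posterior_features(frames):
    """Map acoustic frames (T x n_in) to phoneme posteriors (T x n_out)."""
    h = np.tanh(frames @ W1)                     # hidden-layer activations
    logits = h @ W2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)      # softmax: rows sum to 1

frames = rng.standard_normal((100, n_in))        # 100 dummy acoustic frames
post = posterior_features(frames)
```

Each row of `post` is a probability distribution over phoneme classes for one frame; these vectors can then replace or augment conventional acoustic features in a recognizer or a spoken term discovery system.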

Index Terms: Low-resource speech recognition, spoken term discovery, posterior features.

Full Paper

Bibliographic reference.  Thomas, Samuel / Ganapathy, Sriram / Jansen, Aren / Hermansky, Hynek (2012): "Data-driven posterior features for low resource speech recognition applications", In INTERSPEECH-2012, 791-794.