13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Phrasal Cohort Based Unsupervised Discriminative Language Modeling

Puyang Xu (1), Brian Roark (2), Sanjeev Khudanpur (1)

(1) Center for Language and Speech Processing, Johns Hopkins University, USA
(2) Center for Spoken Language Understanding, Oregon Health & Science University, USA

Simulated confusions enable the use of large text-only corpora for discriminative language modeling by hallucinating the likely recognition outputs that each (correct) sentence would be confused with. In [1], a novel approach was introduced to simulate confusions using phrasal cohorts derived directly from recognition output. However, the described approach relied on transcribed speech to derive cohorts. In this paper, we extend the phrasal cohort technique to the fully unsupervised scenario, where transcribed data are completely absent. Experimental results show that even if the cohorts are extracted from untranscribed speech, the unsupervised training can still achieve over 40% of the gains of the supervised approach. The results are presented on NIST data sets for a state-of-the-art LVCSR system.

Index Terms: unsupervised training, discriminative language modeling
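
The idea of simulating confusions from phrasal cohorts can be illustrated with a minimal sketch. It assumes a cohort table is already available as a mapping from a phrase to the phrases it tends to be confused with. The table entries, the greedy longest-match segmentation, and the names `COHORTS`, `segment` and `hallucinate` are all hypothetical choices made for illustration; the abstract does not specify how the authors extract cohorts or generate candidates.

```python
from itertools import islice, product

# Hypothetical cohort table: each phrase maps to phrases it may be confused with.
# These entries are invented for illustration, not taken from the paper.
COHORTS = {
    ("recognize", "speech"): [("wreck", "a", "nice", "beach")],
    ("two",): [("to",), ("too",)],
}

def segment(words, cohorts, max_len=3):
    """Greedy left-to-right split into the longest phrases found in `cohorts`.

    Words that start no known phrase become single-word segments.
    """
    i, segments = 0, []
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in cohorts or n == 1:
                segments.append(phrase)
                i += n
                break
    return segments

def hallucinate(sentence, cohorts, limit=10):
    """Return up to `limit` simulated confusions of a (correct) sentence."""
    segments = segment(sentence.split(), cohorts)
    # Each segment may stay unchanged or be swapped for one of its cohort members.
    options = [[seg] + list(cohorts.get(seg, [])) for seg in segments]
    candidates = (
        " ".join(word for seg in combo for word in seg)
        for combo in product(*options)
    )
    next(candidates)  # the first combination is the original sentence; skip it
    return list(islice(candidates, limit))

confusions = hallucinate("we recognize speech two ways", COHORTS)
```

In a discriminative language modeling setup, lists like `confusions` would play the role of competing hypotheses against the original sentence during training; the sketch stops short of that step.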


  1. Sagae, K., Lehr, M., Prud'hommeaux, E., Xu, P., Glenn, N., Karakos, D., Khudanpur, S., Roark, B., Saraclar, M., Shafran, I., Bikel, D., Callison-Burch, C., Cao, Y., Hall, K., Hasler, E., Koehn, P., Lopez, A., Post, M. and Riley, D., “Hallucinated n-best lists for discriminative language modeling”, in Proc. ICASSP, 2012.


Bibliographic reference.  Xu, Puyang / Roark, Brian / Khudanpur, Sanjeev (2012): "Phrasal cohort based unsupervised discriminative language modeling", In INTERSPEECH-2012, 198-201.