Contextual Prediction Models for Speech Recognition

Yoni Halpern, Keith Hall, Vlad Schogol, Michael Riley, Brian Roark, Gleb Skobeltsyn, Martin Bäuml

We introduce an approach to biasing language models towards known contexts without requiring separate language models or explicit contextually-dependent conditioning contexts. We do so by presenting an alternative ASR objective, where we predict the acoustics and words given the contextual cue, such as the geographic location of the speaker. A simple factoring of the model results in an additional biasing term, which effectively indicates how correlated a hypothesis is with the contextual cue (e.g., given the hypothesized transcript, how likely is the user’s known location). We demonstrate that this factorization allows us to train relatively small contextual models which are effective in speech recognition. An experimental analysis shows a perplexity reduction of up to 35% and a relative reduction in word error rate of 1.6% on a targeted voice search dataset when using the user’s coarse location as a contextual cue.
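The factorization described in the abstract can be sketched as follows: modeling P(A, W | C) and applying Bayes' rule to the context term gives P(A|W)·P(W)·P(C|W)/P(C), so in the log domain the contextual cue contributes an additive bias, log P(C|W) − log P(C). A minimal n-best rescoring illustration (all function names, scores, and the interpolation weight `lam` are hypothetical, not taken from the paper):

```python
import math

def biased_score(acoustic_logp, lm_logp, ctx_logp_given_w, ctx_logp, lam=1.0):
    """Score a hypothesis W under the factorization
    P(A, W | C) ~ P(A|W) * P(W) * [P(C|W) / P(C)]^lam,
    i.e. the standard ASR score plus an additive contextual bias term."""
    return acoustic_logp + lm_logp + lam * (ctx_logp_given_w - ctx_logp)

def rescore(hypotheses, ctx_logp, lam=1.0):
    """Pick the best hypothesis from an n-best list.
    hypotheses: list of (text, acoustic_logp, lm_logp, ctx_logp_given_w)."""
    return max(hypotheses,
               key=lambda h: biased_score(h[1], h[2], h[3], ctx_logp, lam))

# Illustrative example: two competing transcripts for a voice-search query,
# where C is the user's coarse location (known to be near Boston).
hyps = [
    ("pizza near austin", -10.0, -3.0, math.log(1e-4)),  # location unlikely given W
    ("pizza near boston", -10.2, -3.2, math.log(5e-2)),  # location likely given W
]
ctx_logp = math.log(1e-3)  # prior probability of the user's location

best_unbiased = rescore(hyps, ctx_logp, lam=0.0)[0]  # ignores the context term
best_biased = rescore(hyps, ctx_logp, lam=1.0)[0]    # applies the biasing term
```

With `lam=0` the slightly better acoustic/LM score wins ("pizza near austin"); with the biasing term enabled, the hypothesis correlated with the user's known location ("pizza near boston") overtakes it, which is the behavior the biasing term is meant to produce.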

DOI: 10.21437/Interspeech.2016-1358

Cite as

Halpern, Y., Hall, K., Schogol, V., Riley, M., Roark, B., Skobeltsyn, G., Bäuml, M. (2016) Contextual Prediction Models for Speech Recognition. Proc. Interspeech 2016, 2338-2342.

@inproceedings{halpern16_interspeech,
  author={Yoni Halpern and Keith Hall and Vlad Schogol and Michael Riley and Brian Roark and Gleb Skobeltsyn and Martin Bäuml},
  title={Contextual Prediction Models for Speech Recognition},
  booktitle={Interspeech 2016},
  year={2016},
  pages={2338--2342},
  doi={10.21437/Interspeech.2016-1358}
}