Learning N-Gram Language Models from Uncertain Data

Vitaly Kuznetsov, Hank Liao, Mehryar Mohri, Michael Riley, Brian Roark

We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semi-supervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz back-off model. We compare three approaches to semi-supervised adaptation of language models for speech recognition of selected YouTube video categories: (1) using only the one-best output from the baseline speech recognizer, (2) using samples drawn from lattices with standard algorithms, or (3) using full lattices with our new algorithm. Unlike the other methods, our new algorithm provides models that yield solid improvements over the baseline on the full test set and, further, achieves these gains without hurting performance on any individual video category. We show that the categories with the most data yielded the largest gains. The algorithm has been released as part of the OpenGrm n-gram library [1].
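The central quantity the abstract describes, the probability that an n-gram occurs k times in an uncertain sample, can be sketched for the simplified case where each utterance contributes the n-gram independently with some probability. This is a minimal illustration, not the paper's lattice-based algorithm or the released OpenGrm API; `count_histogram` is a hypothetical helper, and the independence assumption is ours.

```python
def count_histogram(occurrence_probs, max_k=None):
    """Distribution over the number of times an n-gram occurs in the sample.

    occurrence_probs: per-utterance probabilities that the n-gram occurs
    (assumed independent), so the count follows a Poisson-binomial
    distribution, computed here by dynamic programming.
    Returns hist, where hist[k] = P(n-gram occurs exactly k times).
    """
    if max_k is None:
        max_k = len(occurrence_probs)
    hist = [1.0] + [0.0] * max_k  # base case: zero utterances seen
    for p in occurrence_probs:
        # Update in decreasing k so each utterance is counted once.
        for k in range(max_k, 0, -1):
            hist[k] = hist[k] * (1.0 - p) + hist[k - 1] * p
        hist[0] *= (1.0 - p)
    return hist


# Example: an n-gram seen with probability 0.5 in each of two utterances.
hist = count_histogram([0.5, 0.5])
# hist[0], hist[1], hist[2] = 0.25, 0.5, 0.25
```

Histograms of this kind, aggregated over all n-grams, give the generalized counts-of-counts from which a Katz-style discounting scheme can be derived; the paper's method computes them from recognition lattices rather than from independent per-utterance probabilities.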

DOI: 10.21437/Interspeech.2016-1093

Cite as

Kuznetsov, V., Liao, H., Mohri, M., Riley, M., Roark, B. (2016) Learning N-Gram Language Models from Uncertain Data. Proc. Interspeech 2016, 2323-2327.

@inproceedings{kuznetsov16_interspeech,
  author={Vitaly Kuznetsov and Hank Liao and Mehryar Mohri and Michael Riley and Brian Roark},
  title={Learning N-Gram Language Models from Uncertain Data},
  booktitle={Interspeech 2016},
  year={2016},
  pages={2323--2327},
  doi={10.21437/Interspeech.2016-1093}
}