This paper presents two novel language identification algorithms for use with minimal training data. No transcriptions or dictionaries are required for training, as the acoustic models are based on English speech only and the language models are derived from phonetic sequences generated by an HMM recognizer. In the first approach, pseudo-words are generated from the output of a phoneme recognizer by a sub-string alignment algorithm followed by agglomerative clustering of the aligned sub-strings. A bigram language model incorporating phonemes and pseudo-words is built for each language, and HMM likelihood scores (including contributions from both acoustic models and language models) are used in language discrimination. In the second approach, an iterative language model estimation algorithm is used. Language-pair discrimination experiments on the OGI multi-language telephone speech corpus show that both new methods provide an effective characterization of the languages to be identified.
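The language-model side of the discrimination step can be illustrated with a minimal sketch. It assumes token sequences (phonemes or pseudo-words) have already been produced by a recognizer, and it scores a test sequence with one bigram model per language. The toy token strings, the add-alpha smoothing, and the function names `train_bigram`, `score`, and `identify` are illustrative assumptions, not details taken from the paper, and the acoustic-model contribution to the HMM likelihood is omitted.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sequence-start marker


def train_bigram(sequences, vocab, alpha=1.0):
    """Return a function giving add-alpha smoothed bigram log-probabilities.

    Smoothing is an assumption of this sketch; the abstract does not say
    how the bigram probabilities are estimated.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = [START] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab_size = len(vocab)

    def logprob(prev, cur):
        numerator = pair_counts[(prev, cur)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        return math.log(numerator / denominator)

    return logprob


def score(logprob, seq):
    """Sum bigram log-probabilities over one token sequence."""
    padded = [START] + list(seq)
    return sum(logprob(prev, cur) for prev, cur in zip(padded, padded[1:]))


def identify(models, seq):
    """Pick the language whose bigram model gives the highest score."""
    return max(models, key=lambda lang: score(models[lang], seq))


# Made-up token sequences standing in for recognizer output.
training = {
    "A": [["k", "a", "t", "a"], ["t", "a", "k", "a"]],
    "B": [["s", "i", "m", "i"], ["m", "i", "s", "i"]],
}
vocab = {tok for seqs in training.values() for seq in seqs for tok in seq}
models = {lang: train_bigram(seqs, vocab) for lang, seqs in training.items()}

guess = identify(models, ["k", "a", "t"])
```

In a language-pair experiment, `models` would hold exactly two entries, and the decision reduces to comparing the two scores.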
Bibliographic reference. Lund, Michael A. / Gish, Herbert (1995): "Two novel language model estimation techniques for statistical language identification", In EUROSPEECH-1995, 1363-1366.