September 22-25, 1997
We define an environment e as a combination of speaker, handset, transmission channel and background noise condition, and regard any practical deployment of a speech recognizer as a mixture of environments. A speech recognizer may be trained on multi-environment data, and it may also need to adapt its trained acoustic models to new conditions. Two questions are therefore of great importance: how to train an HMM on multi-environment data, and from which seed model to start an adaptation. We propose a new approach to speech recognition that, for both training and adaptation, models phonetic variation and environment variation separately. The problem is formulated as a hidden Markov process under the following assumptions:
- Speech x is generated by canonical distributions, i.e. distributions that are independent of environmental factors.
- An unknown linear transformation W_e and a bias b_e, specific to environment e, are applied to x with probability P(e).
- x cannot be observed; what we observe is the outcome of the transformation, o = W_e x + b_e.
Under the maximum-likelihood (ML) criterion, by applying the EM algorithm and extending Baum's forward and backward variables and algorithm, we obtained a novel joint solution for the parameters of the canonical distributions, the transformations and the biases. For special cases, on a noisy telephone speech database, the new formulation is compared with per-utterance cepstral mean normalization (CMN) and shows a word error rate improvement of more than 20%.
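The generative assumptions above can be illustrated with a minimal sketch. It is not the training procedure of the paper (no EM or forward-backward estimation is shown); it only simulates the observation model o = W_e x + b_e and a plain per-utterance mean normalization for comparison. All names (`transform`, `sample_observation`, `cmn`) and the use of independent Gaussians for the canonical distribution are assumptions made for this illustration.

```python
import random

def mat_vec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def transform(W, b, x):
    """Environment transformation: o = W x + b."""
    return [y + b_i for y, b_i in zip(mat_vec(W, x), b)]

def sample_observation(rng, environments, priors, mean, std):
    """Draw one (x, e, o) triple from the assumed generative model.

    environments: list of (W_e, b_e) pairs (hypothetical values)
    priors:       P(e) for each environment
    mean, std:    per-dimension parameters of an illustrative
                  canonical Gaussian (an assumption of this sketch)
    """
    # Canonical, environment-independent speech vector x (unobserved).
    x = [rng.gauss(m, s) for m, s in zip(mean, std)]
    # Pick environment e with probability P(e).
    e = rng.choices(range(len(environments)), weights=priors, k=1)[0]
    W, b = environments[e]
    # Only o is observed.
    return x, e, transform(W, b, x)

def cmn(frames):
    """Per-utterance mean normalization: subtract the utterance mean."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

In this sketch, mean subtraction removes a constant bias b_e exactly when W_e is the identity, since the bias shifts every frame and the utterance mean by the same amount. It leaves any scaling or rotation introduced by a non-identity W_e in place, which is the part of the model that a bias-only normalization cannot account for.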
Bibliographic reference. Gong, Yifan (1997): "Source normalization training for HMM applied to noisy telephone speech recognition", In EUROSPEECH-1997, 1555-1558.