4th International Conference on Spoken Language Processing

Philadelphia, PA, USA
October 3-6, 1996

N-best-based Instantaneous Speaker Adaptation Method for Speech Recognition

Tomoko Matsui, Sadaoki Furui

NTT Human Interface Laboratories, Tokyo, Japan

An instantaneous speaker adaptation method is proposed that uses N-best decoding for speech recognition systems based on continuous mixture-density hidden Markov models (HMMs). The N-best paradigm, a multiple-pass search strategy, keeps the method effective even for speakers whose decodings using speaker-independent models are error-prone. To cope with an insufficient amount of data, the method uses constrained maximum a posteriori (MAP) estimation, in which the parameter vector space is clustered and a mixture-mean bias is estimated for each cluster. Moreover, to maintain continuity between clusters, the bias for each mixture mean is calculated as a weighted sum of the estimated cluster biases. In connected-digit (four-digit string) recognition experiments performed over actual telephone lines, the method reduced error rates by more than 20%, even for speakers whose decodings using speaker-independent models were error-prone.
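
The bias-smoothing step described above can be illustrated with a minimal sketch. The abstract does not state how the weights are chosen, so the distance-based softmax weighting below is an assumption made purely for illustration, as are the names `smoothed_biases`, `adapt_means`, and `scale`. The per-cluster biases are taken as given inputs; the constrained MAP estimation that would produce them is not shown.

```python
import math


def smoothed_biases(means, centroids, cluster_biases, scale=1.0):
    """Return one bias vector per mixture mean, as a weighted sum of cluster biases.

    Assumption (not from the paper): weights are a softmax over negative
    squared distances between the mixture mean and each cluster centroid.
    """
    result = []
    for mean in means:
        # Squared Euclidean distance from this mean to every cluster centroid.
        sq_dists = [
            sum((m - c) ** 2 for m, c in zip(mean, centroid))
            for centroid in centroids
        ]
        # Subtract the minimum distance before exponentiating for numerical stability.
        d_min = min(sq_dists)
        raw = [math.exp(-(d - d_min) / scale) for d in sq_dists]
        total = sum(raw)
        weights = [r / total for r in raw]
        # Weighted sum of the cluster biases, one component at a time.
        bias = [
            sum(w * b[k] for w, b in zip(weights, cluster_biases))
            for k in range(len(mean))
        ]
        result.append(bias)
    return result


def adapt_means(means, centroids, cluster_biases, scale=1.0):
    """Shift each mixture mean by its smoothed bias."""
    biases = smoothed_biases(means, centroids, cluster_biases, scale)
    return [
        [m + b for m, b in zip(mean, bias)]
        for mean, bias in zip(means, biases)
    ]
```

With this weighting, a mean lying midway between two centroids receives the average of the two cluster biases, while a mean close to one centroid receives a bias close to that cluster's bias, so the applied shift changes smoothly across cluster boundaries instead of jumping.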


Bibliographic reference. Matsui, Tomoko / Furui, Sadaoki (1996): "N-best-based instantaneous speaker adaptation method for speech recognition", in ICSLP-1996, 973-976.