September 22-25, 1997
This paper proposes a method to transform acoustic models (HMM gaussian mixtures) that have been trained on a certain group of speakers for use on speech from a different group of speakers. Cepstral features are transformed on the basis of assumptions regarding the difference in vocal tract length (VTL) between the groups of speakers (VTL normalisation, VTLN). Firstly, the VTL of these groups has been estimated based on the average third formant F . Secondly, the linear acoustic theory of speech production has been applied to warp the spectral characteristics of the existing models so as to match the incoming speech. The mapping is composed of subsequent non-linear submappings. By locally linearizing it, a linear approximation was obtained which is accurate as long as warping is reasonably small. The method has been tested for the TI digits database, containing adult and kids speech, consisting of isolated digits and digit strings of different length. The word error rate when trained on adults and tested on kids with transformed adult models is decreased by more than a factor of 2 compared to the non-transformed case.
Bibliographic reference. Claes, Tom / Dologlou, Ioannis / Bosch, Louis ten / Compernolle, Dirk Van (1997): "New transformations of cepstral parameters for automatic vocal tract length normalization in speech recognition", In EUROSPEECH-1997, 1363-1366.