Comparison of BLSTM-Layer-Specific Affine Transformations for Speaker Adaptation

Markus Kitza, Ralf Schlüter, Hermann Ney

Bidirectional Long Short-Term Memory (BLSTM) Recurrent Neural Network (RNN) acoustic models have demonstrated superior performance over deep feed-forward neural network (DNN) models in speech recognition and many other tasks. Although a large body of work has been reported on DNN model adaptation, very little has been done for BLSTM models. This work presents a systematic study of the adaptation of BLSTM acoustic models by learning affine transformations within the neural network on small amounts of unsupervised adaptation data. Through a series of experiments on two major speech recognition benchmarks (Switchboard and CHiME-4), we investigate the significance of the position of the transformation within a BLSTM network, using separate transformations for the forward and backward directions. We observe that applying affine transformations results in consistent relative word error rate reductions ranging from 6% to 11%, depending on the task and the degree of mismatch between training and test data.
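The core idea of the abstract can be illustrated with a minimal sketch: a per-speaker affine transform y = scale ⊙ x + shift inserted at a chosen position in the BLSTM stack, with separate parameters for the forward and backward halves of the layer output. All class and variable names below are illustrative assumptions, not taken from the paper; the identity initialisation reflects the standard practice of starting adaptation from the speaker-independent model.

```python
import numpy as np

class SpeakerAffine:
    """Hypothetical per-speaker affine layer (not the paper's code).

    Applies y = scale * x + shift element-wise, with separate
    parameters for the forward and backward halves of a BLSTM
    layer's output, as the abstract describes.
    """

    def __init__(self, dim):
        # Identity initialisation: before adaptation, the layer
        # leaves the speaker-independent activations unchanged.
        self.scale_fwd = np.ones(dim)
        self.shift_fwd = np.zeros(dim)
        self.scale_bwd = np.ones(dim)
        self.shift_bwd = np.zeros(dim)

    def __call__(self, h):
        # h: (T, 2*dim) BLSTM output, assumed laid out as
        # [forward half | backward half] along the feature axis.
        dim = h.shape[1] // 2
        fwd = self.scale_fwd * h[:, :dim] + self.shift_fwd
        bwd = self.scale_bwd * h[:, dim:] + self.shift_bwd
        return np.concatenate([fwd, bwd], axis=1)
```

In an adaptation pass, only the `scale_*` and `shift_*` vectors would be updated on the unsupervised adaptation data while all other network weights stay frozen, which keeps the number of speaker-dependent parameters small.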

DOI: 10.21437/Interspeech.2018-2022

Cite as: Kitza, M., Schlüter, R., Ney, H. (2018) Comparison of BLSTM-Layer-Specific Affine Transformations for Speaker Adaptation. Proc. Interspeech 2018, 877-881, DOI: 10.21437/Interspeech.2018-2022.

@inproceedings{kitza2018comparison,
  author={Markus Kitza and Ralf Schlüter and Hermann Ney},
  title={Comparison of {BLSTM}-Layer-Specific Affine Transformations for Speaker Adaptation},
  booktitle={Proc. Interspeech 2018},
  year={2018},
  pages={877--881},
  doi={10.21437/Interspeech.2018-2022}
}