Learning Speaker Aware Offsets for Speaker Adaptation of Neural Networks

Leda Sarı, Samuel Thomas, Mark A. Hasegawa-Johnson

In this work, we present an unsupervised long short-term memory (LSTM) layer normalization technique that we call adaptation by speaker aware offsets (ASAO). These offsets are learned by an auxiliary network attached to the main senone classifier. The auxiliary network takes the main network's LSTM activations as input and tries to reconstruct speaker-level, (speaker, phone)-level, and (speaker, senone)-level averages of the activations by minimizing the mean-squared error. Once the auxiliary network has been jointly trained with the main network, no additional information about the test data is needed at test time, as the network generates the offset itself. Unlike many speaker adaptation studies which adapt only fully connected layers, our method is applicable to LSTM layers as well as fully connected layers. In our experiments, we investigate the effect of applying ASAO to LSTM layers at different depths. We also evaluate its performance when the inputs are already speaker-adapted by feature-space maximum likelihood linear regression (fMLLR). In addition, we compare ASAO with a speaker adversarial training framework. ASAO achieves higher senone classification accuracy and lower word error rate (WER) than both the unadapted models and the adversarial model on the HUB4 dataset, with an absolute WER reduction of up to 2%.
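To make the core idea concrete, the toy sketch below illustrates one piece of it: an auxiliary regressor trained to reconstruct speaker-level averages of hidden activations from the frame-level activations themselves, so that at test time a speaker-aware offset can be produced without speaker labels. This is a minimal NumPy sketch under assumed shapes and a single linear auxiliary layer; the paper's actual model (LSTM layers, senone classifier, joint training, and the (speaker, phone)- and (speaker, senone)-level targets) is not reproduced here, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "LSTM activations": 200 frames, 16 dims, from 4 speakers.
# (Shapes and data are illustrative, not from the paper.)
T, D, S = 200, 16, 4
h = rng.normal(size=(T, D))
spk = rng.integers(0, S, size=T)

# Speaker-level average of the activations: the auxiliary target.
# Each frame's target is its own speaker's mean activation vector.
targets = np.stack([h[spk == s].mean(axis=0) for s in range(S)])[spk]

# Auxiliary "network": a single linear layer trained with MSE to
# reconstruct the per-speaker averages from the frame activations.
W = rng.normal(scale=0.01, size=(D, D))
b = np.zeros(D)
lr = 0.05

mse0 = float(((h @ W + b - targets) ** 2).mean())  # loss before training
for _ in range(300):
    pred = h @ W + b                    # predicted speaker-aware offset
    grad = 2.0 * (pred - targets) / T   # d(mean MSE)/d(pred)
    W -= lr * (h.T @ grad)              # gradient step on the weights
    b -= lr * grad.sum(axis=0)          # gradient step on the bias

mse = float(((h @ W + b - targets) ** 2).mean())

# At test time no speaker information is needed: the auxiliary output
# itself serves as the offset added to the activations.
h_adapted = h + (h @ W + b)
```

The training loss decreases (`mse < mse0`), and `h_adapted` has the same shape as `h`, so the offset can be applied inside the network without changing downstream layer dimensions.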

DOI: 10.21437/Interspeech.2019-1788

Cite as: Sarı, L., Thomas, S., Hasegawa-Johnson, M.A. (2019) Learning Speaker Aware Offsets for Speaker Adaptation of Neural Networks. Proc. Interspeech 2019, 769-773, DOI: 10.21437/Interspeech.2019-1788.

@inproceedings{sari2019learning,
  author={Leda Sarı and Samuel Thomas and Mark A. Hasegawa-Johnson},
  title={{Learning Speaker Aware Offsets for Speaker Adaptation of Neural Networks}},
  booktitle={Proc. Interspeech 2019},
  year={2019},
  pages={769--773},
  doi={10.21437/Interspeech.2019-1788}
}