Speech Transformer with Speaker Aware Persistent Memory

Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma


End-to-end models have been successfully introduced into automatic speech recognition (ASR) and have achieved superior performance compared with conventional hybrid systems, especially with the recently proposed transformer model. However, speaker mismatch between training and test data remains a problem, and speaker adaptation for the transformer model can be further improved. In this paper, we propose speaker-aware training for transformer-based ASR. Specifically, we embed speaker knowledge through a persistent memory model into the speech transformer encoder at the utterance level. The speaker information is represented by a number of static speaker i-vectors, which are concatenated to the speech utterance at each encoder self-attention layer. Persistent memory is thus formed by carrying speaker information through the depth of the encoder. The speaker knowledge is captured by self-attention between the speech and the persistent memory vectors in the encoder. Experimental results on the LibriSpeech, Switchboard and AISHELL-1 ASR tasks show that our proposed model brings relative 4.7%–12.5% word error rate (WER) reductions and achieves superior results compared with other models with the same objective. Furthermore, our model brings relative 2.1%–8.3% WER reductions compared with the first persistent memory model used in ASR.
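The core mechanism described above — concatenating static speaker i-vectors to the key/value sequence of each encoder self-attention layer so that speech frames can attend to speaker information — can be sketched as follows. This is a minimal single-head NumPy illustration under simplifying assumptions (no learned query/key/value projections, no multi-head split, no layer normalization); the function names and shapes are hypothetical, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_with_speaker_memory(frames, speaker_mem):
    """Single-head self-attention where speaker memory vectors are
    concatenated to the keys/values (hypothetical simplification:
    identity projections, one head).

    frames:      (T, d) speech encoder states for one utterance
    speaker_mem: (M, d) persistent memory slots derived from i-vectors
    returns:     (T, d) speech frames enriched with speaker information
    """
    d = frames.shape[-1]
    # Persistent memory: keys/values are speech frames plus speaker slots.
    kv = np.concatenate([frames, speaker_mem], axis=0)   # (T+M, d)
    # Queries come only from the speech frames, so the output length stays T.
    scores = frames @ kv.T / np.sqrt(d)                  # (T, T+M)
    attn = softmax(scores, axis=-1)
    return attn @ kv                                     # (T, d)

# Toy usage: 4 speech frames, 2 speaker memory slots, model dim 8.
rng = np.random.default_rng(0)
out = self_attention_with_speaker_memory(
    rng.standard_normal((4, 8)), rng.standard_normal((2, 8)))
```

Because only the keys and values are extended, the memory vectors influence every frame's output without changing the sequence length, which is what lets the same speaker information persist through all encoder layers.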


DOI: 10.21437/Interspeech.2020-1281

Cite as: Zhao, Y., Ni, C., Leung, C., Joty, S., Chng, E.S., Ma, B. (2020) Speech Transformer with Speaker Aware Persistent Memory. Proc. Interspeech 2020, 1261-1265, DOI: 10.21437/Interspeech.2020-1281.


@inproceedings{Zhao2020,
  author={Yingzhu Zhao and Chongjia Ni and Cheung-Chi Leung and Shafiq Joty and Eng Siong Chng and Bin Ma},
  title={{Speech Transformer with Speaker Aware Persistent Memory}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1261--1265},
  doi={10.21437/Interspeech.2020-1281},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1281}
}