Speaker Adaptive Training for Speech Recognition Based on Attention-Over-Attention Mechanism

Genshun Wan, Jia Pan, Qingran Wang, Jianqing Gao, Zhongfu Ye


In our previous work, we introduced a speaker adaptive training method based on a frame-level attention mechanism for speech recognition, which proved to be an effective approach to speaker adaptive training. In this paper, we present an improved method that introduces an attention-over-attention mechanism. This attention module further measures the contribution of each frame in an utterance to the speaker embeddings, and then generates an utterance-level speaker embedding used to perform speaker adaptive training. Compared with frame-level embeddings, the generated utterance-level speaker embeddings are more representative and stable. Experiments on both the Switchboard and AISHELL-2 tasks show that our method achieves a relative word error rate reduction of approximately 8.0% compared with the speaker-independent model, and over 6.0% compared with the traditional utterance-level d-vector-based speaker adaptive training method.
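The pooling step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' exact architecture: it assumes two learned attention scorers over the frame-level features whose normalized scores are combined and renormalized into attention-over-attention weights, which then pool the frames into a single utterance-level embedding. The parameter names `W`, `u`, and `v` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aoa_pool(frames, W, u, v):
    """Pool frame-level speaker features into one utterance-level embedding.

    frames: (T, D) frame-level speaker features for one utterance.
    W (D, D), u (D,), v (D,): hypothetical learned parameters.
    """
    H = np.tanh(frames @ W)            # (T, D) shared hidden projection
    alpha = softmax(H @ u)             # (T,) first-level per-frame attention
    beta = softmax(H @ v)              # (T,) second attention over the same frames
    weights = alpha * beta             # combine the two attention distributions
    weights = weights / weights.sum()  # renormalized attention-over-attention weights
    return weights @ frames            # (D,) utterance-level speaker embedding

# Toy usage with random features and parameters.
rng = np.random.default_rng(0)
T, D = 50, 16
frames = rng.standard_normal((T, D))
W = 0.1 * rng.standard_normal((D, D))
u = rng.standard_normal(D)
v = rng.standard_normal(D)
emb = aoa_pool(frames, W, u, v)        # one embedding per utterance
```

In training, the resulting utterance-level embedding would condition the acoustic model, replacing the per-frame embeddings of the earlier frame-level method.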


DOI: 10.21437/Interspeech.2020-1727

Cite as: Wan, G., Pan, J., Wang, Q., Gao, J., Ye, Z. (2020) Speaker Adaptive Training for Speech Recognition Based on Attention-Over-Attention Mechanism. Proc. Interspeech 2020, 1251-1255, DOI: 10.21437/Interspeech.2020-1727.


@inproceedings{Wan2020,
  author={Genshun Wan and Jia Pan and Qingran Wang and Jianqing Gao and Zhongfu Ye},
  title={{Speaker Adaptive Training for Speech Recognition Based on Attention-Over-Attention Mechanism}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1251--1255},
  doi={10.21437/Interspeech.2020-1727},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1727}
}