Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks

Xingchen Song, Guangsen Wang, Yiheng Huang, Zhiyong Wu, Dan Su, Helen Meng

Self-attention networks (SANs) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme, “Speech-XLNet”, to learn speech representations with SANs. First, we find that by shuffling the order of speech frames, Speech-XLNet serves as a strong regularizer that encourages the SAN to make inferences by focusing on global structures through its attention weights. Second, Speech-XLNet allows the model to explore bi-directional context information while maintaining the autoregressive training manner. Visualization results show that our approach generalizes better, with flatter and more widely distributed optima, than the conventional approach. Experimental results on TIMIT demonstrate that Speech-XLNet greatly improves the hybrid SAN/HMM system in terms of both convergence speed and recognition accuracy. Our best systems achieve a relative improvement of 15.2% on the TIMIT task. We also apply our pretrained model to an end-to-end SAN on the WSJ dataset, where WER is reduced by up to 68% when only a few hours of transcribed data are used.
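
The frame-shuffling idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It only assumes the general XLNet-style recipe: sample a random factorization order over the frames, then let each frame attend only to frames that come earlier in that order. The function name `permutation_attention_mask` and its parameters are hypothetical, chosen for illustration.

```python
import random


def permutation_attention_mask(num_frames, seed=None):
    """Sample one factorization order and build its attention mask.

    Hypothetical helper, not taken from the paper. mask[i][j] is True
    when frame i may attend to frame j, that is, when frame j comes
    strictly earlier than frame i in the sampled order.
    """
    rng = random.Random(seed)
    order = list(range(num_frames))
    rng.shuffle(order)  # shuffled prediction order over the frames

    # rank[t] is the position of frame t in the factorization order.
    rank = [0] * num_frames
    for position, frame in enumerate(order):
        rank[frame] = position

    mask = [
        [rank[j] < rank[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]
    return order, mask


order, mask = permutation_attention_mask(6, seed=0)
for frame in order:
    visible = [j for j, allowed in enumerate(mask[frame]) if allowed]
    print(f"frame {frame} is predicted from frames {visible}")
```

Because the order is random, a frame can be predicted from frames on both its left and its right in time, while the training objective stays autoregressive along the sampled order. Averaged over many sampled orders, every frame sees context from both directions.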

DOI: 10.21437/Interspeech.2020-1511

Cite as: Song, X., Wang, G., Huang, Y., Wu, Z., Su, D., Meng, H. (2020) Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks. Proc. Interspeech 2020, 3765-3769, DOI: 10.21437/Interspeech.2020-1511.

  author={Xingchen Song and Guangsen Wang and Yiheng Huang and Zhiyong Wu and Dan Su and Helen Meng},
  title={{Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks}},
  booktitle={Proc. Interspeech 2020},