Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning

Song Li, Lin Li, Qingyang Hong, Lingling Liu


Recently, Transformer-based end-to-end speech recognition systems have become the state of the art. However, one prominent problem with current end-to-end speech recognition systems is that a large amount of paired data is required to achieve good recognition performance. To address this issue, we propose two unsupervised pre-training strategies, one for the encoder and one for the decoder of the Transformer, which make full use of unpaired data for training. In addition, we propose a new semi-supervised fine-tuning method, named multi-task semantic knowledge learning, to strengthen the Transformer’s ability to learn semantic knowledge and thereby improve system performance. With the proposed methods, we achieve the best CER of 5.9% on the AISHELL-1 test set, surpassing the best end-to-end model by a relative CER reduction of 10.6%. Moreover, relative CER reductions of 20.3% and 17.8% are obtained for low-resource Mandarin and English data sets, respectively.
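The following is a minimal sketch, assuming a PyTorch-style setup, of the overall recipe the abstract outlines: pre-train the Transformer encoder on unpaired speech, pre-train the decoder on unpaired text as a language model, then fine-tune on paired data with a weighted auxiliary semantic loss. The concrete objectives, module sizes, masking scheme, and the 0.3 weight are illustrative assumptions, not the authors' actual configuration described in the paper.

# Sketch only: each stage would normally have its own optimiser and data loader,
# both omitted here for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab = 256, 4000
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=6)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=6)
embed = nn.Embedding(vocab, d_model)
proj = nn.Linear(d_model, vocab)
causal = nn.Transformer.generate_square_subsequent_mask(19)  # causal self-attention mask

# (1) Unsupervised encoder pre-training on unpaired speech:
# reconstruct a masked span of acoustic frames (a stand-in objective).
feats = torch.randn(8, 100, d_model)            # unpaired speech features
corrupted = feats.clone()
corrupted[:, 40:60, :] = 0.0                    # mask a span of frames
enc_loss = F.mse_loss(encoder(corrupted)[:, 40:60, :], feats[:, 40:60, :])

# (2) Unsupervised decoder pre-training on unpaired text:
# next-token prediction with the cross-attention memory zeroed out.
text = torch.randint(1, vocab, (8, 20))         # unpaired text
lm_logits = proj(decoder(embed(text[:, :-1]), torch.zeros(8, 1, d_model), tgt_mask=causal))
lm_loss = F.cross_entropy(lm_logits.reshape(-1, vocab), text[:, 1:].reshape(-1))

# (3) Semi-supervised fine-tuning on paired data: the usual attention-based ASR
# loss plus an auxiliary "semantic" term, here sketched as predicting the
# transcript from a span-masked decoder input so the model must rely on
# context and acoustics (an assumed stand-in for the paper's semantic task).
pair_feats = torch.randn(8, 100, d_model)       # paired speech features
pair_text = torch.randint(1, vocab, (8, 20))    # paired transcripts
memory = encoder(pair_feats)
asr_logits = proj(decoder(embed(pair_text[:, :-1]), memory, tgt_mask=causal))
asr_loss = F.cross_entropy(asr_logits.reshape(-1, vocab), pair_text[:, 1:].reshape(-1))

masked = pair_text.clone()
masked[:, 5:8] = 0                              # token id 0 used as a [MASK] stand-in
sem_logits = proj(decoder(embed(masked[:, :-1]), memory, tgt_mask=causal))
semantic_loss = F.cross_entropy(sem_logits.reshape(-1, vocab), pair_text[:, 1:].reshape(-1))

(asr_loss + 0.3 * semantic_loss).backward()     # multi-task fine-tuning step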


DOI: 10.21437/Interspeech.2020-2007

Cite as: Li, S., Li, L., Hong, Q., Liu, L. (2020) Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning. Proc. Interspeech 2020, 5006-5010, DOI: 10.21437/Interspeech.2020-2007.


@inproceedings{Li2020,
  author={Song Li and Lin Li and Qingyang Hong and Lingling Liu},
  title={{Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={5006--5010},
  doi={10.21437/Interspeech.2020-2007},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2007}
}