Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition

Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li, Haizhou Li


Code-switching (CS) occurs when a speaker alternates between words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech therefore has to handle two or more languages at the same time. In this study, we propose a Transformer-based architecture with two symmetric language-specific encoders that capture the individual attributes of each language, improving its acoustic representation. These representations are combined using a language-specific multi-head attention mechanism in the decoder module. Each encoder and its corresponding attention module in the decoder are pre-trained on a large monolingual corpus to alleviate the impact of limited CS training data. We call such a network a multi-encoder-decoder (MED) architecture. Experiments on the SEAME corpus show that the proposed MED architecture achieves 10.2% and 10.8% relative error rate reductions on the CS evaluation sets with Mandarin and English as the matrix language, respectively.
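
To make the described architecture concrete, below is a minimal PyTorch sketch of one MED-style decoder layer: the target sequence attends to two language-specific encoder outputs through separate cross-attention modules, whose contexts are then combined. The class name `MEDDecoderLayer`, the model dimensions, and the simple summation used to fuse the two attention contexts are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the multi-encoder-decoder (MED) idea described above.
# Module names, dimensions, and the fusion of the two cross-attention
# outputs (a plain sum here) are assumptions for illustration only.
import torch
import torch.nn as nn


class MEDDecoderLayer(nn.Module):
    """One decoder layer attending to two language-specific encoders."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One cross-attention module per language-specific encoder, so each
        # encoder/attention pair can be pre-trained on a monolingual corpus.
        self.cross_attn_man = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_eng = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, enc_man, enc_eng):
        # Self-attention over the partial transcription
        # (the causal mask is omitted here for brevity).
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt)[0])
        # Language-specific cross-attention over each encoder's output;
        # the two context vectors are combined by summation as a placeholder.
        ctx_man = self.cross_attn_man(x, enc_man, enc_man)[0]
        ctx_eng = self.cross_attn_eng(x, enc_eng, enc_eng)[0]
        x = self.norm2(x + ctx_man + ctx_eng)
        return self.norm3(x + self.ffn(x))


if __name__ == "__main__":
    layer = MEDDecoderLayer()
    tgt = torch.randn(2, 10, 256)      # (batch, target length, d_model)
    enc_man = torch.randn(2, 50, 256)  # Mandarin encoder output
    enc_eng = torch.randn(2, 50, 256)  # English encoder output
    print(layer(tgt, enc_man, enc_eng).shape)  # torch.Size([2, 10, 256])
```

Keeping the two cross-attention modules separate is what allows each encoder and its matching attention weights to be initialized from monolingual pre-training before fine-tuning on the limited CS data.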


DOI: 10.21437/Interspeech.2020-2488

Cite as: Zhou, X., Yılmaz, E., Long, Y., Li, Y., Li, H. (2020) Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition. Proc. Interspeech 2020, 1042-1046, DOI: 10.21437/Interspeech.2020-2488.


@inproceedings{Zhou2020,
  author={Xinyuan Zhou and Emre Yılmaz and Yanhua Long and Yijie Li and Haizhou Li},
  title={{Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1042--1046},
  doi={10.21437/Interspeech.2020-2488},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2488}
}