Towards Context-Aware End-to-End Code-Switching Speech Recognition

Zimeng Qiu, Yiyuan Li, Xinjian Li, Florian Metze, William M. Campbell

Code-switching (CS) speech recognition has drawn increasing attention in recent years, since speakers commonly alternate between languages within a single utterance or discourse. In this work, we propose the Hierarchical Attention-based Recurrent Decoder (HARD) to build a context-aware end-to-end code-switching speech recognition system. HARD is an attention-based decoder that employs a hierarchical recurrent network to enhance the model's awareness of the previously generated sequence (sub-sequence) during decoding. The architecture uses two LSTMs to model encoder hidden states at both the character level and the sub-sequence level, which enables it to generate utterances that switch between languages more precisely from speech. We also employ language identification (LID) as an auxiliary task in multi-task learning (MTL) to boost speech recognition performance. We evaluate our model on the SEAME dataset. Results show that our multi-task learning HARD (MTL-HARD) model improves over the baseline Listen, Attend and Spell (LAS) model, reducing character error rate (CER) from 29.91% to 26.56% and mixed error rate (MER) from 38.99% to 34.50%, and a case study shows that MTL-HARD can carry historical information in the sub-sequences.
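
For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate is commonly computed (Levenshtein distance over characters, divided by reference length), together with the relative reduction implied by the CER and MER figures reported above. The function names `edit_distance`, `cer`, and `relative_reduction` are illustrative and are not taken from the paper; the authors' actual scoring pipeline, including how MER tokenizes mixed-language text, may differ and is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (illustrative helper)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Figures quoted in the abstract (LAS baseline vs. MTL-HARD), in percent.
cer_gain = relative_reduction(29.91, 26.56)
mer_gain = relative_reduction(38.99, 34.50)
print(f"relative CER reduction: {cer_gain:.2%}")
print(f"relative MER reduction: {mer_gain:.2%}")
```

Under these assumptions, the reported absolute improvements correspond to a relative reduction of roughly 11% for both metrics.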

DOI: 10.21437/Interspeech.2020-1980

Cite as: Qiu, Z., Li, Y., Li, X., Metze, F., Campbell, W.M. (2020) Towards Context-Aware End-to-End Code-Switching Speech Recognition. Proc. Interspeech 2020, 4776-4780, DOI: 10.21437/Interspeech.2020-1980.

  author={Zimeng Qiu and Yiyuan Li and Xinjian Li and Florian Metze and William M. Campbell},
  title={{Towards Context-Aware End-to-End Code-Switching Speech Recognition}},
  booktitle={Proc. Interspeech 2020},