Semi-Supervised Acoustic Model Training for Five-Lingual Code-Switched ASR

Astik Biswas, Emre Yılmaz, Febe de Wet, Ewald van der Westhuizen, Thomas Niesler

This paper presents recent progress in the acoustic modelling of under-resourced code-switched (CS) speech in multiple South African languages. We consider two approaches. The first constructs separate bilingual acoustic models corresponding to language pairs (English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho). The second constructs a single unified five-lingual acoustic model representing all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). For these two approaches we consider the effectiveness of semi-supervised training to increase the size of the very sparse acoustic training sets. Using approximately 11 hours of untranscribed speech, we show that both approaches benefit from semi-supervised training. The bilingual TDNN-F acoustic models also benefit from the addition of CNN layers (CNN-TDNN-F), while the five-lingual system does not show any significant improvement. Furthermore, because English is common to all language pairs in our data, it dominates when training a unified language model, leading to improved English ASR performance at the expense of the other languages. Nevertheless, the five-lingual model offers flexibility because it can process more than two languages simultaneously, and is therefore an attractive option as an automatic transcription system in a semi-supervised training pipeline.
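The semi-supervised pipeline summarised above uses an existing acoustic model to automatically transcribe the untranscribed speech and then retrains on the resulting pseudo-labels; one common variant keeps only confidently decoded utterances. The sketch below illustrates that selection step in toy form. The function names, data structures, and the confidence threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of confidence-based pseudo-label selection for
# semi-supervised acoustic model training (illustrative assumptions only).

def decode(model, utterance_id):
    """Stand-in for ASR decoding: returns (hypothesis, confidence).

    Here 'model' is just a dict mapping utterance ids to canned results;
    in a real pipeline this would run a trained recogniser over audio.
    """
    return model[utterance_id]

def select_pseudo_labels(model, untranscribed_ids, threshold=0.7):
    """Auto-transcribe untranscribed speech, keep confident hypotheses."""
    selected = []
    for utt_id in untranscribed_ids:
        hypothesis, confidence = decode(model, utt_id)
        if confidence >= threshold:
            # Confident decode: add as a pseudo-labelled training example.
            selected.append((utt_id, hypothesis))
    return selected

# Toy example: canned hypotheses with confidences (hypothetical values).
toy_model = {
    "utt1": ("ngiyabonga thank you", 0.9),  # code-switched isiZulu-English
    "utt2": ("???", 0.3),                   # low confidence: discarded
}
extra_training_data = select_pseudo_labels(toy_model, ["utt1", "utt2"])
# Only the confident utterance survives; the retraining step (not shown)
# would pool these pairs with the manually transcribed training set.
```

In the paper's setting, either the bilingual systems or the single five-lingual system can play the role of the transcriber; the five-lingual model is attractive here precisely because it does not require the language pair of each untranscribed utterance to be known in advance.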

DOI: 10.21437/Interspeech.2019-1325

Cite as: Biswas, A., Yılmaz, E., de Wet, F., van der Westhuizen, E., Niesler, T. (2019) Semi-Supervised Acoustic Model Training for Five-Lingual Code-Switched ASR. Proc. Interspeech 2019, 3745-3749, DOI: 10.21437/Interspeech.2019-1325.

@inproceedings{biswas19_interspeech,
  author={Astik Biswas and Emre Yılmaz and Febe de Wet and Ewald van der Westhuizen and Thomas Niesler},
  title={{Semi-Supervised Acoustic Model Training for Five-Lingual Code-Switched ASR}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3745--3749},
  doi={10.21437/Interspeech.2019-1325}
}