Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition

Karan Taneja, Satarupa Guha, Preethi Jyothi, Basil Abraham

One of the main challenges in building code-mixed ASR systems is the lack of annotated speech data. Often, however, monolingual speech corpora are available in abundance for the languages in the code-mixed speech. In this paper, we explore different techniques that use monolingual speech to create synthetic code-mixed speech and examine their effect on training models for code-mixed ASR. We assume access to a small amount of real code-mixed text, from which we extract probability distributions that govern the transition of phones across languages at code-switch boundaries and the span lengths corresponding to a particular language. We extract segments from monolingual data and concatenate them to form code-mixed utterances such that these probability distributions are preserved. Using this synthetic speech, we show significant improvements in Hindi-English code-mixed ASR performance compared to using synthetic speech naively constructed from complete utterances in different languages. We also present language modelling experiments that use synthetically constructed code-mixed text and discuss their benefits.
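The span-length part of this idea can be illustrated with a minimal sketch. It assumes the code-mixed text is already tagged word by word with a language label and that exactly two languages are involved. The function names `span_length_counts` and `sample_span_plan` are hypothetical and not taken from the paper. The sketch covers only span lengths measured in words; it does not model phone transitions at code-switch boundaries or the extraction and concatenation of audio segments.

```python
import random
from collections import Counter, defaultdict
from itertools import groupby


def span_length_counts(tagged_sentences):
    """Count how often each span length (in words) occurs per language.

    tagged_sentences: list of sentences, each a list of (word, lang) pairs.
    A span is a maximal run of consecutive words with the same language tag.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for lang, group in groupby(sentence, key=lambda pair: pair[1]):
            counts[lang][len(list(group))] += 1
    return counts


def sample_span_plan(counts, num_spans, start_lang, rng):
    """Sample an alternating sequence of (lang, span_length) pairs.

    Span lengths are drawn from the empirical counts for each language.
    Assumption: exactly two languages appear in `counts`.
    """
    langs = sorted(counts)
    if len(langs) != 2:
        raise ValueError("this sketch assumes exactly two languages")
    plan = []
    lang = start_lang
    for _ in range(num_spans):
        lengths = list(counts[lang].keys())
        weights = list(counts[lang].values())
        length = rng.choices(lengths, weights=weights, k=1)[0]
        plan.append((lang, length))
        # Switch to the other language for the next span.
        lang = langs[0] if lang == langs[1] else langs[1]
    return plan
```

A plan such as `[("hi", 2), ("en", 1)]` could then drive the choice of monolingual segments to concatenate, so that the synthetic utterances follow the span-length statistics observed in the real code-mixed text.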

DOI: 10.21437/Interspeech.2019-1959

Cite as: Taneja, K., Guha, S., Jyothi, P., Abraham, B. (2019) Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition. Proc. Interspeech 2019, 2150-2154, DOI: 10.21437/Interspeech.2019-1959.

  author={Karan Taneja and Satarupa Guha and Preethi Jyothi and Basil Abraham},
  title={{Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition}},
  booktitle={Proc. Interspeech 2019},