Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models

Ashtosh Sapru, Sri Garimella

State-of-the-art acoustic modeling (AM) techniques use long short-term memory (LSTM) networks and apply multiple phases of training on large amounts of labeled acoustic data: initial cross-entropy (CE) training or connectionist temporal classification (CTC) training, followed by sequence discriminative training such as state-level Minimum Bayes Risk (sMBR). Recently, there has been considerable interest in applying Semi-Supervised Learning (SSL) methods that leverage substantial amounts of unlabeled speech to improve the AM. This paper proposes a novel Teacher-Student based knowledge distillation (KD) approach for sequence discriminative training, where the reference state sequences of unlabeled data are estimated using a strong Bi-directional LSTM Teacher model and then used to guide the sMBR training of an LSTM Student model. We build a strong supervised LSTM AM baseline using 45000 hours of labeled multi-dialect English data for the initial CE or CTC training stage, and 11000 hours of its British English subset for the sMBR training phase. To demonstrate the efficacy of the proposed approach, we leverage an additional 38000 hours of unlabeled British English data at the sMBR stage only, which yields a relative Word Error Rate (WER) improvement in the range of 6%–11% over supervised baselines in clean and noisy test conditions.
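The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the Teacher emits per-frame state posteriors, takes the per-frame argmax as the estimated reference state sequence, and scores the Student with a frame-level expected state accuracy, which is only a simplified stand-in for the lattice-based sMBR objective. All function names and the toy numbers are hypothetical. The last helper shows how a relative WER improvement is computed.

```python
from typing import List, Sequence


def pseudo_reference_states(teacher_posteriors: Sequence[Sequence[float]]) -> List[int]:
    """Estimate a reference state sequence for unlabeled audio.

    Assumption: the Teacher gives one posterior vector per frame and the
    most probable state per frame is used as the pseudo-reference.
    """
    return [max(range(len(frame)), key=frame.__getitem__)
            for frame in teacher_posteriors]


def expected_state_accuracy(student_posteriors: Sequence[Sequence[float]],
                            reference_states: Sequence[int]) -> float:
    """Frame-level proxy for an sMBR-style objective.

    Averages the probability the Student assigns to the reference state in
    each frame. Real sMBR takes this expectation over lattice paths.
    """
    total = sum(frame[ref]
                for frame, ref in zip(student_posteriors, reference_states))
    return total / len(reference_states)


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER improvement in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy posteriors over two states for two frames (made-up values).
    teacher = [[0.1, 0.9], [0.8, 0.2]]
    student = [[0.3, 0.7], [0.6, 0.4]]

    refs = pseudo_reference_states(teacher)
    print("pseudo-reference states:", refs)
    print("expected state accuracy:", expected_state_accuracy(student, refs))
    # Hypothetical WERs, chosen only to illustrate the formula.
    print("relative improvement (%):", relative_wer_improvement(10.0, 9.0))
```

In this sketch, training the Student would mean maximizing the expected state accuracy against the Teacher-derived reference; on labeled data the same objective would use the reference states obtained from the transcripts instead.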

DOI: 10.21437/Interspeech.2020-2056

Cite as: Sapru, A., Garimella, S. (2020) Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models. Proc. Interspeech 2020, 3585-3589, DOI: 10.21437/Interspeech.2020-2056.

  author={Ashtosh Sapru and Sri Garimella},
  title={{Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models}},
  booktitle={Proc. Interspeech 2020},