Semi-Supervised End-to-End Speech Recognition

Shigeki Karita, Shinji Watanabe, Tomoharu Iwata, Atsunori Ogawa, Marc Delcroix

We propose a novel semi-supervised method for end-to-end automatic speech recognition (ASR). It can exploit large unpaired speech and text datasets, which require much less human effort to create than paired speech-to-text datasets. Our semi-supervised method targets the extraction of an intermediate representation between speech and text data using a shared encoder network. Autoencoding of text data with this shared encoder improves the feature extraction of text data as well as that of speech data when the intermediate representations of speech and text are similar to each other as an inter-domain feature. In other words, by combining speech-to-text and text-to-text mappings through the shared network, we can improve the speech-to-text mapping by learning to reconstruct the unpaired text data in a semi-supervised end-to-end manner. We investigate how to design a suitable inter-domain loss that minimizes the dissimilarity between the encoded speech and text sequences, which originally belong to quite different domains. On the Wall Street Journal dataset, our proposed semi-supervised training reduces the character error rate from 15.8% to 14.4%, a larger reduction than that obtained with conventional language model integration.
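The combination of objectives described in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of the inter-domain loss or how the terms are weighted, so everything specific here is an assumption made for illustration: the dissimilarity is taken to be the squared distance between time-averaged encodings, and the function names and the weights `alpha` and `beta` are hypothetical.

```python
from statistics import fmean


def mean_vector(frames):
    """Average a sequence of equal-length feature vectors over time."""
    dim = len(frames[0])
    return [fmean(frame[d] for frame in frames) for d in range(dim)]


def inter_domain_loss(speech_hidden, text_hidden):
    """Squared Euclidean distance between time-averaged encodings.

    Assumed dissimilarity measure, chosen only for illustration. Averaging
    over time lets the speech and text sequences differ in length.
    """
    mu_speech = mean_vector(speech_hidden)
    mu_text = mean_vector(text_hidden)
    return sum((s - t) ** 2 for s, t in zip(mu_speech, mu_text))


def semi_supervised_loss(paired_loss, text_autoencoder_loss, domain_loss,
                         alpha=0.5, beta=0.5):
    """Weighted sum of the supervised and unpaired terms.

    alpha and beta are hypothetical interpolation weights in [0, 1].
    """
    unpaired = beta * domain_loss + (1.0 - beta) * text_autoencoder_loss
    return alpha * paired_loss + (1.0 - alpha) * unpaired


# Toy encoder outputs: two speech frames and one text token, dimension 2.
speech_hidden = [[1.0, 2.0], [3.0, 4.0]]
text_hidden = [[0.0, 3.0]]
domain = inter_domain_loss(speech_hidden, text_hidden)
total = semi_supervised_loss(paired_loss=1.0, text_autoencoder_loss=2.0,
                             domain_loss=domain)
```

In this sketch the inter-domain term is zero when the averaged speech and text encodings coincide, which is the condition under which text autoencoding is expected to help the speech-to-text mapping.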

DOI: 10.21437/Interspeech.2018-1746

Cite as: Karita, S., Watanabe, S., Iwata, T., Ogawa, A., Delcroix, M. (2018) Semi-Supervised End-to-End Speech Recognition. Proc. Interspeech 2018, 2-6, DOI: 10.21437/Interspeech.2018-1746.

  author={Shigeki Karita and Shinji Watanabe and Tomoharu Iwata and Atsunori Ogawa and Marc Delcroix},
  title={Semi-Supervised End-to-End Speech Recognition},
  booktitle={Proc. Interspeech 2018},