Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi


This paper describes a simple and efficient pre-training method that uses a large number of external texts to enhance end-to-end automatic speech recognition (ASR). In general, speech-to-text paired data are essential for constructing end-to-end ASR models, but collecting a large amount of such data is difficult in practice. One issue caused by this data scarcity is poor ASR performance on out-of-domain tasks, i.e., tasks whose domain differs from that of the available speech-to-text paired data, because the mapping from speech information to textual information is not learned well for those domains. To address this problem, we leverage a large number of phoneme-to-grapheme (P2G) paired data, which can be easily created from external texts and a rich pronunciation dictionary. P2G conversion and end-to-end ASR can be regarded as similar transformation tasks in which input phonetic information is converted into textual information. Our method uses the P2G conversion task to pre-train the decoder network of a Transformer encoder-decoder based end-to-end ASR model. Experiments using 4 billion tokens of Web text demonstrate that our pre-training significantly improves ASR performance on out-of-domain tasks.
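To make the data-construction step concrete, the following is a minimal sketch of how P2G training pairs might be built from plain text using a pronunciation dictionary, as the abstract describes. The tiny in-line dictionary is a hypothetical stand-in for a rich lexicon (e.g., CMUdict); how the paper actually handles out-of-vocabulary words and tokenization is not specified here, so this sketch simply skips sentences containing unknown words.

```python
# Hypothetical pronunciation dictionary mapping words to phoneme sequences;
# a real system would use a rich lexicon such as CMUdict.
PRON_DICT = {
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

def make_p2g_pairs(sentences):
    """Convert each sentence into a (phoneme sequence, grapheme sequence) pair."""
    pairs = []
    for sentence in sentences:
        words = sentence.lower().split()
        # Skip sentences with out-of-vocabulary words (one simple policy).
        if not all(w in PRON_DICT for w in words):
            continue
        phonemes = [p for w in words for p in PRON_DICT[w]]
        pairs.append((phonemes, words))
    return pairs

pairs = make_p2g_pairs(["speech recognition", "unknown words here"])
print(pairs)
```

The decoder of the encoder-decoder ASR model can then be pre-trained to map the phoneme side of each pair to its grapheme side, so that the phonetic-to-textual transformation is learned from text alone before fine-tuning on speech-to-text paired data.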


 DOI: 10.21437/Interspeech.2020-1930

Cite as: Masumura, R., Makishima, N., Ihori, M., Takashima, A., Tanaka, T., Orihashi, S. (2020) Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition. Proc. Interspeech 2020, 2822-2826, DOI: 10.21437/Interspeech.2020-1930.


@inproceedings{Masumura2020,
  author={Ryo Masumura and Naoki Makishima and Mana Ihori and Akihiko Takashima and Tomohiro Tanaka and Shota Orihashi},
  title={{Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2822--2826},
  doi={10.21437/Interspeech.2020-1930},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1930}
}