Distant Supervision for Polyphone Disambiguation in Mandarin Chinese

Jiawen Zhang, Yuanyuan Zhao, Jiaqi Zhu, Jinba Xiao


Grapheme-to-phoneme (G2P) conversion plays an important role in building a Mandarin Chinese text-to-speech (TTS) system, where polyphone disambiguation is an indispensable task. However, most previous polyphone disambiguation models are trained on manually annotated datasets, which suffer from data scarcity, narrow coverage, and unbalanced data distribution. In this paper, we propose a framework that can predict the pronunciations of Chinese characters, whose core model is trained in a distantly supervised way. Specifically, we utilize the alignment procedure used for acoustic models to produce abundant character-phoneme sequence pairs, which are employed to train a Seq2Seq model with an attention mechanism. We also make use of a language model trained on phoneme sequences to alleviate the impact of noise in the auto-generated dataset. Experimental results demonstrate that even without additional syntactic features and pre-trained embeddings, our approach achieves competitive prediction results, and in particular improves the predictive accuracy for unbalanced polyphonic characters. In addition, compared with the manually annotated training datasets, the auto-generated one is more diversified and makes the results more consistent with the pronunciation habits of most people.
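The pipeline the abstract describes can be illustrated with a minimal sketch: forced alignment pairs each character with the phoneme it was actually pronounced as, yielding (character sequence, phoneme sequence) training pairs for the Seq2Seq model, and a phoneme language model can later rescore candidate pronunciations to counter alignment noise. All names, data, and the linear score combination below are illustrative assumptions, not the authors' code.

```python
# Hypothetical output of a forced aligner: each utterance is a list of
# (character, pinyin-with-tone) pairs. 行 is a classic polyphone:
# it reads hang2 in 银行 "bank" but xing2 in 行为 "behavior".
aligned_utterances = [
    [("银", "yin2"), ("行", "hang2")],
    [("行", "xing2"), ("为", "wei2")],
]

POLYPHONES = {"行"}  # characters with more than one reading (toy set)

def make_training_pairs(utterances, polyphones):
    """Keep utterances containing a polyphonic character and emit
    (character string, phoneme list) pairs for Seq2Seq training."""
    pairs = []
    for utt in utterances:
        chars = "".join(c for c, _ in utt)
        phones = [p for _, p in utt]
        if any(c in polyphones for c in chars):
            pairs.append((chars, phones))
    return pairs

def rescore(seq2seq_logp, lm_logp, alpha=0.3):
    """Combine the Seq2Seq score with a phoneme-LM score (assumed
    log-linear interpolation) to down-weight noisy pronunciations."""
    return seq2seq_logp + alpha * lm_logp

pairs = make_training_pairs(aligned_utterances, POLYPHONES)
print(pairs)
# -> [('银行', ['yin2', 'hang2']), ('行为', ['xing2', 'wei2'])]
```

Both utterances contain 行 with different phonemes depending on context, which is exactly the contextual signal the attention-based Seq2Seq model is trained to pick up.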


 DOI: 10.21437/Interspeech.2020-2427

Cite as: Zhang, J., Zhao, Y., Zhu, J., Xiao, J. (2020) Distant Supervision for Polyphone Disambiguation in Mandarin Chinese. Proc. Interspeech 2020, 1753-1757, DOI: 10.21437/Interspeech.2020-2427.


@inproceedings{Zhang2020,
  author={Jiawen Zhang and Yuanyuan Zhao and Jiaqi Zhu and Jinba Xiao},
  title={{Distant Supervision for Polyphone Disambiguation in Mandarin Chinese}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1753--1757},
  doi={10.21437/Interspeech.2020-2427},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2427}
}