A Mask-Based Model for Mandarin Chinese Polyphone Disambiguation

Haiteng Zhang, Huashan Pan, Xiulin Li

Polyphone disambiguation is an essential part of a Mandarin text-to-speech (TTS) system. However, a conventional system that models the entire Pinyin set can predict a pronunciation that belongs to an unrelated polyphonic character rather than to the current input character, which degrades TTS performance. To address this issue, we introduce a mask-based model for polyphone disambiguation. The model takes a mask vector extracted from the context as an extra input. The mask vector not only acts as a weighting factor in Weighted-softmax to prevent such mispredictions but also eliminates the contribution of the non-candidate set to the overall loss. Moreover, to mitigate the uneven distribution of pronunciations, we introduce a new loss called Modified Focal Loss. The experimental results show the effectiveness of the proposed mask-based model. We also empirically study the impact of Weighted-softmax and Modified Focal Loss. We find that Weighted-softmax effectively prevents the model from predicting outside the candidate set, and that Modified Focal Loss reduces the adverse impact of the uneven distribution of pronunciations.
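The masking idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption: the mask is taken to be a 0/1 vector over the Pinyin set, Weighted-softmax is taken to multiply each exponentiated logit by its mask entry before normalizing, and the standard focal loss stands in for the paper's Modified Focal Loss, whose modifications are not described here. The names `weighted_softmax`, `focal_loss`, and the toy `PINYIN` inventory are hypothetical.

```python
import math

def weighted_softmax(logits, mask):
    """Softmax where each exponentiated logit is weighted by a 0/1 mask.

    Assumed form: p_i = m_i * exp(z_i) / sum_j m_j * exp(z_j).
    Entries outside the candidate set (m_i == 0) get probability 0.
    """
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    if not any(mask):
        raise ValueError("mask must keep at least one candidate")
    # Subtract the largest candidate logit for numerical stability.
    peak = max(z for z, m in zip(logits, mask) if m)
    weighted = [math.exp(z - peak) if m else 0.0 for z, m in zip(logits, mask)]
    total = sum(weighted)
    return [w / total for w in weighted]

def focal_loss(probs, target, gamma=2.0):
    """Standard focal loss, used here only as a stand-in (assumption)."""
    p_t = probs[target]
    if p_t <= 0.0:
        raise ValueError("target must lie inside the candidate set")
    # Confident, correct predictions are down-weighted by (1 - p_t)^gamma.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Hypothetical toy inventory: two readings each for two polyphonic characters.
PINYIN = ["le5", "liao3", "xing2", "hang2"]
logits = [1.0, 0.5, 3.0, 2.0]   # raw scores over the whole Pinyin set
mask = [1, 1, 0, 0]             # only the first two are candidates for the input

probs = weighted_softmax(logits, mask)
predicted = PINYIN[probs.index(max(probs))]
loss = focal_loss(probs, PINYIN.index("le5"))
```

In this toy case the highest raw logit belongs to a reading of a different character, yet the masked distribution assigns it zero probability, so the prediction stays inside the candidate set. Because the non-candidate entries contribute nothing to the normalizer, they also contribute nothing to the loss.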

DOI: 10.21437/Interspeech.2020-1142

Cite as: Zhang, H., Pan, H., Li, X. (2020) A Mask-Based Model for Mandarin Chinese Polyphone Disambiguation. Proc. Interspeech 2020, 1728-1732, DOI: 10.21437/Interspeech.2020-1142.

  author={Haiteng Zhang and Huashan Pan and Xiulin Li},
  title={{A Mask-Based Model for Mandarin Chinese Polyphone Disambiguation}},
  booktitle={Proc. Interspeech 2020},