Analysis of Pronunciation Learning in End-to-End Speech Synthesis

Jason Taylor, Korin Richmond

Ensuring correct pronunciation for the widest possible variety of text input is vital for deployed text-to-speech (TTS) systems. For languages such as English that do not have trivial spelling, systems have always relied heavily upon a lexicon, both for pronunciation lookup and for training letter-to-sound (LTS) models as a fall-back to handle out-of-vocabulary words (OOVs). In contrast, recently proposed models that are trained “end-to-end” (E2E) aim to avoid linguistic text analysis and any explicit phone representation, instead learning pronunciation implicitly as part of a direct mapping from input characters to speech audio. This might be termed implicit LTS. In this paper, we explore the nature of this approach by training explicit LTS models with datasets commonly used to build E2E systems. We compare their performance with LTS models trained on a high-quality English lexicon. We find that LTS errors for words with ambiguous or unpredictable pronunciations are mirrored as mispronunciations by an E2E model. Overall, our analysis suggests that limited and unbalanced lexical coverage in E2E training data may pose significant confounding factors that complicate learning accurate pronunciations in a purely E2E system.

DOI: 10.21437/Interspeech.2019-2830

Cite as: Taylor, J., Richmond, K. (2019) Analysis of Pronunciation Learning in End-to-End Speech Synthesis. Proc. Interspeech 2019, 2070-2074, DOI: 10.21437/Interspeech.2019-2830.
