Enhancing Sequence-to-Sequence Text-to-Speech with Morphology

Jason Taylor, Korin Richmond


Neural sequence-to-sequence (S2S) modelling encodes a single, unified representation for each input sequence. When used for text-to-speech synthesis (TTS), such representations must embed ambiguities between English spelling and pronunciation. For example, the character sequence th is pronounced differently in pothole and there. This can be problematic when predicting pronunciation directly from letters. We posit that pronunciation becomes easier to predict when letters are grouped into sub-word units like morphemes (e.g. a boundary lies between t and h in pothole but not in there). Moreover, morphological boundaries can reduce the total number of distinct unit subsequences seen in training and increase how often each one is seen. Accordingly, we test here the effect of augmenting input sequences of letters with morphological boundaries. We find that morphological boundaries substantially lower the Word and Phone Error Rates (WER and PER) for a Bi-LSTM performing G2P, and also increase the naturalness scores of Tacotrons performing TTS in a MUSHRA listening test. The improvements to TTS quality are such that grapheme input augmented with morphological boundaries outperforms phone input without boundaries. Since morphological segmentation may be predicted with high accuracy, we highlight that this simple pre-processing step has important potential for S2S modelling in TTS.
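
The input augmentation described above can be illustrated with a minimal sketch. It assumes the morpheme segmentation is already available (here supplied by hand), and the boundary symbol `|` and the function name `augment_with_boundaries` are hypothetical choices for illustration; the abstract does not specify the token or tooling the authors used.

```python
from typing import List

# Hypothetical boundary symbol; the actual token used in the paper is not stated here.
BOUNDARY = "|"


def augment_with_boundaries(morphemes: List[str], boundary: str = BOUNDARY) -> List[str]:
    """Flatten a morpheme segmentation into a letter sequence,
    inserting a boundary token between adjacent morphemes."""
    tokens: List[str] = []
    for i, morpheme in enumerate(morphemes):
        if i > 0:
            tokens.append(boundary)  # mark the morpheme boundary
        tokens.extend(morpheme)      # add the morpheme's letters one by one
    return tokens


# Hand-written segmentations (assumed, for illustration only).
pothole = augment_with_boundaries(["pot", "hole"])
there = augment_with_boundaries(["there"])

print(pothole)  # the boundary token separates 't' and 'h'
print(there)    # 't' and 'h' stay adjacent
```

In this sketch the letters t and h are adjacent in the sequence for there but separated by the boundary token in the sequence for pothole, so a model reading the augmented input no longer sees the same th subsequence in both words.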


DOI: 10.21437/Interspeech.2020-1547

Cite as: Taylor, J., Richmond, K. (2020) Enhancing Sequence-to-Sequence Text-to-Speech with Morphology. Proc. Interspeech 2020, 1738-1742, DOI: 10.21437/Interspeech.2020-1547.


@inproceedings{Taylor2020,
  author={Jason Taylor and Korin Richmond},
  title={{Enhancing Sequence-to-Sequence Text-to-Speech with Morphology}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1738--1742},
  doi={10.21437/Interspeech.2020-1547},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1547}
}