Parallel and cascaded deep neural networks for text-to-speech synthesis

Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi

An investigation of cascaded and parallel deep neural networks for speech synthesis is conducted. In these systems, suprasegmental linguistic features (syllable-level and above) are processed separately from segmental features (phone-level and below). The suprasegmental component of the networks learns compact distributed representations of high-level linguistic units without any segmental influence. These representations are then integrated into a frame-level system using a cascaded or a parallel approach. In the cascaded network, suprasegmental representations are used as input to the frame-level network. In the parallel network, segmental and suprasegmental features are processed separately and concatenated at a later stage. These experiments are conducted with a standard set of high-dimensional linguistic features as well as a hand-pruned one. It is observed that hierarchical systems are consistently preferred over the baseline feedforward systems. Similarly, parallel networks are preferred over cascaded networks.
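The two architectures described in the abstract can be illustrated with a minimal sketch. The code below is not the paper's implementation: the layer sizes, the single-hidden-layer sub-networks, and the tanh activation are illustrative assumptions. It only shows where the two designs differ, namely whether the suprasegmental embedding enters the frame-level network as extra input (cascaded) or is concatenated with a separately processed segmental stream at a later stage (parallel).

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # One fully connected layer with a tanh nonlinearity.
    return np.tanh(x @ w + b)

# Hypothetical dimensions (not taken from the paper): per-frame segmental
# (phone-level) features and suprasegmental (syllable-level and above)
# features already upsampled to the frame rate.
n_frames, seg_dim, supra_dim, embed_dim, out_dim = 10, 40, 20, 8, 5
seg = rng.standard_normal((n_frames, seg_dim))
supra = rng.standard_normal((n_frames, supra_dim))

# Suprasegmental sub-network: learns a compact representation of the
# high-level units with no segmental influence.
w_s, b_s = rng.standard_normal((supra_dim, embed_dim)), np.zeros(embed_dim)
supra_embed = dense(supra, w_s, b_s)

# Cascaded: the suprasegmental embedding is appended to the segmental
# features and the result is fed through the frame-level network.
w_c = rng.standard_normal((seg_dim + embed_dim, out_dim))
cascaded_out = dense(np.concatenate([seg, supra_embed], axis=1),
                     w_c, np.zeros(out_dim))

# Parallel: the segmental stream is processed by its own sub-network, and
# the two streams are concatenated only at a later stage.
w_p1 = rng.standard_normal((seg_dim, embed_dim))
seg_hidden = dense(seg, w_p1, np.zeros(embed_dim))
w_p2 = rng.standard_normal((2 * embed_dim, out_dim))
parallel_out = dense(np.concatenate([seg_hidden, supra_embed], axis=1),
                     w_p2, np.zeros(out_dim))

print(cascaded_out.shape, parallel_out.shape)  # both (10, 5)
```

In both variants the suprasegmental embedding is computed first, in isolation; the architectures differ only in how that embedding is merged into the frame-level prediction path.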

DOI: 10.21437/SSW.2016-17

Cite as

Sam Ribeiro, M., Watts, O., Yamagishi, J. (2016) Parallel and cascaded deep neural networks for text-to-speech synthesis. Proc. 9th ISCA Speech Synthesis Workshop, 100-105.

@inproceedings{SamRibeiro2016,
  author={Manuel Sam Ribeiro and Oliver Watts and Junichi Yamagishi},
  title={Parallel and cascaded deep neural networks for text-to-speech synthesis},
  booktitle={9th ISCA Speech Synthesis Workshop},
  year={2016},
  pages={100--105},
  doi={10.21437/SSW.2016-17}
}