Direct Expressive Voice Training Based on Semantic Selection

Igor Jauk, Antonio Bonafonte

This work aims to create expressive voices from audiobooks using semantic selection. First, an acoustic feature vector is extracted for each utterance of the audiobook, including iVectors built from MFCC and F0 features. Then, the transcription of each utterance is projected into a semantic vector space. A seed utterance is projected into the same semantic vector space and its N nearest neighbors are selected. The selection is then filtered to retain only acoustically similar data.

The proposed technique can be used to train emotional voices by using emotional keywords or phrases as seeds, obtaining training data semantically similar to the seed. It can also be used to read larger texts in an expressive manner, creating a specific voice for each sentence. This latter application is compared to a DNN predictor that predicts acoustic features from semantic features. The selected data is used to adapt statistical speech synthesis models. The performance of the technique is analyzed objectively and in a perceptual experiment. In the first part of the experiment, subjects show a clear preference for particular expressive voices for synthesizing semantically expressive utterances. In the second part, the proposed method is shown to achieve similar or better performance than the DNN-based prediction.
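
The two-stage selection described above (semantic nearest neighbors followed by an acoustic filter) could be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not state the distance measure or the filtering rule, so cosine similarity, the centroid-based acoustic filter, and the names `select_training_utterances`, `n_neighbors`, and `keep_fraction` are all assumptions made for the example.

```python
import numpy as np


def cosine_sim(query, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q


def select_training_utterances(seed_sem, sem_vecs, ac_vecs,
                               n_neighbors, keep_fraction=0.5):
    """Return indices of utterances selected for voice adaptation.

    seed_sem : semantic vector of the seed utterance, shape (d_sem,)
    sem_vecs : semantic vectors of all utterances, shape (n, d_sem)
    ac_vecs  : acoustic vectors of all utterances, shape (n, d_ac)
    """
    # Stage 1: the N utterances closest to the seed in semantic space.
    # (Cosine similarity is an assumption, not taken from the source.)
    sem_sim = cosine_sim(seed_sem, sem_vecs)
    nearest = np.argsort(-sem_sim)[:n_neighbors]

    # Stage 2: acoustic filter. As an assumed rule, keep the fraction of
    # the semantic neighbors whose acoustic vectors lie closest to the
    # centroid of the neighbors' acoustic vectors.
    centroid = ac_vecs[nearest].mean(axis=0)
    ac_sim = cosine_sim(centroid, ac_vecs[nearest])
    n_keep = max(1, int(round(keep_fraction * len(nearest))))
    return nearest[np.argsort(-ac_sim)[:n_keep]]


if __name__ == "__main__":
    # Toy data: four utterances with 2-D semantic and acoustic vectors.
    sem = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
    ac = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 0.0]])
    seed = np.array([1.0, 0.0])
    print(select_training_utterances(seed, sem, ac, n_neighbors=3,
                                     keep_fraction=0.7))
```

In this sketch the returned indices would identify the utterances whose audio is passed on to model adaptation; the adaptation step itself is outside the scope of the example.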

DOI: 10.21437/Interspeech.2016-979

Cite as

Jauk, I., Bonafonte, A. (2016) Direct Expressive Voice Training Based on Semantic Selection. Proc. Interspeech 2016, 3181-3185.

author={Igor Jauk and Antonio Bonafonte},
title={Direct Expressive Voice Training Based on Semantic Selection},
booktitle={Interspeech 2016},