Phoneme Embedding and its Application to Speech Driven Talking Avatar Synthesis

Xu Li, Zhiyong Wu, Helen Meng, Jia Jia, Xiaoyan Lou, Lianhong Cai

Word embedding has achieved great success in many natural language processing tasks. However, attempts to apply word embedding to the field of speech have yielded few breakthroughs, because word vectors mainly capture semantic and syntactic information; such high-level features are harder to incorporate directly into speech-related tasks than acoustic or phoneme-related features. In this paper, we investigate a phoneme embedding method that generates phoneme vectors carrying acoustic information for speech-related tasks. One-hot representations of phoneme labels are fed into an embedding layer to generate phoneme vectors, which are then passed through a bidirectional long short-term memory (BLSTM) recurrent neural network to predict acoustic features. The weights of the embedding layer are updated through backpropagation during training. Analyses indicate that phonemes with similar acoustic pronunciations are close to each other in cosine distance in the generated phoneme vector space, and tend to fall into the same category after k-means clustering. We evaluate the phoneme embedding by applying the generated phoneme vectors to speech-driven talking avatar synthesis. Experimental results indicate that adding phoneme vectors as features achieves a 10.2% relative improvement in the objective test.
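The pipeline described in the abstract (one-hot phoneme labels → trainable embedding layer → BLSTM → acoustic feature prediction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the phoneme inventory size, embedding dimension, hidden size, and acoustic feature dimension below are all assumed values, and PyTorch is used purely as an example framework.

```python
import torch
import torch.nn as nn

class PhonemeEmbeddingBLSTM(nn.Module):
    """Sketch of the described model: phoneme labels -> embedding -> BLSTM -> acoustic features.

    All layer sizes are illustrative assumptions, not the paper's configuration.
    """
    def __init__(self, num_phonemes=40, embed_dim=32, hidden_dim=128, acoustic_dim=25):
        super().__init__()
        # The embedding layer maps integer phoneme labels (equivalent to one-hot
        # inputs) to dense phoneme vectors; its weights are updated by
        # backpropagation during training, as described in the abstract.
        self.embed = nn.Embedding(num_phonemes, embed_dim)
        self.blstm = nn.LSTM(embed_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        # Bidirectional LSTM doubles the hidden size; project to acoustic features.
        self.out = nn.Linear(2 * hidden_dim, acoustic_dim)

    def forward(self, phoneme_ids):           # (batch, time) integer phoneme labels
        vectors = self.embed(phoneme_ids)     # (batch, time, embed_dim) phoneme vectors
        hidden, _ = self.blstm(vectors)       # (batch, time, 2 * hidden_dim)
        return self.out(hidden)               # (batch, time, acoustic_dim)

model = PhonemeEmbeddingBLSTM()
y = model(torch.randint(0, 40, (2, 10)))      # batch of 2 sequences, 10 phonemes each
print(y.shape)                                # torch.Size([2, 10, 25])
```

After training, the rows of `model.embed.weight` are the learned phoneme vectors; the paper's analyses (cosine distance, k-means clustering) and the avatar-synthesis features would be computed from that matrix.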

DOI: 10.21437/Interspeech.2016-363

Cite as

Li, X., Wu, Z., Meng, H., Jia, J., Lou, X., Cai, L. (2016) Phoneme Embedding and its Application to Speech Driven Talking Avatar Synthesis. Proc. Interspeech 2016, 1472-1476.

@inproceedings{li2016phoneme,
  author={Xu Li and Zhiyong Wu and Helen Meng and Jia Jia and Xiaoyan Lou and Lianhong Cai},
  title={Phoneme Embedding and its Application to Speech Driven Talking Avatar Synthesis},
  booktitle={Interspeech 2016},
  year={2016},
  pages={1472--1476},
  doi={10.21437/Interspeech.2016-363}
}