13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Turning a Monolingual Speaker into Multilingual for a Mixed-language TTS

Ji He (1,3), Yao Qian (1), Frank K. Soong (1), Sheng Zhao (2)

(1) Microsoft Research Asia
(2) Microsoft Search Technology Center Asia, Beijing, China
(3) Tsinghua University, Beijing, China

We propose an approach to render speech in different languages from a speaker's monolingual recordings for building mixed-code TTS systems. The differences between two monolingual speakers' corpora, e.g. English and Chinese, are first equalized by warping spectral frequencies, removing F0 variation, and adjusting speaking rate across speakers and languages. The English speaker's Chinese speech is then rendered by a trajectory tiling approach: the Chinese speaker's parameter trajectories, equalized toward the English speaker, guide the search for the best sequence of 5 ms waveform "tiles" taken from the English speaker's recordings. The rendered Chinese speech, together with the English speaker's own English recordings, is finally used to train a mixed-language (English-Chinese) HMM-based TTS system. Experimental results show that the proposed approach synthesizes high-quality mixed-language speech, as confirmed by both objective and subjective evaluations.
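The tile search described above can be sketched as a Viterbi-style dynamic program: each guide frame incurs a target cost (distance to a candidate tile) and each tile-to-tile transition incurs a concatenation cost. The following is a minimal illustrative sketch, not the paper's implementation; the function name, the plain Euclidean costs, and the single weight `w_concat` are all assumptions for illustration.

```python
import numpy as np

def tile_search(guide, tiles, w_concat=0.5):
    """Pick one tile per guide frame minimizing target cost
    (distance of each tile to the guide trajectory frame) plus a
    weighted concatenation cost between consecutive tiles.
    Hypothetical simplification of the trajectory-tiling step."""
    T, N = len(guide), len(tiles)
    # target[t, n]: distance from guide frame t to candidate tile n
    target = np.linalg.norm(guide[:, None, :] - tiles[None, :, :], axis=2)
    # concat[i, j]: cost of joining tile j after tile i
    concat = np.linalg.norm(tiles[:, None, :] - tiles[None, :, :], axis=2)
    cost = target[0].copy()                    # best cost ending in each tile
    back = np.zeros((T, N), dtype=int)         # backpointers
    for t in range(1, T):
        total = cost[:, None] + w_concat * concat   # (prev tile, cur tile)
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(N)] + target[t]
    # backtrack the optimal tile sequence
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With a zero concatenation weight the search degenerates to picking the nearest tile per frame; raising `w_concat` trades frame-level accuracy for smoother joins, which is the usual unit-selection trade-off.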

Index Terms: Mixed-language TTS, HMM-based TTS, Unit Selection, Trajectory Tiling


Bibliographic reference.  He, Ji / Qian, Yao / Soong, Frank K. / Zhao, Sheng (2012): "Turning a monolingual speaker into multilingual for a mixed-language TTS", In INTERSPEECH-2012, 963-966.