Sixth ISCA Workshop on Speech Synthesis

Bonn, Germany
August 22-24, 2007

Spectral Conversion based on Statistical Models Including Time-Sequence Matching

Yoshihiko Nankaku (1), Kenichi Nakamura (1), Tomoki Toda (2), Keiichi Tokuda (1)

(1) Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, Aichi, Japan
(2) Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara, Japan

This paper proposes a spectral conversion technique based on a new statistical model that includes time-sequence matching. In conventional GMM-based approaches, Dynamic Programming (DP) matching between source and target feature sequences is performed prior to the training of the GMMs. This matching typically relies on a frame-level similarity measure such as the Euclidean distance, which may be inappropriate for converting spectral features. The likelihood function of the proposed model can directly handle two sequences of different lengths, with the frame alignment between source and target feature sequences represented by discrete hidden variables. In the proposed algorithm, the maximum likelihood criterion is applied consistently to the training of model parameters, sequence matching, and spectral conversion. In a subjective preference test, the proposed method was superior to the conventional GMM-based method.
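To make the conventional preprocessing step concrete, the following is a minimal sketch of DP matching between two feature sequences of different lengths, using the Euclidean distance as the frame-level measure. It illustrates the baseline alignment that precedes GMM training, not the proposed model. The function names `euclidean` and `dp_align`, the three allowed step directions, and the toy one-dimensional features are assumptions made for illustration; practical systems may use different path constraints and higher-dimensional spectral features.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_align(source, target):
    """Align two frame sequences by dynamic programming.

    Returns (total_cost, path), where path is a list of (i, j) pairs
    matching source frame i to target frame j (0-based indices).
    Assumed step pattern: diagonal, vertical, and horizontal moves.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j]: minimum accumulated distance aligning the first i source
    # frames with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy example: a 3-frame source and a 4-frame target.
source = [[0.0], [1.0], [2.0]]
target = [[0.0], [0.0], [1.0], [2.0]]
total_cost, path = dp_align(source, target)
print(total_cost, path)
```

In this baseline the alignment is fixed once, by a distance criterion that is independent of the conversion model; the aligned frame pairs are then used as joint training data. The proposed approach instead treats the alignment as a hidden variable inside the likelihood function.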

