September 22-25, 1997
A technique for predicting triphones by concatenating diphone or monophone models is studied. The models are connected by linear interpolation between the endpoints of piecewise linear parameter trajectories. Three types of spectral representation are compared: formants, filter amplitudes and cepstrum coefficients. The proposed technique lowers the spectral distortion of the phones for all three representations when different speakers are used for training and evaluation. The average error of the created triphones is lower in the filter and cepstrum domains than in the formant domain. This is attributed to limitations in the Analysis-by-Synthesis formant tracking algorithm. For all representations, the proposed technique gives a small improvement in the task of reordering N-best lists of sentence recognition candidates.
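The concatenation step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each model is simply a list of parameter frames (formants, filter amplitudes or cepstrum coefficients), and the names `interpolate_bridge`, `concatenate_models` and `n_steps` are hypothetical. The sketch joins two models by interpolating linearly between the last frame of the left model and the first frame of the right model.

```python
from typing import List

Frame = List[float]  # one parameter vector, e.g. formant frequencies in Hz


def interpolate_bridge(end_frame: Frame, start_frame: Frame, n_steps: int) -> List[Frame]:
    """Return n_steps frames lying strictly between end_frame and start_frame."""
    if len(end_frame) != len(start_frame):
        raise ValueError("frames must have the same number of parameters")
    bridge = []
    for k in range(1, n_steps + 1):
        w = k / (n_steps + 1)  # interpolation weight, 0 < w < 1
        bridge.append([(1 - w) * a + w * b for a, b in zip(end_frame, start_frame)])
    return bridge


def concatenate_models(left: List[Frame], right: List[Frame], n_steps: int) -> List[Frame]:
    """Join two trajectories with a linearly interpolated transition (assumed scheme)."""
    return left + interpolate_bridge(left[-1], right[0], n_steps) + right


# Illustrative values only: two formants (F1, F2) per frame.
left_model = [[500.0, 1500.0], [600.0, 1400.0]]
right_model = [[900.0, 1100.0], [950.0, 1050.0]]
triphone = concatenate_models(left_model, right_model, n_steps=1)
```

With `n_steps=1` the single inserted frame is the midpoint of the two endpoint frames. The same function applies unchanged to filter-amplitude or cepstrum vectors, since it only assumes frames of equal length.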
Bibliographic reference. Blomberg, Mats (1997): "Creating unseen triphones by phone concatenation in the spectral, cepstral and formant domains", In EUROSPEECH-1997, 1187-1190.