ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition (SAPA2006)

Pittsburgh, PA, USA
September 16, 2006

LSM-Based Feature Extraction for Concatenative Speech Synthesis

Jerome R. Bellegarda

Speech & Language Technologies, Apple Computer Inc., Cupertino, CA, USA

In modern concatenative synthesis, unit selection is normally cast as a multivariate optimization task, yet comprehensively encapsulating the underlying problem of perceptual audition into a rich enough mathematical framework remains a major challenge. Objective functions typically considered to quantify acoustic discontinuities, for example, do not closely reflect users' perception of the concatenated waveform. This paper considers an alternative feature extraction paradigm, which eschews general-purpose Fourier analysis in favor of a modal decomposition separately optimized for each boundary region. The ensuing transform preserves, by construction, those properties of the signal which are globally relevant to each concatenation considered. This leads to a join cost strategy which jointly, albeit implicitly, accounts for both interframe incoherence and discrepancies in formant frequencies/bandwidths. Systematic listening tests underscore the viability of the proposed approach in accounting for the perception of discontinuity between acoustic units.
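The abstract's idea of a modal decomposition optimized per boundary, with a join cost measured in the resulting modal space, can be sketched as follows. This is an illustrative reading, not the paper's implementation: the function name `lsm_join_cost`, the use of raw spectral frames, the retained rank, and the Euclidean edge-frame distance are all assumptions chosen for the example.

```python
import numpy as np

def lsm_join_cost(left_frames, right_frames, rank=8):
    """Illustrative SVD-based (LSM-style) join cost.

    left_frames, right_frames: (n_frames, n_bins) arrays of spectral
    frames from the two sides of a candidate concatenation boundary.
    All names and parameters here are assumptions for this sketch.
    """
    # Stack both boundary regions into one matrix and decompose it,
    # so the retained modes are tailored to this boundary only.
    W = np.vstack([left_frames, right_frames])
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    V = Vt[:rank].T                          # retained modes, (n_bins, rank)
    # Project every frame onto the boundary-specific modal space.
    proj = W @ V                             # (n_frames_total, rank)
    left_edge = proj[len(left_frames) - 1]   # last frame before the join
    right_edge = proj[len(left_frames)]      # first frame after the join
    # Join cost: distance between the two edge frames in modal space.
    return float(np.linalg.norm(left_edge - right_edge))

# Usage with synthetic "spectra" standing in for real boundary frames.
rng = np.random.default_rng(0)
cost = lsm_join_cost(rng.standard_normal((10, 64)),
                     rng.standard_normal((10, 64)))
```

Because the decomposition is recomputed for each candidate boundary, the retained modes capture whatever structure the two units share there, which is the sense in which the transform is "separately optimized for each boundary region."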


Bibliographic reference. Bellegarda, Jerome R. (2006): "LSM-based feature extraction for concatenative speech synthesis", in SAPA-2006, 59-64.