Third International Conference on Spoken Language Processing (ICSLP 94)
When generating synthetic speech by unit concatenation, the structure and the representation of the unit inventory are major issues. Therefore, a mixed inventory structure was developed and compared with a demisyllable and a diphone inventory in a perception experiment. The results show that some of the disadvantages of the two standard methods indeed vanish when the mixed inventory structure is used. The complete inventory for German consists of 2182 units. The inventory was recorded at a sampling rate of 32 kHz. A pair comparison test confirmed a noticeable increase in naturalness compared with synthetic speech sampled at 16 kHz. However, a segmental intelligibility test showed no differences between the two sampling rates. Both error rates are comparable to those obtained with natural speech. To find the representation and manipulation algorithm that yields the best overall quality of synthetic speech, TD-PSOLA, LP-PSOLA, and simple RELP were compared. Although only a very simple LP analysis was performed, no perceptual differences between TD-PSOLA and simple RELP could be found, while LP-PSOLA speech was judged worse than the other two versions.
Bibliographic reference. Portele, Thomas / Höfer, Florian / Hess, Wolfgang J. (1994): "Structure and representation of an inventory for German speech synthesis", In ICSLP-1994, 1759-1762.