A hybrid harmonics-and-bursts modelling approach to speech synthesis

Jonas Beskow, Harald Berthelsen

Statistical speech synthesis systems rely on a parametric speech generation model, typically some sort of vocoder. Vocoders work well for voiced speech because they offer independent control over the voice source (e.g. pitch) and the vocal tract filter (e.g. vowel quality) through control parameters that typically vary smoothly in time and lend themselves well to statistical modelling. Voiceless sounds and transients such as plosives and fricatives, on the other hand, exhibit fundamentally different spectro-temporal behaviour, and here the benefits of the vocoder are not as clear. In this paper, we investigate a hybrid approach to modelling the speech signal, where speech is decomposed into a harmonic part and a noise burst part through spectrogram kernel filtering. The harmonic part is modelled using a vocoder and statistical parameter generation, while the burst part is modelled by concatenation. The two channels are then mixed together to form the final synthesized waveform. The proposed method was compared against a state-of-the-art statistical speech synthesis system (HTS 2.3) in a perceptual evaluation, which revealed that the harmonics-plus-bursts method was perceived as significantly more natural than the purely statistical variant.
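The abstract does not specify the kernels used for the spectrogram filtering. As a rough illustration only, the sketch below assumes median kernels, as commonly used in harmonic/percussive separation: a kernel running along time favours energy that is steady within a frequency bin (harmonics), while a kernel running along frequency favours short broadband events (bursts). The function names, kernel widths and soft-mask formula are hypothetical choices for this example and are not taken from the paper.

```python
from statistics import median


def median_filter_1d(values, width):
    """Sliding median over a list; windows are truncated at the edges."""
    half = width // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def split_harmonic_burst(spec, time_width=5, freq_width=5, eps=1e-12):
    """Split a magnitude spectrogram into harmonic and burst parts.

    spec is a list of rows, one row per frequency bin and one column per
    frame. Kernel widths are illustrative assumptions, not paper values.
    """
    n_freq, n_time = len(spec), len(spec[0])

    # Kernel along time: keeps energy that persists within a frequency bin.
    h_enh = [median_filter_1d(row, time_width) for row in spec]

    # Kernel along frequency: keeps energy that spans many bins in one frame.
    cols = [median_filter_1d([spec[f][t] for f in range(n_freq)], freq_width)
            for t in range(n_time)]

    harmonic, burst = [], []
    for f in range(n_freq):
        h_row, b_row = [], []
        for t in range(n_time):
            h2 = h_enh[f][t] ** 2
            b2 = cols[t][f] ** 2
            # Soft mask in [0, 1]; the two parts always sum to the input.
            mask = h2 / (h2 + b2 + eps)
            h_row.append(spec[f][t] * mask)
            b_row.append(spec[f][t] * (1.0 - mask))
        harmonic.append(h_row)
        burst.append(b_row)
    return harmonic, burst
```

In a pipeline like the one described, the two masked spectrograms would each be inverted to a waveform before the harmonic channel is passed to the vocoder and the burst channel to the concatenative component; that step is omitted here.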

DOI: 10.21437/SSW.2016-34

Cite as

Beskow, J., Berthelsen, H. (2016) A hybrid harmonics-and-bursts modelling approach to speech synthesis. Proc. 9th ISCA Speech Synthesis Workshop, 208-213.

author={Jonas Beskow and Harald Berthelsen},
title={A hybrid harmonics-and-bursts modelling approach to speech synthesis},
booktitle={9th ISCA Speech Synthesis Workshop},