September 22-25, 1997
In this paper, an RNN-based spectral model is proposed to generate spectral parameters for Mandarin text-to-speech (TTS). The RNN is employed to learn the relations between linguistic features and spectral parameters. The phoneme-to-spectral-parameter rules and the coarticulation rules between adjacent phones are learned automatically and stored in the weights of the RNN. The synthesized speech sounds more fluent and smooth. The RNN is divided into two parts. The first part is synchronized with the syllable and is expected to simulate the phoneme-to-spectral-parameter rules. The second part is synchronized with the frame and is expected to simulate the coarticulation rules between adjacent phones. The line spectrum pair (LSP) parameters and the normalized energy contour are taken as target values. After training on a large database, the synthetic LSP and energy contours match the original LSP and energy contours quite well.
Moreover, the RNN-based prosodic model proposed in our previous study is combined with the spectral model to generate spectral and prosodic information efficiently. Finally, an LPC-based Mandarin TTS system is implemented to examine the performance of the spectral model. The synthetic speech sounds fluent and natural. The coarticulation problem between adjacent phones, which makes synthesized speech sound disfluent and echo-like, is alleviated. However, because of the simple structure of the LPC-based synthesizer, the clarity of the synthetic speech is limited and could be improved by using other spectral parameters as target values. For example, the modified mel-cepstrum parameters [5, 6, 7] or FFT-based spectral parameters could also be learned by the RNN and used to synthesize clearer speech.
This is an initial work on an RNN-based spectral model for text-to-speech, and several advantages can be identified. First, the large memory space required for synthesis units in a traditional TTS system is replaced by the small memory space of the RNN's weights. Second, the coarticulation effect is alleviated, producing more fluent speech. Third, the RNN-based prosodic and spectral information generators [8, 9] can easily be combined to form a more compact RNN-based TTS system.
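The two-part structure described above, with one part updated once per syllable and the other once per frame, can be illustrated with a minimal sketch. This is not the authors' network: the class name `TwoRateRNN`, the Elman-style tanh recurrence, the layer sizes, the random untrained weights, the 11-dimensional output (assumed here to be 10 LSP coefficients plus one energy value), and the externally supplied frame counts per syllable are all assumptions made for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Return a rows x cols matrix of small random weights (untrained)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def tanh_sum(a, b):
    """Element-wise tanh(a + b)."""
    return [math.tanh(x + y) for x, y in zip(a, b)]


class TwoRateRNN:
    """Hypothetical two-part recurrent net; all sizes are assumptions."""

    def __init__(self, n_ling=20, n_syl=16, n_frm=16, n_out=11, seed=0):
        rng = random.Random(seed)
        self.n_syl, self.n_frm = n_syl, n_frm
        self.W_in = make_matrix(n_syl, n_ling, rng)   # linguistic features -> syllable part
        self.W_ss = make_matrix(n_syl, n_syl, rng)    # syllable-part recurrence
        self.W_sf = make_matrix(n_frm, n_syl, rng)    # syllable part -> frame part
        self.W_ff = make_matrix(n_frm, n_frm, rng)    # frame-part recurrence
        self.W_out = make_matrix(n_out, n_frm, rng)   # frame part -> spectral output

    def forward(self, ling_feats, frames_per_syllable):
        """Return one output vector per frame for a sequence of syllables."""
        h_syl = [0.0] * self.n_syl
        h_frm = [0.0] * self.n_frm
        outputs = []
        for x, n_frames in zip(ling_feats, frames_per_syllable):
            # Syllable-synchronous part: updated once per syllable.
            h_syl = tanh_sum(matvec(self.W_in, x), matvec(self.W_ss, h_syl))
            drive = matvec(self.W_sf, h_syl)
            for _ in range(n_frames):
                # Frame-synchronous part: updated every frame; its state is
                # carried across syllable boundaries, so neighbouring units
                # influence each other's output.
                h_frm = tanh_sum(drive, matvec(self.W_ff, h_frm))
                outputs.append(matvec(self.W_out, h_frm))
        return outputs


net = TwoRateRNN()
feats = [[0.5] * 20, [0.5] * 20]        # two syllables with identical features
frames = net.forward(feats, [3, 4])     # 3 frames, then 4 frames
print(len(frames), len(frames[0]))
```

Because the frame-level state is not reset at syllable boundaries, two syllables with identical linguistic features still yield different frames depending on what precedes them, which is the kind of context dependence the coarticulation rules are meant to capture. Training against target LSP and energy contours is omitted from this sketch.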
Bibliographic reference. Hwang, Shaw-Hwa / Chen, Sin-Horng / Chang, Saga (1997): "An RNN-based spectral information generation for Mandarin text-to-speech", In EUROSPEECH-1997, 549-552.