High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis


This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system consists of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. The proposed acoustic model architecture adopts modules from both the Tacotron 1 and Tacotron 2 models, while stability is ensured by a recently proposed purely location-based attention mechanism, suitable for generating sentences of arbitrary length. During inference, the decoder is unrolled and acoustic features are generated in a streaming manner, yielding a nearly constant latency that is independent of the sentence length. Experimental results show that the acoustic model produces feature sequences with minimal latency, about 31 times faster than real time on a computer CPU and 6.5 times faster on a mobile CPU, which meets the requirements of real-time applications on both devices. The full end-to-end system generates speech of almost natural quality, as verified by listening tests.
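The streaming idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `stream_frames`, `decoder_step`, `chunk_size`, and `feat_dim` are hypothetical, the decoder is a toy stand-in, and the number of frames is passed in explicitly, whereas a real autoregressive model would decide when to stop on its own. The sketch only shows why the work done before the first chunk is available depends on the chunk size and not on the sentence length, and how a "times faster than real time" figure translates into compute time.

```python
from typing import Callable, Iterator, List

Frame = List[float]


def stream_frames(decoder_step: Callable[[Frame], Frame],
                  n_frames: int,
                  chunk_size: int,
                  feat_dim: int = 4) -> Iterator[List[Frame]]:
    """Unroll a (toy) autoregressive decoder and yield frames in chunks.

    Assumption: n_frames is known up front; this is for illustration only.
    """
    prev: Frame = [0.0] * feat_dim      # all-zero initial frame
    chunk: List[Frame] = []
    for _ in range(n_frames):
        prev = decoder_step(prev)       # one autoregressive step
        chunk.append(prev)
        if len(chunk) == chunk_size:
            yield chunk                 # hand the chunk on (e.g. to a vocoder)
            chunk = []
    if chunk:
        yield chunk                     # final, possibly shorter chunk


def steps_until_first_chunk(n_frames: int, chunk_size: int) -> int:
    """Count decoder steps executed before the first chunk is available."""
    calls = 0

    def toy_step(prev: Frame) -> Frame:
        nonlocal calls
        calls += 1
        return [x + 1.0 for x in prev]  # placeholder for a real decoder step

    next(stream_frames(toy_step, n_frames, chunk_size))
    return calls


def compute_seconds(audio_seconds: float, speedup: float) -> float:
    """Compute time needed when generation runs `speedup` times faster than real time."""
    return audio_seconds / speedup


if __name__ == "__main__":
    for n in (50, 500, 5000):
        print(n, steps_until_first_chunk(n, chunk_size=10))
    print(compute_seconds(1.0, 31.0), compute_seconds(1.0, 6.5))
```

Because the generator is lazy, a consumer such as a vocoder can start working on the first chunk while later frames are still being produced. In this sketch the number of decoder steps before the first chunk is the same for a 50-frame and a 5000-frame utterance.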


DOI: 10.21437/Interspeech.2020-2464

Cite as: Ellinas, N., Vamvoukakis, G., Markopoulos, K., Chalamandaris, A., Maniati, G., Kakoulidis, P., Raptis, S., Sung, J.S., Park, H., Tsiakoulis, P. (2020) High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency. Proc. Interspeech 2020, 2022-2026, DOI: 10.21437/Interspeech.2020-2464.


@inproceedings{Ellinas2020,
  author={Nikolaos Ellinas and Georgios Vamvoukakis and Konstantinos Markopoulos and Aimilios Chalamandaris and Georgia Maniati and Panos Kakoulidis and Spyros Raptis and June Sig Sung and Hyoungmin Park and Pirros Tsiakoulis},
  title={{High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2022--2026},
  doi={10.21437/Interspeech.2020-2464},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2464}
}