Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection

Danni Liu, Gerasimos Spanakis, Jan Niehues


Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation. While offline systems are often evaluated on quality metrics like word error rate (WER) and BLEU score, latency is also a crucial factor in many practical use cases. We propose three latency reduction techniques for chunk-based incremental inference and evaluate their accuracy-latency tradeoff. On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 seconds at a cost of 1% absolute WER (6% rel.) compared to offline transcription. Although our experiments use the Transformer, the partial hypothesis selection strategies are applicable to other encoder-decoder models. To reduce expensive re-computation as new chunks arrive, we propose to use a unidirectionally-attending encoder. After an adaptation procedure to partial sequences, the unidirectional model performs on par with the original model. We further show that our approach is also applicable to speech translation. On the How2 English-Portuguese speech translation dataset, we reduce latency to 0.7 seconds (-84% rel.) while incurring a loss of 2.4 BLEU points (5% rel.) compared to the offline system.
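The abstract does not spell out the selection strategies themselves, but one natural way to realize partial hypothesis selection in chunk-based incremental inference is to re-decode the input received so far after each new chunk and commit only the prefix on which two consecutive hypotheses agree, since that prefix is unlikely to be revised later. The sketch below illustrates this idea with a toy decoder; the function names and the agreement-based rule are illustrative assumptions, not the paper's exact algorithm:

```python
def longest_common_prefix(a, b):
    """Length of the shared token prefix of two hypotheses."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def incremental_decode(chunks, decode_fn):
    """Chunk-based incremental inference with a partial hypothesis
    selection rule (illustrative sketch, not the paper's algorithm).

    After each chunk arrives, all audio so far is re-decoded and only
    the prefix agreeing with the previous chunk's hypothesis is
    committed. The final chunk commits the full hypothesis. A
    unidirectionally-attending encoder would let the re-decoding step
    reuse earlier encoder states instead of recomputing them.
    """
    committed = []   # tokens already shown to the user (only grows)
    prev_hyp = []    # hypothesis produced after the previous chunk
    audio = []       # input received so far
    for i, chunk in enumerate(chunks):
        audio.extend(chunk)
        hyp = decode_fn(audio)
        if i == len(chunks) - 1:
            stable = len(hyp)  # last chunk: commit everything
        else:
            stable = longest_common_prefix(prev_hyp, hyp)
        committed = hyp[:max(stable, len(committed))]
        prev_hyp = hyp
    return committed

# Toy "decoder": maps the amount of audio seen to a hypothesis.
hyps = {1: ["the"],
        2: ["the", "cat", "sad"],
        3: ["the", "cat", "sat", "down"]}
result = incremental_decode([[0], [0], [0]], lambda a: hyps[len(a)])
# After chunk 2, only "the" is committed (the unstable "cat sad" tail
# is held back); the final chunk commits the full corrected hypothesis.
```

The latency-accuracy tradeoff reported in the abstract corresponds to how aggressively such a rule commits tokens: committing more of each partial hypothesis lowers latency but risks emitting words the final decode would have revised.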


DOI: 10.21437/Interspeech.2020-2897

Cite as: Liu, D., Spanakis, G., Niehues, J. (2020) Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. Proc. Interspeech 2020, 3620-3624, DOI: 10.21437/Interspeech.2020-2897.


@inproceedings{Liu2020,
  author={Danni Liu and Gerasimos Spanakis and Jan Niehues},
  title={{Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={3620--3624},
  doi={10.21437/Interspeech.2020-2897},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2897}
}