Emitting Word Timings with End-to-End Models

Tara N. Sainath, Ruoming Pang, David Rybach, Basi García, Trevor Strohman


Having end-to-end (E2E) models emit the start and end times of words on-device is important for various applications. This unsolved problem presents challenges with respect to model size, latency and accuracy. In this paper, we present an approach to emitting word timings by constraining the attention head of the Listen, Attend, Spell (LAS) 2nd-pass rescorer [1]. On a Voice Search task, we show that this approach does not degrade accuracy compared to when no attention head is constrained, and that it meets on-device size and latency constraints. In comparison, constraining the alignment of a 1st-pass Recurrent Neural Network Transducer (RNN-T) model to emit word timings degrades quality. Furthermore, a low-frame-rate conventional acoustic model [2], which is trained with a constrained alignment and is widely used for word timings, is slower to detect start and end times than our proposed 2nd-pass LAS approach.
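To make the core idea concrete, below is a minimal sketch of how word timings might be read off a decoder's per-token attention distributions: take the peak-attended encoder frame for each output token, convert frame indices to seconds, and merge subword tokens into words. The word_timings function, the SentencePiece-style "▁" word-boundary marker, and the 60 ms frame duration are illustrative assumptions, not details taken from the paper.

import numpy as np

def word_timings(tokens, attention, frame_ms=60.0):
    """Map each word to (start_sec, end_sec) via the peak-attended frame.

    tokens:    list of subword strings; a leading "▁" marks a word start
    attention: np.ndarray of shape [len(tokens), num_frames], one
               attention distribution over encoder frames per token
    frame_ms:  duration of one encoder frame in milliseconds (assumed;
               depends on the encoder's time reduction)
    """
    # Peak-attended encoder frame for each output token.
    peaks = attention.argmax(axis=1)

    words, spans = [], []
    for tok, frame in zip(tokens, peaks):
        t = frame * frame_ms / 1000.0
        if tok.startswith("▁") or not spans:
            # New word: open a fresh [start, end] span.
            words.append(tok.lstrip("▁"))
            spans.append([t, t])
        else:
            # Word continuation: extend the current word's end time.
            words[-1] += tok
            spans[-1][1] = max(spans[-1][1], t)
    return list(zip(words, [tuple(s) for s in spans]))

if __name__ == "__main__":
    # Toy example: 3 subword tokens attending over 10 encoder frames.
    tokens = ["▁hel", "lo", "▁world"]
    attention = np.zeros((3, 10))
    attention[0, 1] = attention[1, 3] = attention[2, 7] = 1.0
    for word, (start, end) in word_timings(tokens, attention):
        print(f"{word}: {start:.2f}s - {end:.2f}s")

Note that this extraction is only meaningful when, as in the paper, the attention head is constrained during training so that its peaks land near word boundaries; with an unconstrained head the argmax frames need not correspond to accurate start and end times.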


DOI: 10.21437/Interspeech.2020-1059

Cite as: Sainath, T.N., Pang, R., Rybach, D., García, B., Strohman, T. (2020) Emitting Word Timings with End-to-End Models. Proc. Interspeech 2020, 3615-3619, DOI: 10.21437/Interspeech.2020-1059.


@inproceedings{Sainath2020,
  author={Tara N. Sainath and Ruoming Pang and David Rybach and Basi García and Trevor Strohman},
  title={{Emitting Word Timings with End-to-End Models}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={3615--3619},
  doi={10.21437/Interspeech.2020-1059},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1059}
}