Fast and Slow Acoustic Model

Kshitiz Kumar, Emilian Stoimenov, Hosam Khalil, Jian Wu

In this work we lay out a Fast & Slow (F&S) acoustic model (AM) in an encoder-decoder architecture for streaming automatic speech recognition (ASR). The Slow model represents our baseline ASR model; it is significantly larger than the Fast model and provides stronger accuracy. The Fast model is generally developed for related speech applications. It has weaker ASR accuracy but is faster to evaluate and consequently leads to better user-perceived latency. We propose a joint F&S model that encodes output state information from the Fast model and feeds it to the Slow model to improve the overall accuracy of the F&S AM. We demonstrate scenarios where individual Fast and Slow models are already available to build the joint F&S model. We apply our work to a large vocabulary ASR task. Compared to the Slow AM, our Fast AM is 3–4× smaller and 11.5% relatively weaker in ASR accuracy. The proposed F&S AM achieves a 4.7% relative gain over the Slow AM. We also report a progression of techniques and improve the relative gain to 8.1% by encoding additional Fast AM outputs. Our proposed framework has generic attributes: we demonstrate a specific extension by encoding two Slow models to achieve a 12.2% relative gain.
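The data flow described above (Fast AM output states are encoded and fed to the Slow AM) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `TinyAM` class, the layer sizes, the random weights, and the choice to encode only the top-1 Fast state per frame as a learned-style embedding concatenated with the acoustic features. The actual F&S encoding and training procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TinyAM:
    """Hypothetical stand-in for an acoustic model: one hidden layer, random weights."""
    def __init__(self, in_dim, hidden, n_states):
        self.w1 = rng.standard_normal((in_dim, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, n_states)) * 0.1

    def __call__(self, x):
        # Returns per-frame posteriors over output states.
        h = np.tanh(x @ self.w1)
        return softmax(h @ self.w2)

def encode_fast_output(posteriors, embed):
    # Assumed encoding: look up an embedding for the top-1 Fast state of each frame.
    top1 = posteriors.argmax(axis=-1)
    return embed[top1]

# Illustrative sizes only (not the paper's configuration).
FEAT_DIM, N_STATES, EMB_DIM, FRAMES = 40, 50, 8, 6

fast = TinyAM(FEAT_DIM, hidden=16, n_states=N_STATES)            # small Fast AM
slow = TinyAM(FEAT_DIM + EMB_DIM, hidden=64, n_states=N_STATES)  # larger Slow AM
embed = rng.standard_normal((N_STATES, EMB_DIM)) * 0.1

features = rng.standard_normal((FRAMES, FEAT_DIM))  # stand-in acoustic features
fast_post = fast(features)                          # Fast AM runs first
slow_in = np.concatenate(
    [features, encode_fast_output(fast_post, embed)], axis=-1
)
fs_post = slow(slow_in)                             # joint F&S posteriors
```

In this sketch, "encoding additional Fast AM outputs" would correspond to replacing the top-1 lookup with a richer encoding, for example embeddings of several top-ranked states per frame; that reading is also an assumption.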

DOI: 10.21437/Interspeech.2020-2887

Cite as: Kumar, K., Stoimenov, E., Khalil, H., Wu, J. (2020) Fast and Slow Acoustic Model. Proc. Interspeech 2020, 541-545, DOI: 10.21437/Interspeech.2020-2887.

  author={Kshitiz Kumar and Emilian Stoimenov and Hosam Khalil and Jian Wu},
  title={{Fast and Slow Acoustic Model}},
  booktitle={Proc. Interspeech 2020},