Multilingual Speech Recognition with Self-Attention Structured Parameterization

Yun Zhu, Parisa Haghani, Anshuman Tripathi, Bhuvana Ramabhadran, Brian Farris, Hainan Xu, Han Lu, Hasim Sak, Isabel Leal, Neeraj Gaur, Pedro J. Moreno, Qian Zhang

Multilingual automatic speech recognition systems can transcribe utterances from different languages. These systems are attractive from several perspectives: they can provide quality improvements, especially for lower-resource languages, and they simplify training and deployment. End-to-end speech recognition has further simplified multilingual modeling, as a single model, rather than the several components of a classical system, has to be unified. In this paper, we investigate a streamable end-to-end multilingual system based on the Transformer Transducer [1]. We propose several techniques for adapting the self-attention architecture based on the language id. We analyze the trade-offs of each method with regard to quality gains and the number of additional parameters introduced. We conduct experiments on a real-world task consisting of five languages. Our experimental results demonstrate ~8% to ~20% relative gain over the baseline multilingual model.
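As a rough illustration of the idea of adapting self-attention parameters by language id, the sketch below selects per-language query/key projections while sharing the value projection across languages. This is a minimal NumPy sketch under assumed design choices (single head, which matrices are language-specific, class and parameter names); it is not the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class LangConditionedSelfAttention:
    """Single-head self-attention with per-language query/key projections.

    Hypothetical sketch: the paper explores several ways to structure
    self-attention parameters by language id; this picks one simple variant
    (language-specific W_q/W_k, shared W_v) for illustration.
    """

    def __init__(self, d_model, num_langs, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        # Shared value projection across all languages.
        self.w_v = rng.standard_normal((d_model, d_model)) * scale
        # One query/key projection per language id.
        self.w_q = rng.standard_normal((num_langs, d_model, d_model)) * scale
        self.w_k = rng.standard_normal((num_langs, d_model, d_model)) * scale
        self.d = d_model

    def __call__(self, x, lang_id):
        # x: (seq_len, d_model); lang_id selects the language-specific weights.
        q = x @ self.w_q[lang_id]
        k = x @ self.w_k[lang_id]
        v = x @ self.w_v
        scores = softmax(q @ k.T / np.sqrt(self.d))
        return scores @ v

attn = LangConditionedSelfAttention(d_model=8, num_langs=5)
x = np.random.default_rng(1).standard_normal((4, 8))
out_lang0 = attn(x, 0)  # same input, different language ids ...
out_lang1 = attn(x, 1)  # ... yield different attention outputs
```

The extra parameter cost of such a scheme grows with the number of languages and with how many of the projection matrices are made language-specific, which is exactly the quality-vs-parameter trade-off the paper analyzes.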

DOI: 10.21437/Interspeech.2020-2847

Cite as: Zhu, Y., Haghani, P., Tripathi, A., Ramabhadran, B., Farris, B., Xu, H., Lu, H., Sak, H., Leal, I., Gaur, N., Moreno, P.J., Zhang, Q. (2020) Multilingual Speech Recognition with Self-Attention Structured Parameterization. Proc. Interspeech 2020, 4741-4745, DOI: 10.21437/Interspeech.2020-2847.

@inproceedings{zhu20_interspeech,
  author={Yun Zhu and Parisa Haghani and Anshuman Tripathi and Bhuvana Ramabhadran and Brian Farris and Hainan Xu and Han Lu and Hasim Sak and Isabel Leal and Neeraj Gaur and Pedro J. Moreno and Qian Zhang},
  title={{Multilingual Speech Recognition with Self-Attention Structured Parameterization}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={4741--4745},
  doi={10.21437/Interspeech.2020-2847}
}