Joint Grapheme and Phoneme Embeddings for Contextual End-to-End ASR

Zhehuai Chen, Mahaveer Jain, Yongqiang Wang, Michael L. Seltzer, Christian Fuegen

End-to-end approaches to automatic speech recognition, such as Listen-Attend-Spell (LAS), blend all components of a traditional speech recognizer into a unified model. Although this simplifies the training and decoding pipelines, a unified model is hard to adapt when there is a mismatch between training and test data, especially when the relevant contextual information changes dynamically. The Contextual LAS (CLAS) framework addresses this problem by encoding contextual entities into fixed-dimensional embeddings and using an attention mechanism to model the probability of seeing each entity. In this work, we improve the CLAS approach by proposing several new strategies for extracting embeddings of the contextual entities. We compare these embedding extractors based on graphemic and phonetic input and/or output sequences, and show that an encoder-decoder model trained jointly towards graphemes and phonemes outperforms the other approaches. Leveraging phonetic information gives better discrimination between similarly written graphemic sequences and also helps the model generalize to graphemic sequences unseen in training. We show significant improvements over the original CLAS approach and also demonstrate that the proposed method scales much better to a large number of contextual entities across multiple domains.
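
The CLAS mechanism summarized above (fixed-dimensional entity embeddings plus attention over them) can be illustrated with a minimal sketch. This is not the authors' implementation: the hypothetical `embed_phrase` function is a deterministic toy stand-in for a learned embedding extractor, the dot-product scoring in `attend` is an assumption chosen for brevity, and the entity strings are made up for illustration.

```python
import math

def embed_phrase(phrase, dim=8):
    """Toy stand-in for a learned encoder (assumption, not the paper's model):
    maps a phrase of any length to a fixed-size vector."""
    vec = [0.0] * dim
    for pos, ch in enumerate(phrase.lower()):
        # Deterministic pseudo-features derived from character code and position.
        for k in range(dim):
            vec[k] += math.sin((ord(ch) + 1) * (k + 1) + pos)
    n = max(len(phrase), 1)
    return [v / n for v in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, embeddings):
    """Dot-product attention (an assumption): returns a probability per entity
    and the attention-weighted context vector."""
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]
    weights = softmax(scores)
    dim = len(embeddings[0])
    context = [sum(w * emb[k] for w, emb in zip(weights, embeddings))
               for k in range(dim)]
    return weights, context

# Hypothetical contextual entities, two of them written similarly.
entities = ["Joan Smith", "John Smyth", "play music"]
embeddings = [embed_phrase(p) for p in entities]

# Pretend the decoder state happens to equal one entity's embedding.
query = embeddings[1]
weights, context = attend(query, embeddings)
```

In a real system the query would come from the decoder state and the embeddings from a trained extractor. The sketch only shows that variable-length entities become same-sized vectors over which attention yields a distribution.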

DOI: 10.21437/Interspeech.2019-1434

Cite as: Chen, Z., Jain, M., Wang, Y., Seltzer, M.L., Fuegen, C. (2019) Joint Grapheme and Phoneme Embeddings for Contextual End-to-End ASR. Proc. Interspeech 2019, 3490-3494, DOI: 10.21437/Interspeech.2019-1434.

  author={Zhehuai Chen and Mahaveer Jain and Yongqiang Wang and Michael L. Seltzer and Christian Fuegen},
  title={{Joint Grapheme and Phoneme Embeddings for Contextual End-to-End ASR}},
  booktitle={Proc. Interspeech 2019},