Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation

M.A. Tuğtekin Turan, Emmanuel Vincent, Denis Jouvet

Current automatic speech recognition (ASR) systems trained on native speech often perform poorly when applied to non-native or accented speech. In this work, we propose to compute x-vector-like accent embeddings and use them as auxiliary inputs to an acoustic model trained on native data only in order to improve the recognition of multi-accent data comprising native, non-native, and accented speech. In addition, we leverage untranscribed accented training data by means of semi-supervised learning. Our experiments show that acoustic models trained with the proposed accent embeddings outperform those trained with conventional i-vector or x-vector speaker embeddings, and achieve a 15% relative word error rate (WER) reduction on non-native and accented speech w.r.t. acoustic models trained with regular spectral features only. Semi-supervised training using just 1 hour of untranscribed speech per accent yields an additional 15% relative WER reduction w.r.t. models trained on native data only.
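As a rough illustration of two ideas in the abstract, the sketch below shows one common way to feed an utterance-level embedding to an acoustic model as an auxiliary input (appending it to every frame's spectral feature vector), and how a relative WER reduction is computed. The abstract does not say how the embeddings are combined with the spectral features, so the concatenation is an assumption. The function names `append_embedding` and `relative_wer_reduction`, along with all numeric values, are hypothetical and not taken from the paper.

```python
def append_embedding(frames, embedding):
    """Concatenate one utterance-level embedding to every frame's feature vector.

    Assumption: auxiliary inputs are combined by simple concatenation; the
    paper may use a different mechanism.
    """
    return [list(frame) + list(embedding) for frame in frames]


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical toy values for illustration only.
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # 2 frames, 3 spectral features each
accent_embedding = [1.0, -1.0]                # 2-dimensional utterance-level embedding

# Each augmented frame now has 3 + 2 = 5 dimensions.
augmented = append_embedding(frames, accent_embedding)
```

Under this definition, a hypothetical baseline WER of 20.0% lowered to 17.0% corresponds to a 15% relative reduction, even though the absolute difference is only 3 points.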

DOI: 10.21437/Interspeech.2020-2742

Cite as: Turan, M.T., Vincent, E., Jouvet, D. (2020) Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation. Proc. Interspeech 2020, 1286-1290, DOI: 10.21437/Interspeech.2020-2742.

  author={M.A. Tuğtekin Turan and Emmanuel Vincent and Denis Jouvet},
  title={{Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation}},
  booktitle={Proc. Interspeech 2020},