Adaptive Speaker Normalization for CTC-Based Speech Recognition

Fenglin Ding, Wu Guo, Bin Gu, Zhen-Hua Ling, Jun Du


In this paper, we propose a new speaker normalization technique for acoustic model adaptation in connectionist temporal classification (CTC)-based automatic speech recognition. In the proposed method, the mean and variance of each activation at the inputs of a hidden layer are first estimated at the speaker level. Each speaker's representations are then normalized independently so that they follow a standard normal distribution. Furthermore, we propose using an auxiliary network to dynamically generate the scaling and shifting parameters of speaker normalization, and an attention mechanism is introduced to further improve performance. Experiments are conducted on the public Chinese dataset AISHELL-1. The proposed methods prove highly effective in adapting the CTC model, achieving up to a 17.5% character error rate improvement over the speaker-independent (SI) model.
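The core normalization step described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: it standardizes one speaker's hidden-layer activations using speaker-level statistics and then applies scaling and shifting parameters. In the paper, `gamma` and `beta` are generated dynamically by an auxiliary network with attention; here they are fixed constants purely for illustration.

```python
import numpy as np

def speaker_normalize(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one speaker's activations h (frames x dims) so each
    activation dimension follows a standard normal distribution, then
    apply scale (gamma) and shift (beta). In the paper, gamma and beta
    come from an auxiliary network; fixed values are used here."""
    mu = h.mean(axis=0)           # per-dimension mean over this speaker's frames
    var = h.var(axis=0)           # per-dimension variance over this speaker's frames
    h_norm = (h - mu) / np.sqrt(var + eps)
    return gamma * h_norm + beta

# Toy activations for a single speaker (200 frames, 8 hidden units),
# deliberately offset and scaled away from a standard normal.
rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=2.0, size=(200, 8))
out = speaker_normalize(h)
print(out.mean(axis=0).round(3))  # ~0 per dimension
print(out.std(axis=0).round(3))   # ~1 per dimension
```

Normalizing per speaker (rather than per mini-batch) is what removes speaker-dependent shifts in the activation statistics before the CTC layers see them.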


DOI: 10.21437/Interspeech.2020-1390

Cite as: Ding, F., Guo, W., Gu, B., Ling, Z.-H., Du, J. (2020) Adaptive Speaker Normalization for CTC-Based Speech Recognition. Proc. Interspeech 2020, 1266-1270, DOI: 10.21437/Interspeech.2020-1390.


@inproceedings{Ding2020,
  author={Fenglin Ding and Wu Guo and Bin Gu and Zhen-Hua Ling and Jun Du},
  title={{Adaptive Speaker Normalization for CTC-Based Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1266--1270},
  doi={10.21437/Interspeech.2020-1390},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1390}
}