Correlational Networks for Speaker Normalization in Automatic Speech Recognition

Rini A Sharon, Sandeep Reddy Kothinti, Umesh Srinivasan

In this paper, we propose using common representation learning (CRL) for speaker normalization in automatic speech recognition (ASR). Conventional methods such as feature-space maximum likelihood linear regression (fMLLR) require two-pass decoding, and their performance is often limited by the amount of data available at test time. While i-vectors do not require two-pass decoding, they need a significant number of input frames for estimation. As an alternative, we propose a regression model that employs correlational neural networks (CorrNet) for multi-view CRL. In this approach, the CorrNet training methodology treats normalized and un-normalized features as two parallel views of the same speech data. Once trained, the network generates frame-wise fMLLR-like features, thus overcoming these limitations of fMLLR and i-vectors. The recognition accuracy obtained with the proposed CorrNet-generated features is comparable to that of the i-vector counterparts and significantly better than that of un-normalized features such as filterbank. With CorrNet features, we obtain absolute improvements in word error rate of 2.5% for TIMIT, 2.69% for WSJ84, and 3.2% for Switchboard-33hour over un-normalized features.
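The abstract does not state the training objective. As a rough illustration of the two-view idea, the following is a minimal sketch that assumes an objective in the style of the original correlational neural network formulation: the normalized (fMLLR-like) view is reconstructed from each view's hidden representation, and the correlation between the two hidden representations is rewarded. The function names, the single tanh hidden layer, the shared decoder `V_y`, the dimensions, and the weight `lam` are all hypothetical and are not taken from the paper.

```python
import numpy as np

def correlation_term(h_x, h_y, eps=1e-8):
    """Sum over hidden units of the Pearson correlation between two views.

    h_x, h_y: arrays of shape (frames, hidden_units).
    """
    hx = h_x - h_x.mean(axis=0)
    hy = h_y - h_y.mean(axis=0)
    num = (hx * hy).sum(axis=0)
    den = np.sqrt((hx ** 2).sum(axis=0) * (hy ** 2).sum(axis=0)) + eps
    return float((num / den).sum())

def corrnet_style_loss(x, y, W_x, W_y, V_y, lam=1.0):
    """Hypothetical CorrNet-style loss for one batch of frames.

    x: un-normalized view (e.g. filterbank), shape (frames, dim_x)
    y: normalized view (e.g. fMLLR), shape (frames, dim_y)
    """
    h_x = np.tanh(x @ W_x)      # hidden code from the un-normalized view
    h_y = np.tanh(y @ W_y)      # hidden code from the normalized view
    y_from_x = h_x @ V_y        # normalized features predicted from x alone
    y_from_y = h_y @ V_y        # self-reconstruction of the normalized view
    recon = np.mean((y - y_from_x) ** 2) + np.mean((y - y_from_y) ** 2)
    # Subtracting the correlation pushes both views toward a common representation.
    return recon - lam * correlation_term(h_x, h_y)

# Toy usage with random data (assumed sizes: 40-dim features, 8 hidden units).
rng = np.random.default_rng(0)
x = rng.standard_normal((20, 40))
y = rng.standard_normal((20, 40))
W_x = 0.1 * rng.standard_normal((40, 8))
W_y = 0.1 * rng.standard_normal((40, 8))
V_y = 0.1 * rng.standard_normal((8, 40))
loss = corrnet_style_loss(x, y, W_x, W_y, V_y, lam=1.0)
```

Under this assumed setup, only the `x`-to-`y_from_x` path would be needed at test time, which is consistent with the abstract's claim that features are generated frame by frame without a second decoding pass.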

DOI: 10.21437/Interspeech.2018-1612

Cite as: Sharon, R.A., Kothinti, S.R., Srinivasan, U. (2018) Correlational Networks for Speaker Normalization in Automatic Speech Recognition. Proc. Interspeech 2018, 882-886, DOI: 10.21437/Interspeech.2018-1612.

  author={Rini A Sharon and Sandeep Reddy Kothinti and Umesh Srinivasan},
  title={Correlational Networks for Speaker Normalization in Automatic Speech Recognition},
  booktitle={Proc. Interspeech 2018},