Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition

Jinhwan Park, Wonyong Sung


Attention-based models with convolutional encoders enable faster training and inference than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which increases not only the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field can suffer from looping or skipping problems when the input utterance contains the same words that also appear in nearby sentences. We believe that this is due to the insufficient receptive field length, and attempt to remedy the problem by adding positional information to the convolution-based encoder. It is shown that the word error rate (WER) of a convolutional encoder with a short receptive field can be reduced significantly by augmenting it with positional information. Visualization results are presented to demonstrate the effectiveness of adding positional information. The proposed method improves the accuracy of attention models with a convolutional encoder and achieves a WER of 10.60% on TED-LIUMv2 for an end-to-end speech recognition task.
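One common way to add positional information to encoder features is the sinusoidal positional encoding of Vaswani et al. (2017), summed onto the frame-level outputs of the convolutional encoder. The sketch below is illustrative only; the function names are hypothetical and the paper's exact positional scheme may differ.

```python
import numpy as np

def sinusoidal_positional_encoding(length, dim):
    """Transformer-style sinusoidal encoding (assumes `dim` is even).

    Returns an array of shape (length, dim) where even channels hold
    sin(pos / 10000^(2i/dim)) and odd channels hold the matching cosine.
    """
    positions = np.arange(length)[:, None]                        # (length, 1)
    div = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(positions * div)
    pe[:, 1::2] = np.cos(positions * div)
    return pe

def add_positional_info(conv_features):
    """Add positional information to (time, channels) encoder outputs."""
    t, d = conv_features.shape
    return conv_features + sinusoidal_positional_encoding(t, d)
```

Because the encoding depends only on the absolute frame index, every time step becomes distinguishable even when the local acoustic content repeats, which is the intuition behind using it to mitigate looping and skipping with short receptive fields.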


DOI: 10.21437/Interspeech.2020-3163

Cite as: Park, J., Sung, W. (2020) Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition. Proc. Interspeech 2020, 46-50, DOI: 10.21437/Interspeech.2020-3163.


@inproceedings{Park2020,
  author={Jinhwan Park and Wonyong Sung},
  title={{Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={46--50},
  doi={10.21437/Interspeech.2020-3163},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3163}
}