Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions

Ludwig Kürzinger, Nicolas Lindae, Palle Klewitz, Gerhard Rigoll


Many end-to-end Automatic Speech Recognition (ASR) systems still rely on pre-processed frequency-domain features that are handcrafted to emulate human hearing. Our work is motivated by recent advances in integrated learnable feature extraction. To this end, we propose Lightweight Sinc-Convolutions (LSC), which integrate Sinc-convolutions with depthwise convolutions as a low-parameter, machine-learnable feature extraction front-end for end-to-end ASR systems.
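A minimal sketch of such a front-end is shown below, assuming a PyTorch implementation: a Sinc-convolution whose only learnable filter parameters are cutoff frequencies, followed by a depthwise convolution. The filter count, kernel sizes, strides, and initialization values are illustrative assumptions, not the configuration used in the paper, which may also stack several such layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SincConv(nn.Module):
    """Band-pass filters parameterized only by learnable cutoff frequencies."""

    def __init__(self, out_channels=128, kernel_size=101, sample_rate=16000, stride=1):
        super().__init__()
        self.sample_rate, self.stride = sample_rate, stride
        # Learnable low cutoff and bandwidth per filter, in Hz (illustrative init).
        self.low_hz = nn.Parameter(torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels))
        self.band_hz = nn.Parameter(torch.full((out_channels,), 100.0))
        # Fixed time axis (in seconds) and Hamming window; these are not learned.
        n = (torch.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):                      # x: (batch, 1, samples)
        low = torch.abs(self.low_hz)
        high = torch.clamp(low + torch.abs(self.band_hz), max=self.sample_rate / 2)
        t = self.n.unsqueeze(0)                # (1, kernel_size)
        # Ideal band-pass = difference of two windowed low-pass sinc filters.
        lp_high = 2 * high.unsqueeze(1) * torch.sinc(2 * high.unsqueeze(1) * t)
        lp_low = 2 * low.unsqueeze(1) * torch.sinc(2 * low.unsqueeze(1) * t)
        filters = (lp_high - lp_low) * self.window
        return F.conv1d(x, filters.unsqueeze(1), stride=self.stride)


class LSCBlock(nn.Module):
    """Sinc-convolution followed by a depthwise convolution (one filter per channel)."""

    def __init__(self, channels=128, sample_rate=16000):
        super().__init__()
        self.sinc = SincConv(out_channels=channels, sample_rate=sample_rate)
        self.depthwise = nn.Conv1d(channels, channels, kernel_size=9,
                                   stride=2, groups=channels)  # groups == channels

    def forward(self, x):
        return self.depthwise(self.sinc(x))


waveform = torch.randn(4, 1, 16000)            # batch of 4, one second of 16 kHz audio
features = LSCBlock()(waveform)
print(features.shape)                          # (4, 128, downsampled frames)
```

Parameterizing each filter by two cutoff frequencies instead of a full kernel, and keeping the subsequent convolution depthwise, is what keeps the feature extraction low-parameter compared to standard 2D convolutions over spectrogram features.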

We integrated LSC into the hybrid CTC/attention architecture for evaluation. The resulting end-to-end model shows smooth convergence behaviour that is further improved by applying SpecAugment in the time domain. We also discuss filter-level improvements, such as using log-compression as activation function. Our model achieves a word error rate of 10.7% on the TEDlium v2 test dataset, surpassing the corresponding architecture with log-mel filterbank features by an absolute 1.9%, while requiring only 21% of its model size.
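The two training-level ideas mentioned above could be sketched as follows; both forms are assumptions for illustration (the exact log-compression offset and the masking policy in the paper may differ): log-compression applied to filter outputs, and SpecAugment-style time masking applied directly to the raw waveform.

```python
import torch


def log_compression(x: torch.Tensor) -> torch.Tensor:
    """Compress the dynamic range of filter outputs, analogous to log-mel energies."""
    return torch.log(torch.abs(x) + 1.0)


def time_mask(waveform: torch.Tensor, max_width: int = 1600) -> torch.Tensor:
    """Zero out one random contiguous span of samples (time masking on raw audio)."""
    samples = waveform.shape[-1]
    width = int(torch.randint(0, max_width + 1, (1,)))
    start = int(torch.randint(0, max(samples - width, 1), (1,)))
    out = waveform.clone()
    out[..., start:start + width] = 0.0
    return out
```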


 DOI: 10.21437/Interspeech.2020-1392

Cite as: Kürzinger, L., Lindae, N., Klewitz, P., Rigoll, G. (2020) Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions. Proc. Interspeech 2020, 1659-1663, DOI: 10.21437/Interspeech.2020-1392.


@inproceedings{Kürzinger2020,
  author={Ludwig Kürzinger and Nicolas Lindae and Palle Klewitz and Gerhard Rigoll},
  title={{Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1659--1663},
  doi={10.21437/Interspeech.2020-1392},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1392}
}