Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling

Erfan Loweimi, Peter Bell, Steve Renals


In this paper we investigate the usefulness of the sign spectrum, and of its combination with the raw magnitude spectrum, in acoustic modelling for automatic speech recognition (ASR). The sign spectrum is a sequence of ±1s that captures one bit of the phase spectrum. It encodes information overlooked by the magnitude spectrum, enabling unique signal characterisation and reconstruction. In particular, we demonstrate that it carries information related to the temporal structure of the signal as well as to the source component of speech. Furthermore, we investigate combining it with the raw magnitude spectrum via multi-head CNNs at different fusion levels for ASR. Although these two streams are together information-wise equivalent to the raw waveform, the overall performance is noticeably higher than that of the raw waveform and of classic features such as MFCC and filterbank. This has been observed and verified on the TIMIT, NTIMIT, Aurora-4 and WSJ tasks, with up to 14.5% relative WER reduction.
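As a rough illustration of the two streams described above, the following sketch computes a sign spectrum and a raw magnitude spectrum for a single frame with NumPy. It is not the authors' implementation. The abstract does not say how the one bit of phase is chosen, so taking the sign of the real part of the DFT is an assumption, as are the function name `sign_and_magnitude_spectra`, the mapping of an exact zero to +1, and the toy frames.

```python
import numpy as np


def sign_and_magnitude_spectra(frame):
    """Return (sign, magnitude) spectra of one real-valued frame.

    Assumption: the sign spectrum is the sign of the real part of the
    DFT, with an exact zero mapped to +1 so every entry is +1 or -1.
    """
    spectrum = np.fft.rfft(frame)          # one-sided DFT of the frame
    magnitude = np.abs(spectrum)           # raw magnitude spectrum
    sign = np.where(spectrum.real >= 0.0, 1.0, -1.0)  # one bit per bin
    return sign, magnitude


# Toy frames (hypothetical): a unit impulse and the same impulse
# delayed by one sample.
impulse = np.zeros(8)
impulse[0] = 1.0
delayed = np.roll(impulse, 1)

sign_a, mag_a = sign_and_magnitude_spectra(impulse)
sign_b, mag_b = sign_and_magnitude_spectra(delayed)

# Both frames have a flat magnitude spectrum, so magnitude alone cannot
# tell them apart; the sign spectra differ at the highest bin.
print(np.allclose(mag_a, mag_b))
print(sign_a[-1], sign_b[-1])
```

In this toy case the delay leaves the magnitude spectrum unchanged but flips a sign bit, which gives a small example of the sign spectrum carrying information about temporal structure that the magnitude spectrum misses.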


DOI: 10.21437/Interspeech.2020-0018

Cite as: Loweimi, E., Bell, P., Renals, S. (2020) Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling. Proc. Interspeech 2020, 1644-1648, DOI: 10.21437/Interspeech.2020-0018.


@inproceedings{Loweimi2020,
  author={Erfan Loweimi and Peter Bell and Steve Renals},
  title={{Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1644--1648},
  doi={10.21437/Interspeech.2020-0018},
  url={http://dx.doi.org/10.21437/Interspeech.2020-0018}
}