Identifying Important Time-Frequency Locations in Continuous Speech Utterances

Hassan Salami Kavaki, Michael I. Mandel

Human listeners use specific cues to recognize speech, and recent experiments have shown that certain time-frequency regions of individual utterances are more important to their correct identification than others. A model that could identify such cues or regions from clean speech would benefit speech recognition and speech enhancement by allowing those systems to focus on the important regions. In this paper, we present a model that predicts the regions of individual utterances that are important to an automatic speech recognition (ASR) “listener” by learning to add as much noise as possible to these utterances while still permitting the ASR to identify them correctly. This work uses a continuous speech recognizer to recognize multi-word utterances and builds upon our previous work, which performed the same process for an isolated word recognizer. Our experimental results indicate that the model can apply noise to obscure 90.5% of the spectrogram while leaving recognition performance nearly unchanged.
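
The idea of adding as much noise as possible while preserving recognition can be illustrated with a small sketch. This is not the authors' implementation: it assumes a power spectrogram, a mask in [0, 1] where 1 marks a bin as important (kept clean), and a hypothetical objective that adds a mask-size penalty to a recognition loss. The names `apply_noise_mask`, `obscured_fraction`, `objective`, and `sparsity_weight` are illustrative assumptions.

```python
import numpy as np

def apply_noise_mask(clean_power, noise_power, mask):
    """Add noise to every bin in proportion to (1 - mask); mask = 1 keeps a bin clean."""
    mask = np.clip(mask, 0.0, 1.0)
    return clean_power + (1.0 - mask) * noise_power

def obscured_fraction(mask, threshold=0.5):
    """Fraction of time-frequency bins whose mask value falls below the threshold."""
    return float(np.mean(mask < threshold))

def objective(recognition_loss, mask, sparsity_weight=1.0):
    """Hypothetical trade-off: low recognition loss plus a penalty on the mask size."""
    return recognition_loss + sparsity_weight * float(np.mean(mask))

# Toy example: a 4 x 5 (frequency x time) spectrogram with two "important" bins.
rng = np.random.default_rng(0)
clean = rng.random((4, 5))
noise = np.full((4, 5), 10.0)
mask = np.zeros((4, 5))
mask[1, 2] = 1.0
mask[2, 3] = 1.0

noisy = apply_noise_mask(clean, noise, mask)
frac = obscured_fraction(mask)
```

In this toy setting, 18 of the 20 bins receive noise, so `frac` is 0.9. In a trained system the mask would be predicted from the clean utterance, and the recognition loss would come from the ASR itself rather than being supplied by hand.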

DOI: 10.21437/Interspeech.2020-2637

Cite as: Kavaki, H.S., Mandel, M.I. (2020) Identifying Important Time-Frequency Locations in Continuous Speech Utterances. Proc. Interspeech 2020, 1639-1643, DOI: 10.21437/Interspeech.2020-2637.

  author={Hassan Salami Kavaki and Michael I. Mandel},
  title={{Identifying Important Time-Frequency Locations in Continuous Speech Utterances}},
  booktitle={Proc. Interspeech 2020},