CAM: Uninteresting Speech Detector

Weiyi Lu, Yi Xu, Peng Yang, Belinda Zeng

Voice assistants such as Siri and Alexa usually adopt a pipeline to process users’ utterances, which generally includes transcribing the audio into text, understanding the text, and finally responding to the user. One potential issue is that some utterances are devoid of any interesting speech and are thus not worth processing through the entire pipeline. Examples of uninteresting utterances include those that contain too much noise or no intelligible speech. It is therefore desirable to have a model that filters out such utterances before they are ingested for downstream processing, thus saving system resources. To this end, we propose the Combination of Audio and Metadata (CAM) detector to identify utterances that contain only uninteresting speech. Our experimental results show that the CAM detector considerably outperforms both an audio model alone and a metadata model alone, which demonstrates the effectiveness of the proposed system.
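The filtering idea can be illustrated with a minimal sketch. The abstract does not say how CAM combines its audio and metadata models, so everything below is an assumption made for illustration: the `Utterance` fields, the weighted-average fusion rule, the `audio_weight` parameter, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    # Hypothetical per-utterance scores in [0, 1]; higher means
    # "more likely uninteresting". Not the paper's actual features.
    audio_score: float
    metadata_score: float


def combined_score(utt, audio_weight=0.5):
    """Weighted average of the two scores (an assumed fusion rule)."""
    return audio_weight * utt.audio_score + (1.0 - audio_weight) * utt.metadata_score


def filter_utterances(utterances, threshold=0.5):
    """Keep only utterances whose combined score is below the threshold,
    so that only these are passed on to the downstream pipeline."""
    return [u for u in utterances if combined_score(u) < threshold]
```

In this sketch, an utterance scored highly by both models is dropped before transcription, while one scored low by both is kept. A learned joint model could replace the weighted average without changing the gating logic.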

DOI: 10.21437/Interspeech.2020-1192

Cite as: Lu, W., Xu, Y., Yang, P., Zeng, B. (2020) CAM: Uninteresting Speech Detector. Proc. Interspeech 2020, 681-685, DOI: 10.21437/Interspeech.2020-1192.

  author={Weiyi Lu and Yi Xu and Peng Yang and Belinda Zeng},
  title={{CAM: Uninteresting Speech Detector}},
  booktitle={Proc. Interspeech 2020},