Optimizing Voice Activity Detection for Noisy Conditions

Ruixi Lin, Charles Costello, Charles Jankowski, Vishwas Mruthyunjaya

In this work, we focus on improving Voice Activity Detection (VAD) in noisy conditions. We propose a Convolutional Neural Network (CNN) based model, as well as a Denoising Autoencoder (DAE), and evaluate them with acoustic features and their delta features at noise levels ranging from SNR 35 dB to 0 dB. The experiments compare model configurations to find the one that performs most robustly in noisy conditions. We observe that combining more expressive audio features with the use of DAEs improves accuracy, especially as noise increases. At 0 dB, the proposed model trained with the best feature set achieves a lab test accuracy of 93.2% (averaged across all noise levels) and 88.6% inference accuracy on device. We also compress the neural network and deploy an inference model optimized for the app, reducing the average on-device CPU usage from 37% to 14%.
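As an illustration of what the noise levels above mean in practice, the following minimal sketch scales a noise signal so that the speech-to-noise power ratio of the mixture matches a target SNR in dB. It is not the authors' data pipeline: the function names `mix_at_snr` and `measured_snr_db`, the synthetic sine "speech", and the Gaussian noise are all assumptions made for the example.

```python
import math
import random


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 10 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins: a 220 Hz tone as "speech", Gaussian samples as noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

for snr_db in (35, 20, 10, 0):
    mixture = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, mixture), 2))
```

At 0 dB the scaled noise carries as much power as the speech, which is the hardest condition in the range stated above; at 35 dB the noise power is roughly 3000 times smaller than the speech power.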

DOI: 10.21437/Interspeech.2019-1776

Cite as: Lin, R., Costello, C., Jankowski, C., Mruthyunjaya, V. (2019) Optimizing Voice Activity Detection for Noisy Conditions. Proc. Interspeech 2019, 2030-2034, DOI: 10.21437/Interspeech.2019-1776.

  author={Ruixi Lin and Charles Costello and Charles Jankowski and Vishwas Mruthyunjaya},
  title={{Optimizing Voice Activity Detection for Noisy Conditions}},
  booktitle={Proc. Interspeech 2019},