Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection

Meng Yu, Xuan Ji, Yi Gao, Lianwu Chen, Jie Chen, Jimeng Zheng, Dan Su, Dong Yu

Keyword detection (KWD), also known as keyword spotting, is in great demand on small devices in the era of the Internet of Things. Despite recent progress, the performance of KWD, measured in terms of precision and recall, may still degrade significantly when either non-speech ambient noise or human-voice and speech-like interference (e.g., TV, competing background talkers) is present. In this paper, we propose a general solution that addresses both kinds of environmental interference. A novel text-dependent speech enhancement (TDSE) technique using a recurrent neural network (RNN) with long short-term memory (LSTM) is presented for improving the robustness of the small-footprint KWD task in the presence of environmental noise and interfering talkers. On our large simulated and recorded noisy and far-field evaluation sets, we show that TDSE significantly improves the quality of the target keyword speech and performs particularly well under speech interference conditions. We demonstrate that KWD with the TDSE frontend significantly outperforms the baseline KWD system, with or without a generic speech enhancement frontend, in terms of equal error rate (EER) in the keyword detection evaluation.
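The abstract reports results in terms of equal error rate (EER), the operating point at which the false-alarm rate equals the false-rejection rate. As an illustration of how this metric is computed from detection scores (a generic sketch, not the paper's evaluation code; the function and variable names below are hypothetical):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER from detection scores, where higher scores mean
    'more keyword-like'. Sweeps all observed scores as thresholds and
    returns the average of FAR and FRR at their closest crossing."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_eer, best_gap = 1.0, np.inf
    for t in thresholds:
        frr = np.mean(target_scores < t)      # keyword trials rejected
        far = np.mean(nontarget_scores >= t)  # non-keyword trials accepted
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            best_eer = (far + frr) / 2
    return best_eer

# Toy example: well-separated score distributions yield a low EER.
rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 1000)   # scores on keyword trials
non = rng.normal(-2.0, 1.0, 1000)  # scores on non-keyword trials
print(equal_error_rate(tgt, non))
```

A lower EER after adding the TDSE frontend indicates that enhancement shifts the score distributions of keyword and non-keyword trials further apart.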

 DOI: 10.21437/Interspeech.2018-1668

Cite as: Yu, M., Ji, X., Gao, Y., Chen, L., Chen, J., Zheng, J., Su, D., Yu, D. (2018) Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection. Proc. Interspeech 2018, 2613-2617, DOI: 10.21437/Interspeech.2018-1668.

@inproceedings{yu18_interspeech,
  author={Meng Yu and Xuan Ji and Yi Gao and Lianwu Chen and Jie Chen and Jimeng Zheng and Dan Su and Dong Yu},
  title={Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection},
  booktitle={Proc. Interspeech 2018},
  year={2018},
  pages={2613--2617},
  doi={10.21437/Interspeech.2018-1668}
}