Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification Using CTC-Based Soft VAD and Global Query Attention

Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoirin Kim


Keyword spotting (KWS) and speaker verification (SV) have been studied independently, although the acoustic and speaker domains are known to be complementary. In this paper, we propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information. The multi-task network tightly combines sub-networks, aiming at performance improvement in challenging conditions such as noisy environments, open-vocabulary KWS, and short-duration SV, by introducing the novel techniques of connectionist temporal classification (CTC)-based soft voice activity detection (VAD) and global query attention. Frame-level acoustic and speaker information is integrated with phonetically derived weights, forming a word-level global representation. This representation is then used to aggregate feature vectors into discriminative embeddings. Our proposed approach shows 4.06% and 26.71% relative improvements in equal error rate (EER) over the baselines for the two tasks. We also present a visualization example and the results of ablation experiments.
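The aggregation described above can be illustrated with a minimal NumPy sketch. All names and shapes here are assumptions for illustration only (the paper's actual architecture, label inventory, and weighting scheme may differ): per-frame speech probabilities are taken as 1 minus the CTC blank posterior (the soft VAD), combined with scores from a hypothetical global query vector, and used to pool frame-level features into a single embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: T frames, D-dimensional frame features.
T, D = 50, 64
rng = np.random.default_rng(0)
frames = rng.standard_normal((T, D))          # frame-level feature vectors

# CTC-based soft VAD (assumed form): probability that each frame is
# non-blank, i.e. 1 - P(blank) from per-frame CTC posteriors.
ctc_posteriors = softmax(rng.standard_normal((T, 29)))  # 28 labels + blank
speech_prob = 1.0 - ctc_posteriors[:, -1]     # soft VAD weight per frame

# Global query attention (assumed form): a single query vector scores
# every frame; scaling by sqrt(D) keeps scores in a stable range.
query = rng.standard_normal(D)
scores = frames @ query / np.sqrt(D)

# Combine attention scores with soft VAD weights in the log domain,
# then normalize so the pooling weights sum to one.
weights = softmax(scores + np.log(speech_prob + 1e-8))

# Weighted aggregation of frame features into an utterance-level embedding.
embedding = weights @ frames
```

The key point of the sketch is that noisy or non-speech frames receive small soft-VAD weights, so they contribute little to the pooled embedding regardless of their attention scores.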


 DOI: 10.21437/Interspeech.2020-1420

Cite as: Jung, M., Jung, Y., Goo, J., Kim, H. (2020) Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification Using CTC-Based Soft VAD and Global Query Attention. Proc. Interspeech 2020, 931-935, DOI: 10.21437/Interspeech.2020-1420.


@inproceedings{Jung2020,
  author={Myunghun Jung and Youngmoon Jung and Jahyun Goo and Hoirin Kim},
  title={{Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification Using CTC-Based Soft VAD and Global Query Attention}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={931--935},
  doi={10.21437/Interspeech.2020-1420},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1420}
}