End-to-End Keyword Search Based on Attention and Energy Scorer for Low Resource Languages

Zeyu Zhao, Wei-Qiang Zhang


Keyword search (KWS) is the task of locating user-specified keywords in continuous speech. Conventional KWS systems based on automatic speech recognition (ASR) first decode the input speech with ASR and then search the output for keywords. As deep neural networks (DNNs) have become increasingly popular, end-to-end (E2E) KWS systems have emerged. Their main advantage is that they avoid speech recognition altogether. Because E2E KWS is still at an early stage, its performance does not yet match that of traditional methods, and much work remains. To this end, we propose an E2E KWS model consisting of four parts: a speech encoder-decoder, a query encoder-decoder, an attention mechanism, and an energy scorer. Unlike the baseline system, which uses an auto-encoder to extract embeddings, the proposed model uses encoder-decoders to extract embeddings that contain character sequence information. The model also introduces an attention mechanism and a novel energy scorer: the former locates the keywords, and the latter makes the final decision. We train the models under low-resource conditions, with only about 10 hours of training data, in various languages. The experimental results show that the proposed model outperforms the baseline system.
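The abstract does not give the exact form of the attention mechanism or the energy scorer, so the following is only a minimal sketch of the general idea: attention weights over speech frames indicate where a query might occur, and a scalar score drives the detection decision. Everything here is an assumption made for illustration, including the scaled dot-product attention, the squared-distance energy, the function names `attend` and `energy_score`, and the synthetic embeddings; none of it is taken from the paper.

```python
import numpy as np

def attend(speech_emb, query_emb):
    """Scaled dot-product attention of one query embedding over speech frames.

    speech_emb: (T, D) frame-level embeddings (stand-in for a speech encoder output).
    query_emb:  (D,)   query embedding (stand-in for a query encoder output).
    Returns attention weights of shape (T,) and the attended context of shape (D,).
    """
    scores = speech_emb @ query_emb / np.sqrt(query_emb.shape[0])
    scores = scores - scores.max()          # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ speech_emb          # attention-weighted average of frames
    return weights, context

def energy_score(context, query_emb):
    """Hypothetical scorer: energy is a squared distance, confidence is exp(-energy)."""
    energy = np.sum((context - query_emb) ** 2)
    return float(np.exp(-energy))

# Synthetic data: low-amplitude noise frames with the "keyword" planted at frames 20-24.
rng = np.random.default_rng(0)
T, D = 50, 8
speech_emb = rng.normal(size=(T, D)) * 0.1
query_emb = np.ones(D)
speech_emb[20:25] = query_emb

weights, context = attend(speech_emb, query_emb)
peak = int(np.argmax(weights))              # frame with the highest attention weight
score = energy_score(context, query_emb)    # compare against a threshold to decide
print(peak, score)
```

In this toy setup the attention peak falls inside the planted region, and the score would be compared against a tuned threshold to accept or reject the query. A trained system would learn both components jointly rather than use these fixed formulas.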


DOI: 10.21437/Interspeech.2020-2613

Cite as: Zhao, Z., Zhang, W. (2020) End-to-End Keyword Search Based on Attention and Energy Scorer for Low Resource Languages. Proc. Interspeech 2020, 2587-2591, DOI: 10.21437/Interspeech.2020-2613.


@inproceedings{Zhao2020,
  author={Zeyu Zhao and Wei-Qiang Zhang},
  title={{End-to-End Keyword Search Based on Attention and Energy Scorer for Low Resource Languages}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2587--2591},
  doi={10.21437/Interspeech.2020-2613},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2613}
}