Metadata-Aware End-to-End Keyword Spotting

Hongyi Liu, Apurva Abhyankar, Yuriy Mishchenko, Thibaud Sénéchal, Gengshen Fu, Brian Kulis, Noah D. Stein, Anish Shah, Shiv Naga Prasad Vitaladevuni


As a crucial part of Alexa products, our on-device keyword spotting system detects the wakeword in conversation and initiates subsequent user-device interactions. Convolutional neural networks (CNNs) have been widely used to model the relationship between time and frequency in the audio spectrum. However, it is not obvious how to appropriately leverage the rich descriptive information from device state metadata (such as player state, device type, volume, etc.) in a CNN architecture. In this paper, we propose to use metadata information as an additional input feature to improve the performance of a single CNN keyword-spotting model under different conditions. We design a new network architecture for metadata-aware end-to-end keyword spotting which learns to convert the categorical metadata to a fixed-length embedding, and then uses the embedding to: 1) modulate convolutional feature maps via conditional batch normalization, and 2) contribute to the fully connected layer via feature concatenation. The experiment shows that the proposed architecture is able to learn the meta-specific characteristics from combined datasets, and the best candidate achieves an average relative false reject rate (FRR) improvement of 14.63% at the same false accept rate (FAR) compared with a CNN that does not use device state metadata.
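The two uses of the metadata embedding described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation. The abstract does not say how the scale and shift are derived from the embedding, so the linear projections (`w_gamma`, `w_beta`), the `1 + delta` form of the scale, the mean pooling before concatenation, and all names and sizes here are assumptions made for the example.

```python
import math
import random


def linear(vec, weights, bias):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def conditional_batch_norm(batch, embeddings, w_gamma, w_beta, eps=1e-5):
    """Normalize each channel over the batch, then scale and shift it with
    per-sample gamma and beta predicted from the metadata embedding.

    batch:      [N][C][T] feature maps (N samples, C channels, T positions)
    embeddings: [N][D] metadata embeddings, one per sample
    w_gamma, w_beta: [C][D] projection weights (biases fixed to zero here)
    """
    n, c = len(batch), len(batch[0])
    zeros = [0.0] * c
    # Assumed parameterization: gamma = 1 + projection, beta = projection.
    gammas = [[1.0 + g for g in linear(e, w_gamma, zeros)] for e in embeddings]
    betas = [linear(e, w_beta, zeros) for e in embeddings]
    out = [[None] * c for _ in range(n)]
    for ch in range(c):
        # Batch statistics are shared by all samples in the batch.
        values = [v for sample in batch for v in sample[ch]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][ch] = [gammas[i][ch] * (v - mean) * inv_std + betas[i][ch]
                          for v in batch[i][ch]]
    return out


def concat_features(feature_maps, embedding):
    """Mean-pool each channel (an assumption) and append the embedding,
    giving the input vector for a fully connected layer."""
    pooled = [sum(ch) / len(ch) for ch in feature_maps]
    return pooled + list(embedding)


# Hypothetical sizes: 3 device states, 4-dim embedding, 2 channels, 5 frames.
rng = random.Random(0)
NUM_STATES, EMB_DIM, CHANNELS, FRAMES = 3, 4, 2, 5

# Embedding table: one learned vector per categorical metadata value.
table = [[rng.gauss(0.0, 0.1) for _ in range(EMB_DIM)]
         for _ in range(NUM_STATES)]
w_gamma = [[rng.gauss(0.0, 0.1) for _ in range(EMB_DIM)]
           for _ in range(CHANNELS)]
w_beta = [[rng.gauss(0.0, 0.1) for _ in range(EMB_DIM)]
          for _ in range(CHANNELS)]

# Two utterances recorded under different device states (ids 0 and 2).
batch = [[[rng.gauss(0.0, 1.0) for _ in range(FRAMES)]
          for _ in range(CHANNELS)] for _ in range(2)]
metadata_ids = [0, 2]
embeddings = [table[i] for i in metadata_ids]

modulated = conditional_batch_norm(batch, embeddings, w_gamma, w_beta)
fc_input = [concat_features(m, e) for m, e in zip(modulated, embeddings)]
```

In this sketch the same convolutional features are rescaled differently depending on the device state, while the concatenation step gives the fully connected layer direct access to the embedding. With the projection weights set to zero the layer reduces to ordinary batch normalization.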


DOI: 10.21437/Interspeech.2020-1262

Cite as: Liu, H., Abhyankar, A., Mishchenko, Y., Sénéchal, T., Fu, G., Kulis, B., Stein, N.D., Shah, A., Vitaladevuni, S.N.P. (2020) Metadata-Aware End-to-End Keyword Spotting. Proc. Interspeech 2020, 2282-2286, DOI: 10.21437/Interspeech.2020-1262.


@inproceedings{Liu2020,
  author={Hongyi Liu and Apurva Abhyankar and Yuriy Mishchenko and Thibaud Sénéchal and Gengshen Fu and Brian Kulis and Noah D. Stein and Anish Shah and Shiv Naga Prasad Vitaladevuni},
  title={{Metadata-Aware End-to-End Keyword Spotting}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2282--2286},
  doi={10.21437/Interspeech.2020-1262},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1262}
}