Stacked 1D Convolutional Networks for End-to-End Small Footprint Voice Trigger Detection

Takuya Higuchi, Mohammad Ghasemzadeh, Kisun You, Chandra Dhir


We propose a stacked 1D convolutional neural network (S1DCNN) for end-to-end small-footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application that lets users activate their devices simply by saying a keyword or phrase. For privacy and latency reasons, a voice trigger detection system should run on an always-on, on-device processor, so small memory and compute costs are crucial. Recently, singular value decomposition filters (SVDFs) have been used for end-to-end voice trigger detection. An SVDF approximates a fully-connected layer with a low-rank approximation, which reduces the number of model parameters. In this work, we propose the S1DCNN as an alternative approach for end-to-end small-footprint voice trigger detection. An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. We show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieves a 19.0% relative false reject ratio (FRR) reduction with a similar model size and a similar time delay compared to the SVDF. By using longer time delays, the S1DCNN further improves the FRR by up to 12.2% relative.
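The abstract does not give the exact layer configuration, so the following is only a minimal sketch of one S1DCNN layer, written in PyTorch. All channel sizes, the kernel widths, and the ReLU nonlinearity are illustrative assumptions, not taken from the paper; the intent is just to show the described structure of a 1D convolution followed by a depth-wise 1D convolution over time.

import torch
import torch.nn as nn

class S1DCNNLayer(nn.Module):
    """One S1DCNN layer: a 1D convolution followed by a
    depth-wise 1D convolution, as described in the abstract."""

    def __init__(self, in_channels, out_channels, depth_kernel):
        super().__init__()
        # 1D convolution mixing input feature channels frame by frame
        # (kernel_size=1 is an assumption, not taken from the paper).
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size=1)
        # Depth-wise 1D convolution: groups == out_channels, so each
        # channel is filtered over time independently.
        self.depthwise = nn.Conv1d(out_channels, out_channels,
                                   kernel_size=depth_kernel,
                                   groups=out_channels)
        self.act = nn.ReLU()

    def forward(self, x):
        # x has shape (batch, in_channels, time).
        return self.act(self.depthwise(self.conv(x)))

# Illustrative usage: 40-dim acoustic features over 16 frames.
x = torch.randn(1, 40, 16)
layer = S1DCNNLayer(in_channels=40, out_channels=64, depth_kernel=8)
y = layer(x)  # shape (1, 64, 9): 16 - 8 + 1 valid time steps

In a streaming setting, a depth-wise kernel spanning T frames must wait for those frames to arrive, which is consistent with the trade-off the abstract reports between longer time delays and lower FRR.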


DOI: 10.21437/Interspeech.2020-2763

Cite as: Higuchi, T., Ghasemzadeh, M., You, K., Dhir, C. (2020) Stacked 1D Convolutional Networks for End-to-End Small Footprint Voice Trigger Detection. Proc. Interspeech 2020, 2592-2596, DOI: 10.21437/Interspeech.2020-2763.


@inproceedings{Higuchi2020,
  author={Takuya Higuchi and Mohammad Ghasemzadeh and Kisun You and Chandra Dhir},
  title={{Stacked 1D Convolutional Networks for End-to-End Small Footprint Voice Trigger Detection}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2592--2596},
  doi={10.21437/Interspeech.2020-2763},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2763}
}