State Sequence Pooling Training of Acoustic Models for Keyword Spotting

Kuba Łopatka, Tobias Bocklet


We propose a new training method to improve HMM-based keyword spotting. The loss function is based on a score computed with the keyword/filler model from the entire input sequence. It is equivalent to max/attention pooling but is based on prior acoustic knowledge. We also employ a multi-task learning setup by predicting both LVCSR and keyword posteriors. We compare our model to a baseline trained on frame-wise cross entropy, with and without per-class weighting. We employ a low-footprint TDNN for acoustic modeling. The proposed training yields a significant and consistent improvement over the baseline in adverse noise conditions. The false rejection rate (FRR) in cafeteria noise is reduced from 13.07% to 5.28% at 9 dB SNR and from 37.44% to 6.78% at 5 dB SNR. We obtain these results with only 600 unique training keyword samples. The training method is independent of the frontend and acoustic model topology.
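
To make the idea of a sequence-level keyword/filler score more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a left-to-right keyword HMM with uniform transitions, a single filler state, and frame-wise log posteriors as input. The function name `pooled_keyword_score` and its arguments are hypothetical. The score at each frame is the best keyword path ending in the last keyword state minus the filler-only path, and the maximum over frames plays the role of the pooling step.

```python
import numpy as np

def pooled_keyword_score(log_kw, log_filler):
    """Sequence-level keyword-vs-filler score (illustrative assumption only).

    log_kw:     (T, S) array of log posteriors for S keyword states.
    log_filler: (T,) array of log posteriors for a single filler state.
    Returns (best_score, best_frame), where best_score is the maximum over
    frames of [best keyword path ending in the last state] - [filler path].
    """
    T, S = log_kw.shape
    delta = np.full(S, -np.inf)   # best path score ending in each keyword state
    filler_cum = 0.0              # score of the filler-only path so far
    scores = np.empty(T)
    for t in range(T):
        # State 0 may be entered from the filler path; state s from state s-1.
        advance = np.concatenate(([filler_cum], delta[:-1]))
        # Each state either stays in place or is entered from its predecessor.
        delta = log_kw[t] + np.maximum(delta, advance)
        filler_cum += log_filler[t]
        # Log-ratio between the keyword hypothesis and the filler hypothesis.
        scores[t] = delta[-1] - filler_cum
    best_frame = int(np.argmax(scores))   # max pooling over time
    return float(scores[best_frame]), best_frame
```

In a training setup along these lines, a loss would be computed from the pooled score of the whole utterance rather than from each frame independently. How that score is turned into a loss is not specified in this abstract, so it is left out of the sketch.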


DOI: 10.21437/Interspeech.2020-2722

Cite as: Łopatka, K., Bocklet, T. (2020) State Sequence Pooling Training of Acoustic Models for Keyword Spotting. Proc. Interspeech 2020, 4338-4342, DOI: 10.21437/Interspeech.2020-2722.


@inproceedings{Łopatka2020,
  author={Kuba Łopatka and Tobias Bocklet},
  title={{State Sequence Pooling Training of Acoustic Models for Keyword Spotting}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4338--4342},
  doi={10.21437/Interspeech.2020-2722},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2722}
}