Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir

We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass, which re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM) with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We add an auto-regressive decoder network on top of the self-attention layers and jointly minimize the CTC loss on the encoder and the cross-entropy loss on the decoder. This design yields further improvements over the baseline. We retrain all of the models above in a multi-task learning (MTL) setting, where one branch of a shared network is trained as an AM, while the second branch classifies whether the whole sequence is a true trigger or not. Results demonstrate that networks with self-attention layers yield a ~60% relative reduction in false reject rates for a given false-alarm rate, while requiring 10% fewer parameters. When trained in the MTL setup, self-attention networks yield further accuracy improvements. On-device measurements show a 70% relative reduction in inference time. Additionally, the proposed network architectures are ~5× faster to train.
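
As a rough illustration of the joint training described above, the objective can be written as a weighted sum of the two losses. This is a sketch in the convention commonly used for hybrid CTC/attention models. The interpolation weight \lambda and the symbols x (input audio features) and y (target label sequence) are assumptions for illustration and are not given in the abstract.

\[
\mathcal{L}_{\text{joint}}(x, y) = \lambda \, \mathcal{L}_{\text{CTC}}(x, y) + (1 - \lambda) \, \mathcal{L}_{\text{CE}}(x, y), \qquad 0 \le \lambda \le 1
\]

Here \mathcal{L}_{\text{CTC}} is computed on the outputs of the self-attention encoder and \mathcal{L}_{\text{CE}} on the outputs of the auto-regressive decoder. Under this assumed form, \lambda = 1 recovers an encoder trained with the CTC loss alone.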

DOI: 10.21437/Interspeech.2020-1330

Cite as: Adya, S., Garg, V., Sigtia, S., Simha, P., Dhir, C. (2020) Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering. Proc. Interspeech 2020, 3351-3355, DOI: 10.21437/Interspeech.2020-1330.

  author={Saurabh Adya and Vineet Garg and Siddharth Sigtia and Pramod Simha and Chandra Dhir},
  title={{Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering}},
  booktitle={Proc. Interspeech 2020},