Sparse Mixture of Local Experts for Efficient Speech Enhancement

Aswin Sivaraman, Minje Kim


This work proposes a novel approach for reducing the computational complexity of speech denoising neural networks by using a sparsely active ensemble topology. In our ensemble networks, a gating module classifies an input noisy speech signal either by identifying speaker gender or by estimating signal degradation, and exclusively assigns it to a best-case specialist module optimized to denoise a particular subset of the training data. This approach extends the hypothesis that speech denoising can be simplified if it is split into non-overlapping subproblems, contrasting earlier approaches that train large generalist neural networks to address a wide range of noisy speech data. We compare a baseline recurrent network against an ensemble of similarly designed, but smaller, networks. Each network module is trained independently and then combined to form a naïve ensemble, which can be further fine-tuned using a sparsity parameter to improve performance. Our experiments on noisy speech data — generated by mixing the LibriSpeech and MUSAN datasets — demonstrate that a fine-tuned sparsely active ensemble can outperform a generalist while using significantly fewer calculations. The key insight of this paper, leveraging model selection as a form of network compression, may be used to supplement existing deep learning methods for speech denoising.
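The routing idea described above can be sketched in a few lines. The snippet below is a toy illustration, not the authors' architecture: the experts are stand-in linear maps (the paper uses small recurrent networks), the feature dimension and expert count are arbitrary, and the gating network is an untrained linear scorer. What it does show is the key efficiency mechanism: with hard top-1 gating, only the single selected expert is ever evaluated per input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
N_EXPERTS = 4   # e.g., one specialist per speaker-gender / degradation class
DIM = 8         # toy feature dimension for one input frame

# Toy "experts": each is a stand-in linear denoiser. In the paper, each
# expert is a small recurrent network trained on one subset of the data.
experts = [rng.standard_normal((DIM, DIM)) * 0.1 + np.eye(DIM)
           for _ in range(N_EXPERTS)]

# Toy gating module: scores the input per expert. Hard (top-1) selection
# makes the ensemble sparsely active, so inference cost is roughly that
# of a single small specialist rather than the whole ensemble.
gate_w = rng.standard_normal((DIM, N_EXPERTS))

def denoise(x):
    scores = x @ gate_w               # gating logits, shape (N_EXPERTS,)
    k = int(np.argmax(scores))        # exclusive (sparse) expert assignment
    return experts[k] @ x, k          # only expert k is evaluated

x = rng.standard_normal(DIM)          # stand-in for a noisy feature frame
y, chosen = denoise(x)
print("routed to expert", chosen, "output shape", y.shape)
```

Replacing the hard `argmax` with a softmax over `scores` would recover a dense mixture-of-experts, which runs every expert; the sparsity is what yields the compute savings the abstract reports.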


DOI: 10.21437/Interspeech.2020-2989

Cite as: Sivaraman, A., Kim, M. (2020) Sparse Mixture of Local Experts for Efficient Speech Enhancement. Proc. Interspeech 2020, 4526-4530, DOI: 10.21437/Interspeech.2020-2989.


@inproceedings{Sivaraman2020,
  author={Aswin Sivaraman and Minje Kim},
  title={{Sparse Mixture of Local Experts for Efficient Speech Enhancement}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={4526--4530},
  doi={10.21437/Interspeech.2020-2989},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2989}
}