Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder

Yoshiaki Bando, Kouhei Sekiguchi, Kazuyoshi Yoshii


This paper presents a neural speech enhancement method with a statistical feedback mechanism based on a denoising variational autoencoder (VAE). Deep generative models of speech signals have been combined with unsupervised noise models to enhance speech robustly even when the test conditions are mismatched with the training data. This approach, however, often yields unnatural speech-like noise because the prior distribution on the latent speech representations is unsuitable. To mitigate this problem, we use a denoising VAE whose encoder estimates the latent vectors of clean speech from an input mixture signal. This encoder network serves as the prior distribution of the probabilistic generative model of the input mixture, and its condition mismatch is handled in a Bayesian manner. The speech signal is estimated by updating the latent vectors to fit the input mixture, while the noise is estimated by a nonnegative matrix factorization (NMF) model. To train the encoder network efficiently, we also propose multi-task learning of the denoising VAE together with standard mask-based enhancement. Experimental results show that our method outperforms existing mask-based and generative enhancement methods under unseen conditions.
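The inference loop the abstract describes — refining latent speech vectors toward the observed mixture while an NMF model absorbs the noise — can be sketched in NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: a fixed linear-exponential map `G` stands in for the trained VAE decoder, `mu` stands in for the denoising-VAE encoder's latent estimate, and the updates minimize a generalized KL divergence on the power spectrogram with a Gaussian prior tying the latents to the encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, D, K = 64, 50, 8, 4  # freq bins, frames, latent dim, NMF noise bases

# Stand-ins for the trained networks (assumptions, not the paper's models):
# G plays the role of the pre-trained VAE decoder (here a fixed linear map),
# mu plays the role of the denoising-VAE encoder's latent estimate.
G = 0.1 * rng.standard_normal((F, D))
mu = rng.standard_normal((D, T))

# Synthetic mixture power spectrogram (placeholder for |STFT|^2 of real audio)
X = np.abs(rng.standard_normal((F, T))) ** 2 + 1e-3

# Parameters adapted at test time
Z = mu.copy()                                    # latents start at the encoder prior
W = np.abs(rng.standard_normal((F, K))) + 1e-3   # noise spectral bases (NMF)
H = np.abs(rng.standard_normal((K, T))) + 1e-3   # noise activations (NMF)

lam, lr = 1.0, 1e-2  # prior weight and latent step size (hand-picked)

def model(Z, W, H):
    S = np.exp(G @ Z)   # speech power from the "decoder"
    N = W @ H           # noise power from NMF
    return S, N, S + N

def loss(X, V, Z):
    # generalized KL divergence + Gaussian prior tying Z to the encoder output
    return np.sum(X * np.log(X / V) - X + V) + 0.5 * lam * np.sum((Z - mu) ** 2)

losses = []
for _ in range(200):
    S, N, V = model(Z, W, H)
    losses.append(loss(X, V, Z))
    R = X / V
    # multiplicative NMF updates for the noise part
    W *= (R @ H.T) / (np.ones_like(X) @ H.T)
    H *= (W.T @ R) / (W.T @ np.ones_like(X))
    # gradient step refining the latents toward the observed mixture
    S, N, V = model(Z, W, H)
    grad_Z = G.T @ ((1.0 - X / V) * S) + lam * (Z - mu)
    Z -= lr * grad_Z

# Wiener-like mask from the final speech/noise decomposition
S, N, V = model(Z, W, H)
speech_est = (S / V) * X
```

In the paper the decoder is a neural network and the latents are updated with its gradients; the sketch keeps only the alternating structure (NMF multiplicative updates for noise, gradient refinement of the speech latents anchored to the encoder's estimate).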


DOI: 10.21437/Interspeech.2020-2291

Cite as: Bando, Y., Sekiguchi, K., Yoshii, K. (2020) Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder. Proc. Interspeech 2020, 2437-2441, DOI: 10.21437/Interspeech.2020-2291.


@inproceedings{Bando2020,
  author={Yoshiaki Bando and Kouhei Sekiguchi and Kazuyoshi Yoshii},
  title={{Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2437--2441},
  doi={10.21437/Interspeech.2020-2291},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2291}
}