Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement

Haoyu Li, Junichi Yamagishi


In recent years, speech enhancement (SE) has made impressive progress thanks to deep neural networks (DNNs). However, DNN-based approaches often fail to generalize to unseen environmental noise that was not included in the training data. To address this problem, we propose “noise tokens” (NTs), a set of neural noise templates that are jointly trained with the SE system. NTs dynamically capture environment variability and thus enable the DNN model to handle various environments and produce higher-quality STFT magnitudes. Experimental results show that using NTs is an effective strategy that consistently improves the generalization ability of SE systems across different DNN architectures. Furthermore, we investigate applying a state-of-the-art neural vocoder to generate the waveform instead of the traditional inverse STFT (ISTFT). Subjective listening tests show that the residual noise can be significantly suppressed through mel-spectrogram correction and vocoder-based waveform synthesis.
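
The abstract does not spell out how the noise templates are combined into an environment representation. As a rough illustration only, the sketch below assumes one plausible reading: a bank of template vectors is mixed by scaled dot-product attention against a query vector summarizing the noisy input. The names `noise_embedding`, `NUM_TOKENS`, and `DIM`, the random values, and the attention form itself are all assumptions made for this example, not details taken from the paper. In a trained system the templates would be learned parameters and the query would come from an encoder.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def noise_embedding(query, tokens):
    """Mix template vectors with scaled dot-product attention weights.

    query  : list of floats summarizing the noisy input (hypothetical).
    tokens : list of template vectors, each the same length as query.
    Returns (weights, embedding), where embedding is the weighted sum.
    """
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
        for tok in tokens
    ]
    weights = softmax(scores)
    embedding = [
        sum(w * tok[i] for w, tok in zip(weights, tokens))
        for i in range(dim)
    ]
    return weights, embedding


# Hypothetical sizes; random values stand in for trained parameters.
random.seed(0)
NUM_TOKENS, DIM = 4, 8
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
query = [random.gauss(0, 1) for _ in range(DIM)]

weights, embedding = noise_embedding(query, tokens)
```

Under this assumption, the resulting `embedding` would be passed to the enhancement network as a conditioning input, so that an unseen noise type is described as a mixture of the trained templates.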


DOI: 10.21437/Interspeech.2020-1030

Cite as: Li, H., Yamagishi, J. (2020) Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement. Proc. Interspeech 2020, 2452-2456, DOI: 10.21437/Interspeech.2020-1030.


@inproceedings{Li2020,
  author={Haoyu Li and Junichi Yamagishi},
  title={{Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2452--2456},
  doi={10.21437/Interspeech.2020-1030},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1030}
}