A New Framework for Supervised Speech Enhancement in the Time Domain

Ashutosh Pandey, DeLiang Wang

This work proposes a new learning framework that uses a frequency-domain loss function to train a convolutional neural network (CNN) in the time domain. At training time, an extra operation is appended to the speech enhancement network to convert the estimated time-domain signal to the frequency domain. This operation is differentiable, so the system can be trained with a loss defined in the frequency domain. The proposed approach replaces learning in the frequency domain, i.e., short-time Fourier transform (STFT) magnitude estimation, with learning in the original time domain. It can be viewed as a spectral mapping approach in which the CNN first generates a time-domain signal and then computes its STFT, which is used for spectral mapping. This way, the CNN can exploit additional domain knowledge about computing the STFT magnitude from a time-domain signal. Experimental results demonstrate that the proposed method substantially outperforms other speech enhancement methods. The approach is easy to implement and applicable to related speech processing tasks that require spectral mapping or time-frequency (T-F) masking.
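The core idea — computing a loss on STFT magnitudes of a time-domain estimate — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the frame length, hop size, and Hann window are illustrative assumptions, and in actual training the STFT would be built from a framework's differentiable operations so that gradients flow back through it to the time-domain CNN.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Frame the time-domain signal, apply a Hann window, and take the
    magnitude of the FFT of each frame. In the proposed framework this
    operation sits after the enhancement network; because it is
    differentiable, a loss on its output can train the network."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def frequency_domain_loss(estimate, target, frame_len=512, hop=256):
    """Mean squared error between the STFT magnitudes of two
    time-domain signals (the frequency-domain training loss)."""
    est_mag = stft_magnitude(estimate, frame_len, hop)
    tgt_mag = stft_magnitude(target, frame_len, hop)
    return np.mean((est_mag - tgt_mag) ** 2)

# Toy check: identical signals yield zero loss; a noisy estimate does not.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
noisy = clean + 0.1 * rng.standard_normal(4096)
print(frequency_domain_loss(clean, clean))      # 0.0
print(frequency_domain_loss(noisy, clean) > 0)  # True
```

Minimizing such a loss with respect to the network's time-domain output is what lets the system learn spectral mapping while still producing a waveform directly.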

DOI: 10.21437/Interspeech.2018-1223

Cite as: Pandey, A., Wang, D. (2018) A New Framework for Supervised Speech Enhancement in the Time Domain. Proc. Interspeech 2018, 1136-1140, DOI: 10.21437/Interspeech.2018-1223.

@inproceedings{pandey18_interspeech,
  author={Ashutosh Pandey and DeLiang Wang},
  title={A New Framework for Supervised Speech Enhancement in the Time Domain},
  booktitle={Proc. Interspeech 2018},
  year={2018},
  pages={1136--1140},
  doi={10.21437/Interspeech.2018-1223}
}