Improved Speech Enhancement Using TCN with Multiple Encoder-Decoder Layers

Vinith Kishore, Nitya Tiwari, Periyasamy Paramasivam

A deep-learning-based, time-domain, single-channel speech enhancement technique using a multilayer encoder-decoder and a temporal convolutional network (TCN) is proposed for applications such as smart speakers and voice assistants. The technique uses an encoder-decoder with convolutional layers to obtain a representation suitable for speech enhancement, and a TCN-based separator between the encoder and decoder to learn long-range dependencies. It is inspired by speech separation techniques that place a TCN-based separator between a single-layer encoder and decoder. We propose using a multilayer encoder-decoder to obtain a noise-independent representation useful for separating clean speech from noise. We present a t-SNE-based analysis of the representations learned by different architectures, which we use to select the optimal number of encoder-decoder layers. We evaluate the proposed architectures using an objective measure of speech quality, the scale-invariant source-to-noise ratio, and the word error rate (WER) obtained on a speech recognition platform. The proposed two-layer encoder-decoder architecture yielded a 48% improvement in WER over unprocessed noisy data and improvements of 33% and 44% in WER over two baselines.
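As an illustration of one of the evaluation measures named in the abstract, the following is a minimal sketch of the scale-invariant source-to-noise ratio as it is commonly defined in the time-domain separation literature: the estimate is projected onto the clean target, and the energy of that projection is compared with the energy of the residual. The abstract does not give the authors' exact formulation, so the mean removal, the `eps` stabilizer, and the function name `si_snr` are assumptions made for this example, not details taken from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant source-to-noise ratio in dB (commonly used definition).

    `estimate` and `target` are equal-length sequences of float samples.
    The zero-mean step and `eps` are assumptions of this sketch.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)

    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target to get the scaled target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]

    # Whatever the target does not explain is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]

    target_energy = sum(s * s for s in s_target)
    noise_energy = sum(v * v for v in e_noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


if __name__ == "__main__":
    # Toy signals only: a sinusoid as "clean speech" plus a deterministic disturbance.
    clean = [math.sin(0.1 * i) for i in range(1000)]
    noise = [0.1 * math.cos(1.7 * i) for i in range(1000)]
    noisy = [c + v for c, v in zip(clean, noise)]
    print("SI-SNR of noisy input (dB):", si_snr(noisy, clean))
    print("SI-SNR after doubling the gain (dB):", si_snr([2.0 * x for x in noisy], clean))
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the measure is called scale-invariant.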

DOI: 10.21437/Interspeech.2020-3122

Cite as: Kishore, V., Tiwari, N., Paramasivam, P. (2020) Improved Speech Enhancement Using TCN with Multiple Encoder-Decoder Layers. Proc. Interspeech 2020, 4531-4535, DOI: 10.21437/Interspeech.2020-3122.

  author={Vinith Kishore and Nitya Tiwari and Periyasamy Paramasivam},
  title={{Improved Speech Enhancement Using TCN with Multiple Encoder-Decoder Layers}},
  booktitle={Proc. Interspeech 2020},