Multi-Scale TCN: Exploring Better Temporal DNN Model for Causal Speech Enhancement

Lu Zhang, Mingjiang Wang


Capturing the temporal dependence of speech signals is of great importance for numerous speech-related tasks. This paper proposes a more effective temporal modeling method for causal speech enhancement systems. We design a forward stacked temporal convolutional network (TCN) model that exploits multi-scale temporal analysis in each residual block. The model incorporates multi-scale dilated convolutions to better track the target speech through contextual information from past frames. Multi-target learning of the log power spectrum (LPS) and the ideal ratio mask (IRM) further improves model robustness, owing to the complementarity between the two tasks. Experimental results show that the proposed TCN model not only achieves better speech reconstruction in terms of speech quality and intelligibility, but also has a smaller model size than the long short-term memory (LSTM) and gated recurrent unit (GRU) networks.
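The paper does not publish an implementation; below is a minimal PyTorch sketch of what a causal residual block with parallel multi-scale dilated convolutions might look like. The class name, channel count, kernel size, and dilation rates are illustrative assumptions, not the authors' exact configuration. The key ideas from the abstract are preserved: each branch applies a dilated convolution with left-only padding so the block uses past frames exclusively (causal), and a 1x1 convolution fuses the multi-scale features before the residual connection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTCNBlock(nn.Module):
    """Hypothetical causal residual block with parallel dilated branches
    (a sketch; hyperparameters are assumptions, not the paper's values)."""

    def __init__(self, channels=64, kernel_size=3, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=d)
            for d in dilations
        )
        # left-pad by (kernel_size - 1) * dilation so each branch sees
        # only the current and past frames (causal convolution)
        self.pads = [(kernel_size - 1) * d for d in dilations]
        # 1x1 convolution fuses the concatenated multi-scale features
        self.fuse = nn.Conv1d(channels * len(dilations), channels, 1)
        self.act = nn.PReLU()

    def forward(self, x):
        # x: (batch, channels, time)
        outs = [conv(F.pad(x, (pad, 0)))          # pad on the left only
                for pad, conv in zip(self.pads, self.branches)]
        return x + self.act(self.fuse(torch.cat(outs, dim=1)))
```

Because every branch pads only on the left, the output at frame t never depends on frames later than t, which is what makes the block usable in a causal (real-time) enhancement system.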


 DOI: 10.21437/Interspeech.2020-1104

Cite as: Zhang, L., Wang, M. (2020) Multi-Scale TCN: Exploring Better Temporal DNN Model for Causal Speech Enhancement. Proc. Interspeech 2020, 2672-2676, DOI: 10.21437/Interspeech.2020-1104.


@inproceedings{Zhang2020,
  author={Lu Zhang and Mingjiang Wang},
  title={{Multi-Scale TCN: Exploring Better Temporal DNN Model for Causal Speech Enhancement}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2672--2676},
  doi={10.21437/Interspeech.2020-1104},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1104}
}