RNN-BLSTM Based Multi-Pitch Estimation

Jianshu Zhang, Jian Tang, Li-Rong Dai

Multi-pitch estimation is critical in many applications, including computational auditory scene analysis (CASA), speech enhancement/separation and mixed speech analysis; however, despite much effort, it remains a challenging problem. This paper uses the PEFAC algorithm to extract features and proposes the use of recurrent neural networks with bidirectional Long Short-Term Memory (RNN-BLSTM) to model the two pitch contours of a mixture of two speech signals. Compared with feed-forward deep neural networks (DNN), which are trained on static frame-level acoustic features, RNN-BLSTM is trained on sequential frame-level features and has more power to learn pitch contour temporal dynamics. The results of evaluations using a speech dataset containing mixtures of two speech signals demonstrate that RNN-BLSTM can substantially outperform DNN in multi-pitch estimation of mixed speech signals.
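The contrast drawn in the abstract, between a frame-by-frame DNN and a bidirectional recurrent model that reads the whole feature sequence, can be illustrated with a minimal sketch. This is not the authors' implementation. The weights are random and untrained, biases are omitted, the input frames are random stand-ins for PEFAC-derived features, and the dimensions, function names, and the choice of two real-valued outputs per frame are all assumptions made for illustration. The abstract does not specify the network's output representation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm_params(input_dim, hidden_dim, rng):
    """Random weights for one LSTM direction (untrained, no biases; illustration only)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    # One weight matrix per gate, acting on [input ; previous hidden state].
    return {gate: matrix(hidden_dim, input_dim + hidden_dim)
            for gate in ("input", "forget", "output", "cell")}

def lstm_pass(frames, params, hidden_dim):
    """Run one LSTM direction over the sequence; return one hidden state per frame."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    states = []
    for x in frames:
        z = list(x) + h
        pre = {gate: [sum(w * v for w, v in zip(row, z)) for row in rows]
               for gate, rows in params.items()}
        i_gate = [sigmoid(a) for a in pre["input"]]
        f_gate = [sigmoid(a) for a in pre["forget"]]
        o_gate = [sigmoid(a) for a in pre["output"]]
        cand = [math.tanh(a) for a in pre["cell"]]
        c = [f * c_prev + i * g for f, c_prev, i, g in zip(f_gate, c, i_gate, cand)]
        h = [o * math.tanh(c_new) for o, c_new in zip(o_gate, c)]
        states.append(h)
    return states

def blstm_two_pitch(frames, fwd, bwd, out_w, hidden_dim):
    """Combine forward and backward passes; emit two values per frame (one per speaker)."""
    forward = lstm_pass(frames, fwd, hidden_dim)
    # The backward direction reads the sequence in reverse, then is re-aligned.
    backward = lstm_pass(frames[::-1], bwd, hidden_dim)[::-1]
    outputs = []
    for hf, hb in zip(forward, backward):
        joint = hf + hb  # concatenate the two directions
        outputs.append([sum(w * v for w, v in zip(row, joint)) for row in out_w])
    return outputs

rng = random.Random(0)
INPUT_DIM, HIDDEN_DIM, NUM_FRAMES = 4, 3, 6  # hypothetical sizes
fwd = make_lstm_params(INPUT_DIM, HIDDEN_DIM, rng)
bwd = make_lstm_params(INPUT_DIM, HIDDEN_DIM, rng)
out_w = [[rng.uniform(-0.5, 0.5) for _ in range(2 * HIDDEN_DIM)] for _ in range(2)]
frames = [[rng.uniform(-1, 1) for _ in range(INPUT_DIM)] for _ in range(NUM_FRAMES)]

tracks = blstm_two_pitch(frames, fwd, bwd, out_w, HIDDEN_DIM)
```

Because the backward pass carries information from later frames, changing a later frame also changes the output at earlier frames. A model that scores each frame independently cannot exploit this kind of temporal context.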

DOI: 10.21437/Interspeech.2016-117

Cite as

Zhang, J., Tang, J., Dai, L.-R. (2016) RNN-BLSTM Based Multi-Pitch Estimation. Proc. Interspeech 2016, 1785-1789.

author={Jianshu Zhang and Jian Tang and Li-Rong Dai},
title={RNN-BLSTM Based Multi-Pitch Estimation},
booktitle={Interspeech 2016},