Singing Voice Extraction with Attention-Based Spectrograms Fusion

Hao Shi, Longbiao Wang, Sheng Li, Chenchen Ding, Meng Ge, Nan Li, Jianwu Dang, Hiroshi Seki


We propose a novel attention-based spectrograms fusion system with estimation of minimum difference masks (MDMs) for singing voice extraction. Compared with previous work that uses a fully connected neural network, our system takes advantage of the multi-head attention mechanism. Specifically, we 1) explore several ways of embedding multiple spectrograms as input to the attention mechanism, which provides multi-scale correlation information between adjacent frames of the spectrograms; 2) add a regularization term to the loss function to improve the temporal continuity of the fused spectrogram; and 3) reconstruct the final waveform with the phase of the linearly fused waveform, which reduces the impact of inconsistent spectrograms. Experiments on the MIR-1K dataset show that our system consistently improves quantitative results in terms of perceptual evaluation of speech quality (PESQ), signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR).
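As a rough illustration of the three components above, the following is a minimal PyTorch sketch. All module names, shapes, the context width, the mean-query fusion strategy, and the hyperparameters (n_freq, n_heads, lam, n_fft, hop) are illustrative assumptions, not the authors' published architecture.

import torch
import torch.nn as nn

class AttentionSpectrogramFusion(nn.Module):
    """Fuse K candidate magnitude spectrograms with multi-head attention."""

    def __init__(self, n_freq=513, n_heads=5, context=2):
        super().__init__()
        self.context = context  # neighbouring frames on each side
        d_model = n_freq * (2 * context + 1)  # 513 * 5 = 2565, divisible by 5 heads
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, n_freq)

    def embed(self, spec):
        # spec: (batch, time, freq). Stack each frame with its +-context
        # neighbours so attention sees correlation between adjacent frames
        # (wrap-around at the edges is kept for brevity).
        shifted = [torch.roll(spec, shifts=k, dims=1)
                   for k in range(-self.context, self.context + 1)]
        return torch.cat(shifted, dim=-1)  # (batch, time, freq * (2c + 1))

    def forward(self, specs):
        # specs: list of K candidate spectrograms, each (batch, time, freq).
        # Query with the frame-wise mean of the embeddings; attend over the
        # K candidates concatenated along time (one plausible choice of many).
        emb = [self.embed(s) for s in specs]
        query = torch.stack(emb, dim=0).mean(dim=0)
        keys = torch.cat(emb, dim=1)
        fused, _ = self.attn(query, keys, keys)
        return self.proj(fused)  # fused magnitude estimate (batch, time, freq)

def continuity_regularized_loss(est, target, lam=0.1):
    # MSE plus a penalty on frame-to-frame differences: a simple regularization
    # term encouraging temporal continuity of the fused spectrogram.
    mse = torch.mean((est - target) ** 2)
    continuity = torch.mean((est[:, 1:] - est[:, :-1]) ** 2)
    return mse + lam * continuity

def reconstruct(fused_mag, linear_fusion_wave, n_fft=1024, hop=256):
    # Pair the fused magnitude with the phase of the linearly fused waveform
    # before inverting, which avoids synthesizing from an inconsistent
    # (magnitude, phase) spectrogram pair.
    window = torch.hann_window(n_fft)
    spec = torch.stft(linear_fusion_wave, n_fft, hop,
                      window=window, return_complex=True)
    phase = torch.angle(spec)  # (batch, freq, frames)
    # Assumes fused_mag has a matching frame count: (batch, time, freq).
    complex_spec = torch.polar(fused_mag.transpose(1, 2), phase)
    return torch.istft(complex_spec, n_fft, hop, window=window)

Note that querying with the mean of the embedded candidates is only one option; point 1) of the abstract indicates the authors compared several embedding schemes, so the embedding step above should be read as a placeholder for that design space.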


DOI: 10.21437/Interspeech.2020-1043

Cite as: Shi, H., Wang, L., Li, S., Ding, C., Ge, M., Li, N., Dang, J., Seki, H. (2020) Singing Voice Extraction with Attention-Based Spectrograms Fusion. Proc. Interspeech 2020, 2412-2416, DOI: 10.21437/Interspeech.2020-1043.


@inproceedings{Shi2020,
  author={Hao Shi and Longbiao Wang and Sheng Li and Chenchen Ding and Meng Ge and Nan Li and Jianwu Dang and Hiroshi Seki},
  title={{Singing Voice Extraction with Attention-Based Spectrograms Fusion}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2412--2416},
  doi={10.21437/Interspeech.2020-1043},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1043}
}