Learning Better Speech Representations by Worsening Interference

Jun Wang

Can better representations be learnt from worse interfering scenarios? To test this seeming paradox, we propose a novel framework that performs compositional learning across the traditionally independent tasks of speech separation and speaker identification. In this framework, generic pre-training and compositional fine-tuning are proposed to mimic the bottom-up and top-down processes of the human cocktail party effect. Moreover, we investigate schemes to prevent the model from settling for an easier identity-prediction task. Substantially discriminative and generalizable representations can be learnt even under severely interfering conditions. Experimental results on downstream tasks show that our learnt representations have greater discriminative power than those of a standard speaker verification method. Meanwhile, RISE consistently achieves higher SI-SNRi than DPRNN, a state-of-the-art speech separation system, across different inference modes.
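For readers unfamiliar with the separation metric named above, the following is a minimal sketch of SI-SNRi (scale-invariant signal-to-noise ratio improvement) under its commonly used definition. It is not taken from the paper: the function names `si_snr` and `si_snr_improvement`, the `eps` stabiliser, and the toy sinusoid signals are illustrative assumptions.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB (common definition; an assumption here)."""
    n = len(reference)
    if len(estimate) != n:
        raise ValueError("estimate and reference must have the same length")
    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [scale * b for b in r]
    # Everything not explained by the reference counts as noise.
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy example (hypothetical signals): two orthogonal sinusoids.
N = 200
reference = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
interference = [math.sin(2 * math.pi * 13 * t / N) for t in range(N)]
mixture = [s + v for s, v in zip(reference, interference)]
# Pretend a separator attenuated the interference to a tenth of its amplitude.
estimate = [s + 0.1 * v for s, v in zip(reference, interference)]

improvement = si_snr_improvement(estimate, mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.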

DOI: 10.21437/Interspeech.2020-1545

Cite as: Wang, J. (2020) Learning Better Speech Representations by Worsening Interference. Proc. Interspeech 2020, 2632-2636, DOI: 10.21437/Interspeech.2020-1545.

  author={Jun Wang},
  title={{Learning Better Speech Representations by Worsening Interference}},
  booktitle={Proc. Interspeech 2020},