Self-Supervised Pre-Training with Acoustic Configurations for Replay Spoofing Detection

Hye-jin Shim, Hee-Soo Heo, Jee-weon Jung, Ha-Jin Yu


Constructing a dataset for replay spoofing detection requires the physical process of playing back an utterance and re-recording it, which makes collecting large-scale datasets challenging. In this study, we propose a self-supervised framework for pre-training on acoustic configurations using datasets published for other tasks, such as speaker verification. Here, acoustic configurations refer to the environmental factors introduced during voice recording, rather than the voice itself, including microphone type, recording location, and ambient noise level. Specifically, we select pairs of segments from utterances and train deep neural networks to determine whether the acoustic configurations of the two segments are identical. We validate the effectiveness of the proposed method on the ASVspoof 2019 physical access dataset using two well-performing systems. The experimental results demonstrate that the proposed method outperforms the baseline approach by 30%.
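The core of the self-supervised task described above is pair selection: two segments cut from the same utterance share one acoustic configuration (positive pair), while segments from two different utterances are assumed to differ (negative pair). A minimal sketch of this sampling step, assuming raw waveforms as NumPy arrays (the function names and segment length are illustrative, not the paper's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def crop(utt, segment_len):
    """Cut a random fixed-length segment from one utterance."""
    start = rng.integers(0, len(utt) - segment_len + 1)
    return utt[start:start + segment_len]

def sample_pair(utterances, segment_len, positive):
    """Sample a segment pair with a binary same-configuration label.

    positive=True : both segments come from the same utterance, so they
                    share one acoustic configuration (label 1).
    positive=False: segments come from two distinct utterances, which
                    are assumed to differ in configuration (label 0).
    """
    if positive:
        utt = utterances[rng.integers(len(utterances))]
        a, b = crop(utt, segment_len), crop(utt, segment_len)
    else:
        i, j = rng.choice(len(utterances), size=2, replace=False)
        a, b = crop(utterances[i], segment_len), crop(utterances[j], segment_len)
    return a, b, int(positive)
```

A network pre-trained on these (segment, segment, label) triples learns channel and environment cues rather than speech content, which is what replay detection needs at fine-tuning time.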


DOI: 10.21437/Interspeech.2020-1345

Cite as: Shim, H., Heo, H., Jung, J., Yu, H. (2020) Self-Supervised Pre-Training with Acoustic Configurations for Replay Spoofing Detection. Proc. Interspeech 2020, 1091-1095, DOI: 10.21437/Interspeech.2020-1345.


@inproceedings{Shim2020,
  author={Hye-jin Shim and Hee-Soo Heo and Jee-weon Jung and Ha-Jin Yu},
  title={{Self-Supervised Pre-Training with Acoustic Configurations for Replay Spoofing Detection}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1091--1095},
  doi={10.21437/Interspeech.2020-1345},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1345}
}