Investigation on Bandwidth Extension for Speaker Recognition

Phani Sankar Nidadavolu, Cheng-I Lai, Jesús Villalba, Najim Dehak

In this work, we investigate training speaker recognition systems on wideband (WB) features and compare their performance with narrowband (NB) baselines. NIST speaker recognition evaluations have largely driven speaker recognition research in recent years. Because of the target application of these evaluations, most of the data available to train speaker recognition systems is NB telephone speech, while WB data has remained too scarce to train factor analysis and PLDA models. Thus, the usual practice when dealing with WB speech consists of downsampling the signal to 8 kHz, which implies a potential loss of useful information. Instead, we experimented with upsampling the training telephone data and leaving the WB data unchanged. We adopt two techniques to upsample telephone data: (1) using a feed-forward neural network, termed Bandwidth Extension (BWE) network, to predict WB features given NB features as input; and (2) basic upsampling with a low-pass filter interpolator. While the former attempts to estimate the missing high-frequency information, the latter does not. The upsampled features are used to train state-of-the-art i-vector and recently proposed x-vector models. We evaluated the systems on the Speakers In The Wild (SITW) database, obtaining an 11.5% relative improvement in detection cost function (DCF) with the x-vector model.

DOI: 10.21437/Interspeech.2018-2394

Cite as: Nidadavolu, P.S., Lai, C., Villalba, J., Dehak, N. (2018) Investigation on Bandwidth Extension for Speaker Recognition. Proc. Interspeech 2018, 1111-1115, DOI: 10.21437/Interspeech.2018-2394.

@inproceedings{nidadavolu18_interspeech,
  author={Phani Sankar Nidadavolu and Cheng-I Lai and Jesús Villalba and Najim Dehak},
  title={Investigation on Bandwidth Extension for Speaker Recognition},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1111--1115},
  doi={10.21437/Interspeech.2018-2394}
}