2nd International Workshop on Speech, Language and Audio in Multimedia (SLAM2014)
Speaker recognition in mobile devices suffers from poor performance in noisy environments, necessitating the use of noise suppression algorithms. These typically apply time-frequency masks, optimised on the signal statistics, to the noisy signal spectrum, suppressing the noise components while preserving the speech. Studies in the field of speech recognition demonstrate that ideal time-frequency masks (i.e. masks generated based on ideal knowledge of the speech and noise spectra) improve the recognition rate even at very poor signal-to-noise ratios (SNRs). The effects of such masking on the performance of speaker recognition systems are studied here, to gain a better understanding of the pre-processing that is beneficial for automated speaker recognition. Two masking approaches are considered: the ideal binary mask and the ideal Wiener filter. We demonstrate that such ideal noise suppression significantly improves the recognition rate over the unprocessed system. As any noise suppression algorithm involves a trade-off between noise modulation and speech attenuation artefacts, the relative effect of these artefacts on speaker recognition performance is analysed next. We show that speech attenuation has a larger influence on performance than noise modulation at typical SNR values. Thus, we conclude, preserving speech even at the cost of lower noise suppression (and, consequently, larger noise modulation) is beneficial to speaker recognition. This conclusion is further validated.
Index Terms: Speaker recognition, ideal binary mask, ideal Wiener filter, noise suppression
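For readers unfamiliar with the two masks named in the abstract, the following is a minimal sketch of their common textbook definitions, not the authors' implementation. It assumes the clean-speech and noise power spectra are available separately (the "ideal knowledge" case). The function names, the `lc_db` threshold parameter and the toy values are illustrative assumptions that do not come from the paper.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Return 1 where the local SNR exceeds lc_db, otherwise 0.

    lc_db (local criterion, in dB) is an assumed parameter; 0 dB is a
    common default in the literature, not a value taken from the paper.
    """
    eps = np.finfo(float).tiny  # guards against division by zero and log(0)
    local_snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (local_snr_db > lc_db).astype(float)

def ideal_wiener_gain(speech_power, noise_power):
    """Return the soft gain S / (S + N) for each time-frequency bin."""
    eps = np.finfo(float).tiny
    return speech_power / (speech_power + noise_power + eps)

# Toy power spectra (rows: frequency bins, columns: frames); values are made up.
speech_power = np.array([[4.0, 0.1],
                         [1.0, 9.0]])
noise_power = np.ones_like(speech_power)

ibm = ideal_binary_mask(speech_power, noise_power)
iwf = ideal_wiener_gain(speech_power, noise_power)
```

Either mask would then be multiplied element-wise with the noisy short-time spectrum before resynthesis or feature extraction. The binary mask removes low-SNR bins entirely, whereas the Wiener gain attenuates them gradually, which is one way to picture the trade-off between speech attenuation and noise modulation discussed in the abstract.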
Bibliographic reference. Madhu, Nilesh / Jung, Sung Kyo (2014): "Speaker recognition performance under ideal-knowledge noise suppression: an investigation", In SLAM-2014, 48-52.