Robust Multichannel Gender Classification from Speech in Movie Audio

Naveen Kumar, Md. Nasir, Panayiotis Georgiou, Shrikanth S. Narayanan

Speech in the form of scripted dialogues forms an important part of the audio signal in movies. However, it is often masked by background audio such as music, ambient noise, or background chatter. These background sounds make even otherwise simple tasks, such as gender classification, challenging. Additionally, the variability of this noise across movies renders standard approaches to source separation or enhancement inadequate. Instead, we exploit multichannel information present in the different language channels (English, Spanish, French) of each movie to improve the robustness of our gender classification system. We exploit the fact that the speaker labels of interest co-occur in each language channel. We fuse the predictions obtained for each channel using Recognizer Output Voting Error Reduction (ROVER) and show that this approach improves gender classification accuracy by 7% absolute (11% relative) compared to the best independent prediction on any single channel. For surround-sound movies, we further investigate fusion of the mono audio and front-center channels, which yields 5% and 3% absolute (8% and 4% relative) increases in accuracy compared to using only the mono or front-center channel, respectively.
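The fusion idea is that each language channel yields an independent gender prediction for the same aligned speech segments, and the per-channel outputs are combined by voting. For binary labels this reduces to (optionally weighted) majority voting per segment; the sketch below is a minimal illustration of that voting step, not the paper's ROVER implementation, and the function name, label scheme, and uniform weights are illustrative assumptions.

```python
from collections import Counter

def fuse_channel_predictions(channel_preds, weights=None):
    """Fuse per-segment gender predictions from multiple language channels.

    channel_preds: list of lists, one per channel, each holding 'M'/'F'
    labels for the same sequence of time-aligned segments.
    weights: optional per-channel confidence weights (e.g. dev-set
    accuracies); defaults to uniform, i.e. plain majority voting.
    """
    if weights is None:
        weights = [1.0] * len(channel_preds)
    fused = []
    for segment_votes in zip(*channel_preds):
        tally = Counter()
        for label, w in zip(segment_votes, weights):
            tally[label] += w
        fused.append(tally.most_common(1)[0][0])
    return fused

# Hypothetical example: three language channels voting on four segments
english = ['M', 'F', 'F', 'M']
spanish = ['M', 'F', 'M', 'M']
french  = ['F', 'F', 'F', 'M']
print(fuse_channel_predictions([english, spanish, french]))
# → ['M', 'F', 'F', 'M']
```

Because the background music and effects differ across language dubs while the dialogue content and speaker identities co-occur, disagreements caused by channel-specific noise tend to be outvoted.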

DOI: 10.21437/Interspeech.2016-540

Cite as

Kumar, N., Nasir, M., Georgiou, P., Narayanan, S.S. (2016) Robust Multichannel Gender Classification from Speech in Movie Audio. Proc. Interspeech 2016, 2233-2237.

@inproceedings{kumar2016robust,
  author={Naveen Kumar and Md. Nasir and Panayiotis Georgiou and Shrikanth S. Narayanan},
  title={Robust Multichannel Gender Classification from Speech in Movie Audio},
  booktitle={Interspeech 2016},
  year={2016},
  pages={2233--2237},
  doi={10.21437/Interspeech.2016-540}
}