13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Speech/Nonspeech Segmentation in Web Videos

Ananya Misra

Google, New York, NY, USA

Speech transcription of web videos requires first detecting the segments that contain transcribable speech. We refer to this step as segmentation. Commonly used segmentation techniques are inadequate for domains such as YouTube, where videos span a wide variety of background and recording conditions. In this work, we investigate alternative audio features and a discriminative classifier, which together yield a lower frame error rate on YouTube videos (25.3%) than the commonly used Gaussian mixture models trained on cepstral features (30.6%). The alternative audio features perform particularly well in noisy conditions.
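The abstract does not define frame error rate, so the following is only a minimal sketch under a common assumption: each audio frame carries a binary speech/nonspeech label, and the error rate is the fraction of frames whose predicted label differs from the reference. The function names and the toy label sequences are hypothetical and are not taken from the paper; the last lines work out the relative reduction implied by the two reported figures.

```python
def frame_error_rate(reference, hypothesis):
    """Fraction of frames whose label differs from the reference.

    Both arguments are equal-length sequences of per-frame labels
    (assumed here: 1 = speech, 0 = nonspeech).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    if len(reference) == 0:
        raise ValueError("label sequences must not be empty")
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Toy example (made-up labels): 2 of 8 frames disagree.
reference = [1, 1, 0, 0, 1, 1, 1, 0]
hypothesis = [1, 0, 0, 0, 1, 1, 0, 0]
toy_fer = frame_error_rate(reference, hypothesis)

# Arithmetic on the two figures reported in the abstract (in percent).
reported_gain = relative_reduction(30.6, 25.3)
print(f"toy FER: {toy_fer:.2%}, relative reduction: {reported_gain:.1%}")
```

Under this reading, the reported drop from 30.6% to 25.3% is 5.3 points absolute, or roughly 17% relative.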

Index Terms: segmentation, speech detection, voice activity detection, video


Bibliographic reference. Misra, Ananya (2012): "Speech/nonspeech segmentation in web videos", in INTERSPEECH-2012, 1977-1980.