Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification, and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings, often tailored toward end applications. However, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches and understand their strengths and weaknesses. In this paper, we describe a new dataset, which we will release publicly, containing densely labeled speech activity in YouTube videos, with the goal of creating a shared, available dataset for this task. The labels in the dataset annotate three different speech activity conditions: clean speech, speech co-occurring with music, and speech co-occurring with noise, which enable analysis of model performance in more challenging conditions based on the presence of overlapping background audio. We report benchmark performance numbers on AVA-Speech using off-the-shelf, state-of-the-art audio and vision models, which serve as baselines to facilitate future research.
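To make the idea of dense condition labels concrete, the sketch below represents each annotated segment as a (start, end, condition) record and converts the segments into frame-level speech/non-speech targets, as one might do when scoring a detector against such annotations. The label strings, segment times, and frame rate are hypothetical placeholders for illustration, not the dataset's actual release format.

```python
from dataclasses import dataclass

# Hypothetical condition names for illustration; the released label
# strings and file format may differ from these.
SPEECH_CONDITIONS = {"CLEAN_SPEECH", "SPEECH_WITH_MUSIC", "SPEECH_WITH_NOISE"}

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # one of SPEECH_CONDITIONS or "NO_SPEECH"

def frame_targets(segments, duration, hop=0.010):
    """Convert densely labeled segments into per-frame binary speech targets."""
    n_frames = int(duration / hop)
    targets = [0] * n_frames
    for seg in segments:
        if seg.label in SPEECH_CONDITIONS:
            lo = int(seg.start / hop)
            hi = min(n_frames, int(seg.end / hop))
            for i in range(lo, hi):
                targets[i] = 1
    return targets

# Hypothetical 10-second clip with three contiguous labeled segments.
segments = [
    Segment(0.0, 3.2, "NO_SPEECH"),
    Segment(3.2, 7.5, "SPEECH_WITH_MUSIC"),
    Segment(7.5, 10.0, "CLEAN_SPEECH"),
]
print(sum(frame_targets(segments, duration=10.0)))  # count of speech frames
```

Keeping the per-segment condition label (rather than collapsing to a binary speech flag) is what allows performance to be broken down separately for clean, music-overlapped, and noise-overlapped speech.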
DOI: 10.21437/Interspeech.2018-2028
Cite as: Chaudhuri, S., Roth, J., Ellis, D.P.W., Gallagher, A., Kaver, L., Marvin, R., Pantofaru, C., Reale, N., Guarino Reid, L., Wilson, K., Xi, Z. (2018) AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies. Proc. Interspeech 2018, 1239-1243, DOI: 10.21437/Interspeech.2018-2028.
@inproceedings{Chaudhuri2018,
  author={Sourish Chaudhuri and Joseph Roth and Daniel P. W. Ellis and Andrew Gallagher and Liat Kaver and Radhika Marvin and Caroline Pantofaru and Nathan Reale and Loretta {Guarino Reid} and Kevin Wilson and Zhonghua Xi},
  title={{AVA-Speech}: A Densely Labeled Dataset of Speech Activity in Movies},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={1239--1243},
  doi={10.21437/Interspeech.2018-2028},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2028}
}