This paper uses an unconventional analysis as a tool to diagnose the problems with three different speech activity detection systems. The analysis scores the frames in an audio file in order of confidence, starting with the frame we are most confident in and progressing toward less and less confident frames. By keeping track of the cumulative number of errors, we can determine how the errors are distributed across the data. Using speech activity detection on highly degraded audio as a case example, we show how this simple analysis can yield useful insight into both system performance and the data itself. In our case example, we use the analysis to establish three main points. First, a small percentage of the frames accounts for a lion's share of the errors. Second, three different systems perform very poorly on the same small subset of data, despite the fact that the systems adopt very different decoding algorithms and features. In other words, three very different systems agree on which data is "hard". Third, the "hard" data is primarily characterized by its proximity to speech-nonspeech boundaries. Through follow-up analyses, we show that this phenomenon is not merely an artifact of ground truth inaccuracy, but rather a steady progression of the data becoming harder and harder to classify correctly as one moves closer to the boundaries. Through this case example, we demonstrate the utility of confidence-based scoring as a general diagnostic tool for detection tasks on time-series data.
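The scoring procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the confidence scores, predictions, and ground-truth labels below are hypothetical toy data, and `cumulative_error_curve` is an assumed helper name.

```python
import numpy as np

def cumulative_error_curve(scores, predictions, truth):
    """Sort frames from most to least confident, then return the
    cumulative error count as less confident frames are added."""
    order = np.argsort(-np.abs(scores))  # most confident frames first
    errors = (predictions[order] != truth[order]).astype(int)
    return np.cumsum(errors)

# Toy example: six frames with signed confidence scores,
# hard predictions, and ground-truth labels.
scores = np.array([0.9, 0.1, -0.8, 0.2, -0.95, 0.05])
preds  = np.array([1, 1, 0, 1, 0, 1])
truth  = np.array([1, 0, 0, 0, 0, 0])

curve = cumulative_error_curve(scores, preds, truth)
print(curve)  # errors accumulate only among the least confident frames
```

Plotting such a curve against the number of frames scored shows at a glance whether errors are spread uniformly or concentrated in a small, low-confidence subset of the data, which is the diagnostic observation the paper builds on.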
Bibliographic reference. Tsai, T. J. / Janin, Adam (2013): "Confidence-based scoring: a useful diagnostic tool for detection tasks", In INTERSPEECH-2013, 737-741.