EUROSPEECH '89

Speech recognition researchers must often compare the performance of two recognizers or classifiers. In particular, they must assess the statistical significance of any difference in the performance of two algorithms. Gillick and Cox [1] have addressed this issue and proposed using McNemar's test to compare recognizer performance. This test is based only on the classification decisions made by each recognizer. In this paper, we propose several significance tests based on the recognizer's confidence level as well. In particular, we consider paired versions of the sign test, the signed-rank test and the t-test, using as statistics the log probability assigned to the correct class and the difference in log probabilities between the correct class and the highest-scoring class other than the correct class. We show that for a phoneme recognition task involving over 2000 fricatives excised from continuous speech, retaining this information yields a larger probability of obtaining statistically significant results than does McNemar's test for a test set of a given size.
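The contrast between a decision-based test and a score-based test can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the exact binomial form of both tests, and the eight-token data set are assumptions made purely for illustration. McNemar's test here uses only the tokens on which exactly one recognizer is correct, while the paired sign test uses the sign of the difference in log probability assigned to the correct class on every token.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n trials with p = 0.5."""
    if n == 0:
        return 1.0  # no informative tokens, so no evidence of a difference
    tail = min(k, n - k)
    p = 2.0 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


def mcnemar_exact(correct_a, correct_b):
    """Exact McNemar test on per-token right/wrong decisions."""
    # Only tokens where the two recognizers disagree carry information.
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    return binomial_two_sided_p(a_only, a_only + b_only)


def paired_sign_test(scores_a, scores_b):
    """Paired sign test on per-token scores; ties are discarded."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    positives = sum(1 for d in diffs if d > 0)
    return binomial_two_sided_p(positives, len(diffs))


# Hypothetical toy data (not from the paper): eight test tokens.
# Recognizer A is correct on all eight; recognizer B misses the last two.
correct_a = [True] * 8
correct_b = [True] * 6 + [False] * 2

# Hypothetical log probabilities assigned to the correct class.
logp_a = [-0.2, -0.3, -0.1, -0.4, -0.2, -0.5, -0.9, -1.1]
logp_b = [-0.4, -0.5, -0.3, -0.6, -0.5, -0.8, -1.6, -1.9]

p_mcnemar = mcnemar_exact(correct_a, correct_b)
p_sign = paired_sign_test(logp_a, logp_b)

print(p_mcnemar)  # 0.5
print(p_sign)     # 0.0078125
```

On this toy set the two recognizers disagree on only two tokens, so McNemar's test has almost nothing to work with, whereas the sign test draws on all eight paired scores. The signed-rank test and the paired t-test would use the same per-token differences, weighting them by rank or by magnitude respectively.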
Bibliographic reference. Marcus, Jeffrey N. (1989): "Significance tests for comparing speech recognizer performance using small test sets", In EUROSPEECH-1989, 2465-2468.