Speech recognition researchers must often compare the performance of two recognizers or classifiers. In particular, they must assess the statistical significance of any differences exhibited in the performance of two algorithms. Gillick and Cox have addressed this issue and proposed using McNemar's test to compare recognizer performance. This test is based on the classification decisions made by each recognizer. In this paper, we propose several significance tests based on the confidence level of the recognizer as well. In particular, we consider paired versions of the sign test, the signed-rank test and the t-test, using as statistics the log probability assigned to the correct class and the difference in log probabilities between the correct class and the highest-scoring class besides the correct class. We show that for a phoneme recognition task involving over 2000 fricatives excised from continuous speech, the retention of this information yields a larger probability of obtaining statistically significant results than does McNemar's test for a test set of a given size.
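The three paired tests named in the abstract can be sketched as follows. This is a minimal illustration using SciPy, not the paper's implementation; the synthetic log-probability scores, the effect size, and all variable names are assumptions for demonstration only.

```python
# Sketch: paired significance tests on recognizer confidence scores.
# The data here are synthetic (an assumption), standing in for the log
# probability each recognizer assigns to the correct class on the same
# test tokens, so the samples are paired token by token.
import numpy as np
from scipy.stats import wilcoxon, ttest_rel, binomtest

rng = np.random.default_rng(0)
n = 200  # number of paired test tokens (illustrative)

# Log probability of the correct class under recognizer A, and under
# recognizer B (simulated as slightly better on average).
logp_a = rng.normal(-2.0, 0.5, n)
logp_b = logp_a + rng.normal(0.1, 0.3, n)

d = logp_b - logp_a  # paired differences in confidence

# Paired sign test: only the sign of each difference is used.
n_pos = int(np.sum(d > 0))
n_nonzero = int(np.sum(d != 0))
p_sign = binomtest(n_pos, n_nonzero, 0.5, alternative="two-sided").pvalue

# Wilcoxon signed-rank test: uses the ranks of |d| as well as the signs.
p_rank = wilcoxon(logp_b, logp_a).pvalue

# Paired t-test: uses the full magnitude of the differences.
p_t = ttest_rel(logp_b, logp_a).pvalue
```

The three tests trade robustness for power: the sign test discards all magnitude information, the signed-rank test keeps ranks, and the t-test uses the raw differences, which is why retaining the recognizer's confidence scores can yield significance on smaller test sets than a decision-only test such as McNemar's.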
Bibliographic reference. Marcus, Jeffrey N. (1989): "Significance tests for comparing speech recognizer performance using small test sets", In EUROSPEECH-1989, 2465-2468.