Directly Comparing the Listening Strategies of Humans and Machines

Michael I. Mandel

In a given noisy environment, human listeners can more accurately identify spoken words than automatic speech recognizers. It is not clear, however, what information the humans are able to utilize in doing so that the machines are not. This paper uses a recently introduced technique to directly characterize the information used by humans and machines on the same task. The task was a forced choice between eight sentences spoken by a single talker from the small-vocabulary GRID corpus that were selected to be maximally confusable with one another. These sentences were mixed with “bubble” noise, which is designed to reveal randomly selected time-frequency glimpses of the sentence. Responses to these noisy mixtures allowed the identification of time-frequency regions that were important for each listener to recognize each sentence, i.e., regions that were frequently audible when a sentence was correctly identified and inaudible when it was not. In comparing these regions across human and machine listeners, we found that dips in noise allowed the humans to recognize words based on informative speech cues. In contrast, the baseline CHiME-2-GRID recognizer correctly identified sentences only when the time-frequency profile of the noisy mixture matched that of the underlying speech.
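The "bubble" noise described above works by suppressing an otherwise loud masking noise inside randomly placed time-frequency regions, so that only those glimpses of the speech are audible. A minimal sketch of such a mask is below; the spectrogram dimensions, bubble count, and bubble widths are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def bubble_mask(n_freq=256, n_frames=200, n_bubbles=15,
                sigma_f=8.0, sigma_t=5.0, rng=None):
    """Build a time-frequency noise-attenuation mask with random 'bubbles'.

    The mask is ~1 (noise at full level) everywhere except near randomly
    chosen bubble centers, where it dips toward 0, suppressing the noise
    and revealing a glimpse of the underlying speech at that point.
    All shapes and parameters here are illustrative placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    f = np.arange(n_freq)[:, None]      # frequency bin index, shape (n_freq, 1)
    t = np.arange(n_frames)[None, :]    # time frame index, shape (1, n_frames)
    suppression = np.zeros((n_freq, n_frames))
    for _ in range(n_bubbles):
        fc = rng.uniform(0, n_freq)     # random bubble center (frequency)
        tc = rng.uniform(0, n_frames)   # random bubble center (time)
        # Gaussian bump: strongest suppression at the bubble center
        suppression += np.exp(-0.5 * (((f - fc) / sigma_f) ** 2
                                      + ((t - tc) / sigma_t) ** 2))
    return np.clip(1.0 - suppression, 0.0, 1.0)

mask = bubble_mask(rng=np.random.default_rng(0))
# Multiplying a noise spectrogram by `mask` before adding it to the speech
# leaves audible speech glimpses wherever the mask dips toward 0.
```

Averaging such masks over many trials, separately for correct and incorrect responses, is what allows the important time-frequency regions for each listener and sentence to be identified.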

DOI: 10.21437/Interspeech.2016-932

Cite as

Mandel, M.I. (2016) Directly Comparing the Listening Strategies of Humans and Machines. Proc. Interspeech 2016, 660-664.

@inproceedings{mandel16_interspeech,
  author={Michael I. Mandel},
  title={Directly Comparing the Listening Strategies of Humans and Machines},
  booktitle={Interspeech 2016},
  year={2016},
  pages={660--664},
  doi={10.21437/Interspeech.2016-932}
}