Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

Vocal Emotion Recognition with Cochlear Implants

Xin Luo, Qian-Jie Fu, John J. Galvin III

House Ear Institute, USA

Besides conveying linguistic information, spoken language can also transmit important cues regarding the emotion of a talker. These prosodic cues are most strongly coded by changes in amplitude, pitch, speech rate, voice quality and articulation. The present study investigated the ability of cochlear implant (CI) users to recognize vocal emotions, as well as the relative contributions of spectral and temporal cues to vocal emotion recognition. An English sentence database was recorded for the experiment; each test sentence was produced according to each of five target emotions. Vocal emotion recognition was tested in 6 CI and 6 normal-hearing (NH) subjects. With unprocessed speech, NH listeners’ mean vocal emotion recognition performance was 90% correct, while CI users’ mean performance was only 45% correct.

Vocal emotion recognition was also measured in NH subjects listening to acoustic, sine-wave vocoder simulations of CI processing. To test the contribution of spectral cues, 1-, 2-, 4-, 8- and 16-channel CI processors were simulated; to test the contribution of temporal cues, the temporal envelope filter cutoff frequency in each channel was set to either 50 or 500 Hz.

Results showed that both spectral and temporal cues significantly contributed to performance. With the 50-Hz envelope filter, performance generally improved as the number of spectral channels was increased. With the 500-Hz envelope filter, performance significantly improved only when the spectral resolution was increased from 1 to 2 channels, and then from 2 to 16 channels. For all but the 16-channel simulations, increasing the envelope filter cutoff frequency from 50 Hz to 500 Hz significantly improved performance. CI users’ vocal emotion recognition performance was statistically similar to that of NH subjects listening to 1-8 spectral channels with the 50-Hz envelope filter, and to 1 channel with the 500-Hz envelope filter. The results suggest that, while spectral cues may contribute more strongly to recognition of linguistic information, temporal cues may contribute more strongly to recognition of emotional content coded in spoken language.
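The sine-wave vocoder simulation described above can be sketched in a few lines: each channel bandpass-filters the input, extracts the temporal envelope by rectification and lowpass filtering at the chosen cutoff (50 or 500 Hz), and uses that envelope to modulate a sine carrier at the band center. The log-spaced band edges, overall frequency range, and filter orders below are illustrative assumptions, not the exact parameters used in the study.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def sine_vocoder(x, fs, n_channels=4, env_cutoff=50.0,
                 f_lo=100.0, f_hi=7000.0):
    """Sketch of a sine-wave vocoder CI simulation.

    Band edges are log-spaced between f_lo and f_hi (an assumption;
    the study's exact corner frequencies are not given in the abstract).
    """
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
    env_sos = butter(2, env_cutoff, btype='low', fs=fs, output='sos')
    t = np.arange(len(x)) / fs
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Analysis band: 4th-order Butterworth bandpass
        band_sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        band = sosfilt(band_sos, x)
        # Temporal envelope: full-wave rectify, then lowpass at env_cutoff
        env = sosfilt(env_sos, np.abs(band))
        # Sine carrier at the band's geometric-mean center frequency
        carrier = np.sin(2 * np.pi * np.sqrt(lo * hi) * t)
        out += env * carrier
    return out
```

Raising `env_cutoff` from 50 to 500 Hz passes periodicity (voice-pitch) fluctuations through the envelope, which is the temporal-cue manipulation tested in the study; varying `n_channels` from 1 to 16 is the spectral-cue manipulation.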


Bibliographic reference. Luo, Xin / Fu, Qian-Jie / Galvin III, John J. (2006): "Vocal emotion recognition with cochlear implants", in INTERSPEECH-2006, paper 1315-Wed2BuP.11.