Speech Prosody 2008
In this paper, a joint prosodic and spectral modeling framework is proposed as an alternative to traditional score-domain fusion approaches, in order to alleviate the problem of mismatched channel, handset, and ambient-noise conditions. The basic idea is to embed the hierarchical structure of speech prosody into an ergodic HMM (EHMM), and to model the prosodic states, their transitions, and the prosodic/spectral features by the EHMM's states, state transition probabilities, and state-dependent observation distributions, respectively. Experimental results on the standard single-speaker detection task of the NIST 2001 Speaker Recognition Evaluation (NIST-SRE 2001) show that the proposed approach not only outperformed the spectral-feature-based baseline (8.04% vs. 8.64% equal error rate, EER) but also performed slightly better than the score-domain fusion approach (8.44%).
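As an illustration of the modeling idea, the following is a minimal sketch of how an ergodic HMM scores a sequence of feature vectors with the forward algorithm. It is not the authors' implementation. The assumptions here are that each frame is a single vector joining prosodic and spectral features, that each state has one diagonal-covariance Gaussian as its observation distribution, and that every state can reach every other state (a full transition matrix). All function and parameter names are hypothetical.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log density of frame x (D,) under K diagonal Gaussians.

    means and variances have shape (K, D); returns shape (K,).
    Assumption: one Gaussian per state (the paper may use richer densities).
    """
    diff = x - means
    return -0.5 * np.sum(np.log(2.0 * np.pi * variances) + diff ** 2 / variances, axis=1)

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along the given axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def ehmm_log_likelihood(obs, log_pi, log_A, means, variances):
    """Forward algorithm in the log domain for an ergodic HMM.

    obs:    (T, D) joint prosodic/spectral feature vectors (assumed layout)
    log_pi: (K,)   log initial state probabilities
    log_A:  (K, K) log transition probabilities, log_A[i, j] = log P(j | i)
    """
    alpha = log_pi + log_gauss_diag(obs[0], means, variances)
    for x in obs[1:]:
        # alpha_new[j] = logsum_i(alpha[i] + log_A[i, j]) + log b_j(x)
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_gauss_diag(x, means, variances)
    return logsumexp(alpha, axis=0)
```

In a verification setting, a score of this kind computed under a claimed speaker's model would typically be compared with the score under a background model; that decision step is outside the scope of this sketch.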
Bibliographic reference. Liao, Yuan-Fu / Chang, Wen-Chieh / Xie, Zong-You / Zeng, Ding-Yun / Juang, Yau-Tarng (2008): "Joint prosodic and spectral modeling for robust speaker verification", In SP-2008, 143-146.