12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Frame-Level Vocal Effort Likelihood Space Modeling for Improved Whisper-Island Detection

Chi Zhang, John H. L. Hansen

University of Texas at Dallas, USA

In this study, a frame-based vocal effort likelihood space modeling framework for improved whisper-island detection within normally phonated audio streams is proposed. The method first trains traditional Gaussian mixture models for whisper and neutral speech, which are then employed to extract a newly proposed discriminative feature set, entitled Vocal Effort Likelihood (VEL), for whisper-island detection. The VEL feature set is integrated within a BIC/T²-BIC segmentation scheme for vocal effort change point (VECP) detection. With the dimension-reduced 2-D VEL feature set, the proposed framework has lower computational cost than the prior method [1]. Experimental results for whisper-island detection using the UT-VocalEffort II corpus are presented and compared with the previous algorithm introduced in [1]. The proposed algorithm improves VECP detection performance, achieving the lowest Multi-Error Score (MES) of 6.33. Furthermore, highly accurate whisper-island detection was obtained with the proposed algorithm, which is useful for sustained performance in speech systems (ASR, speaker ID, etc.) that may encounter whispered speech. Finally, the proposed algorithm achieves a 100% detection rate, representing the best whisper-island detection performance at the lowest computational cost reported in the literature to date.
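As a rough illustration of the frame-level VEL idea described above, the sketch below trains one GMM per vocal-effort class and maps each frame to a 2-D point of per-class log-likelihoods. This is an assumption-laden reconstruction, not the authors' implementation: it uses scikit-learn's `GaussianMixture`, synthetic stand-in vectors instead of real acoustic features from the UT-VocalEffort II corpus, and an arbitrary component count.

```python
# Hypothetical sketch of a frame-level Vocal Effort Likelihood (VEL) feature:
# each frame's log-likelihood under a whisper GMM and under a neutral-speech
# GMM forms a 2-D discriminative feature. The data here is synthetic; in
# practice the frames would be acoustic feature vectors (e.g., MFCCs).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-ins for 13-dim feature frames from each vocal-effort class.
whisper_train = rng.normal(loc=-2.0, scale=1.0, size=(500, 13))
neutral_train = rng.normal(loc=2.0, scale=1.0, size=(500, 13))

# One GMM per vocal-effort class (4 components is an assumed setting).
gmm_whisper = GaussianMixture(n_components=4, random_state=0).fit(whisper_train)
gmm_neutral = GaussianMixture(n_components=4, random_state=0).fit(neutral_train)

def vel_features(frames):
    """Map each frame x to its 2-D VEL point:
    [log p(x | whisper GMM), log p(x | neutral GMM)]."""
    return np.column_stack([gmm_whisper.score_samples(frames),
                            gmm_neutral.score_samples(frames)])

# Ten test frames: five whisper-like followed by five neutral-like.
test = np.vstack([rng.normal(-2.0, 1.0, (5, 13)),
                  rng.normal(2.0, 1.0, (5, 13))])
vel = vel_features(test)
print(vel.shape)  # (10, 2)
```

The resulting 2-D VEL trajectory is what a BIC/T²-BIC segmentation scheme would then scan for vocal effort change points, in place of the original higher-dimensional features.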


  1. C. Zhang and J.H.L. Hansen, “An unsupervised effective algorithm for whisper-island detection,” IEEE Transactions on Audio, Speech, and Language Processing, vol. PP, no. 99, p. 1, 2010.


Bibliographic reference.  Zhang, Chi / Hansen, John H. L. (2011): "Frame-level vocal effort likelihood space modeling for improved whisper-island detection", In INTERSPEECH-2011, 2421-2424.