Do GMM Phoneme Classifiers Perceive Synthetic Sibilants as Humans Do?

Gábor Pintér, Hiroki Watanabe

This study presents a psycholinguistically motivated evaluation method for phoneme classifiers by using non-categorical perceptual data elicited in a Japanese sibilant matching 2AFC task. Probability values of a perceptual [s]-[ʃ] boundary, obtained from 42 speakers over a 7-step synthetic [s]-[ʃ] continuum, were compared to probability estimates of Gaussian mixture models (GMMs) of Japanese [s] and [ʃ]. The GMMs, trained on the Corpus of Spontaneous Japanese, differed in feature vectors (MFCC, PLP, acoustic features), covariance matrix types (full, tied, diagonal, spherical), and numbers of mixtures (1–20). Using ten-fold cross-validation, it was found that GMMs trained on MFCC features had the best sibilant classification accuracies (87.4–90.4%), but their correlations with human perceptual data were inconclusive (0.35–0.98). Acoustic feature-based GMMs with tied covariance matrices showed near human-like perception of the synthetic stimuli (0.957–0.996), but their classification performance was poor (71.3–80.4%). Models trained on perceptual linear prediction (PLP) features were on par with the acoustic feature-based models in terms of correlation with the perceptual experiment (0.884–0.995), while losing slightly on classification performance (86.1–88.9%) compared to the MFCC models. Across-the-board correlation tests and mixed-effects models confirmed that GMMs with better sibilant classification performance produced more human-like probability estimates on the synthetic sibilant continuum.
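The comparison described above can be sketched in a few lines: train one GMM per sibilant class, compute posterior probabilities along a stimulus continuum, and correlate them with human response proportions. The sketch below uses scikit-learn with synthetic stand-in data; the feature values, mixture count, and human response figures are all hypothetical placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.mixture import GaussianMixture

# Hypothetical 12-dimensional feature vectors standing in for MFCC/PLP frames.
rng = np.random.default_rng(0)
X_s  = rng.normal(loc=-1.0, scale=0.5, size=(200, 12))  # stand-in [s] tokens
X_sh = rng.normal(loc=+1.0, scale=0.5, size=(200, 12))  # stand-in [ʃ] tokens

# One GMM per phoneme class; covariance_type mirrors the paper's
# full/tied/diag/spherical comparison (here: tied).
gmm_s  = GaussianMixture(n_components=4, covariance_type="tied",
                         random_state=0).fit(X_s)
gmm_sh = GaussianMixture(n_components=4, covariance_type="tied",
                         random_state=0).fit(X_sh)

# A 7-step continuum interpolating from [s]-like to [ʃ]-like features.
steps = np.linspace(-1.0, 1.0, 7)
continuum = np.stack([np.full(12, v) for v in steps])

# Posterior P([ʃ] | x) via Bayes' rule, assuming equal class priors.
ll_s  = gmm_s.score_samples(continuum)   # log p(x | [s])
ll_sh = gmm_sh.score_samples(continuum)  # log p(x | [ʃ])
p_sh = 1.0 / (1.0 + np.exp(ll_s - ll_sh))

# Hypothetical human [ʃ]-response proportions over the 7 steps.
human = np.array([0.02, 0.05, 0.15, 0.50, 0.85, 0.95, 0.98])
r, _ = pearsonr(p_sh, human)  # correlation between model and listeners
```

With well-separated classes the posterior rises sigmoid-like across the continuum, so the Pearson correlation with a human identification curve is high; the paper's question is how this correlation trades off against classification accuracy across feature types and covariance structures.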

DOI: 10.21437/Interspeech.2016-325

Cite as

Pintér, G., Watanabe, H. (2016) Do GMM Phoneme Classifiers Perceive Synthetic Sibilants as Humans Do?. Proc. Interspeech 2016, 1363-1367.

@inproceedings{pinter16_interspeech,
  author={Gábor Pintér and Hiroki Watanabe},
  title={Do GMM Phoneme Classifiers Perceive Synthetic Sibilants as Humans Do?},
  booktitle={Interspeech 2016},
  year={2016},
  pages={1363--1367},
  doi={10.21437/Interspeech.2016-325}
}