Auditory-Visual Speech Processing (AVSP) 2010
Hakone, Kanagawa, Japan
This paper introduces a general approach to binary classification of audiovisual data. The intended application is the detection of specific phonemic mispronunciations when only very sparse training data are available. The system uses a Support Vector Machine (SVM) classifier with features obtained by applying a Time Varying Discrete Cosine Transform (TV-DCT) to the audio log-spectrum and to the image sequences. The concatenated feature vectors from both modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels, with an average of 58 training instances per vowel. Performance was largely unaffected when the system was tested on data from a speaker who was not included in the training set.
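As an illustration of the feature extraction step, the following minimal sketch shows one plausible reading of a time-varying DCT on a log-spectrum: a DCT along the frequency axis followed by a DCT along the time axis, keeping only a few low-order coefficients. This is an assumption rather than the authors' implementation. The function names `dct_basis` and `tv_dct_features` and the coefficient counts are hypothetical. The point it illustrates is that segments of different durations map to feature vectors of the same fixed length, which can then be passed to feature selection and an SVM.

```python
import numpy as np

def dct_basis(n_points, n_coeffs):
    """Return the first n_coeffs orthonormal DCT-II basis functions.

    Output shape is (n_coeffs, n_points). Hypothetical helper, for illustration.
    """
    n = np.arange(n_points)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / n_points) * np.sqrt(2.0 / n_points)
    basis[0] /= np.sqrt(2.0)  # scale the constant (DC) row for orthonormality
    return basis

def tv_dct_features(log_spec, n_freq_coeffs=4, n_time_coeffs=3):
    """Compress a (n_frames, n_bins) log-spectrum into a fixed-length vector.

    Assumed interpretation of TV-DCT: a separable 2-D DCT over time and
    frequency, truncated to low-order terms. The coefficient counts are
    illustrative defaults, not values from the paper. The segment should
    have at least n_time_coeffs frames and n_freq_coeffs bins.
    """
    n_frames, n_bins = log_spec.shape
    freq_basis = dct_basis(n_bins, n_freq_coeffs)    # (Kf, n_bins)
    time_basis = dct_basis(n_frames, n_time_coeffs)  # (Kt, n_frames)
    coeffs = time_basis @ log_spec @ freq_basis.T    # (Kt, Kf)
    return coeffs.ravel()

# Two synthetic "segments" of different durations yield equal-length vectors.
rng = np.random.default_rng(0)
short_segment = rng.normal(size=(10, 64))
long_segment = rng.normal(size=(25, 64))
print(tv_dct_features(short_segment).shape, tv_dct_features(long_segment).shape)
```

The same transform could be applied to a sequence of image frames by treating each frame's pixel or region values as the second axis, after which the audio and visual vectors would be concatenated.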
Index Terms: Time Varying-DCT, Genetic Algorithms, MRMR, CAPT
Bibliographic reference. Picard, Sébastien / Ananthakrishnan, G. / Wik, Preben / Engwall, Olov / Abdou, Sherif (2010): "Detection of specific mispronunciations using audiovisual features", In AVSP-2010, paper S7-2.