Auditory-Visual Speech Processing (AVSP) 2010

Hakone, Kanagawa, Japan
September 30-October 3, 2010

Detection of Specific Mispronunciations using Audiovisual Features

Sébastien Picard (1,2), G. Ananthakrishnan (2), Preben Wik (2), Olov Engwall (2), Sherif Abdou (3)

(1) Electronics, Telecommunications and Computer Sciences, University of Lyon, France
(2) Centre for Speech Technology, KTH, (Royal Institute of Technology), Stockholm, Sweden
(3) Faculty of Computers & Information, Cairo University, Egypt

This paper introduces a general approach for binary classification of audiovisual data. The intended application is mispronunciation detection for specific phonemic errors, using very sparse training data. The system uses a Support Vector Machine (SVM) classifier with features obtained from a Time Varying Discrete Cosine Transform (TV-DCT) on the audio log-spectrum as well as on the image sequences. The concatenated feature vectors from both modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels, with an average of 58 instances per vowel for training. The performance was largely unaffected when tested on data from a speaker who was not included in the training set.
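The TV-DCT feature described above can be illustrated with a minimal numpy-only sketch. This is not the authors' implementation; it assumes one common TV-DCT formulation: a DCT along the frequency axis of each log-spectrum frame, followed by a DCT along time over each coefficient trajectory, keeping only low-order terms to yield a compact, fixed-length vector. The truncation orders `k_time` and `k_freq` are illustrative choices, not values from the paper.

```python
import numpy as np

def dct_ii(x, k):
    """Type-II DCT of a 1-D signal, keeping the first k coefficients."""
    n = len(x)
    idx = np.arange(n)
    basis = np.cos(np.pi * (idx + 0.5)[None, :] * np.arange(k)[:, None] / n)
    return basis @ x

def tv_dct_features(log_spec, k_time=4, k_freq=6):
    """Time-Varying DCT sketch: DCT along frequency within each frame,
    then DCT along time over each coefficient's trajectory; only the
    low-order terms are kept, giving a fixed-length feature vector.
    log_spec: array of shape (frames, freq_bins)."""
    per_frame = np.array([dct_ii(frame, k_freq) for frame in log_spec])
    feats = np.array([dct_ii(per_frame[:, j], k_time) for j in range(k_freq)])
    return feats.ravel()  # shape: (k_freq * k_time,)

# Toy usage on a synthetic log-spectrum (50 frames, 64 frequency bins)
rng = np.random.default_rng(0)
spec = np.log(np.abs(rng.normal(size=(50, 64))) + 1.0)
v = tv_dct_features(spec)  # 24-dimensional feature vector
```

The same transform could be applied to pixel trajectories from the image sequences, and the audio and visual vectors concatenated before feature selection, per the pipeline the abstract outlines.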

Index Terms: Time Varying-DCT, Genetic Algorithms, MRMR, CAPT


Bibliographic reference. Picard, Sébastien / Ananthakrishnan, G. / Wik, Preben / Engwall, Olov / Abdou, Sherif (2010): "Detection of specific mispronunciations using audiovisual features", in AVSP-2010, paper S7-2.