Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

Reconstructing Tongue Movements from Audio and Video

Hedvig Kjellström (1), Olov Engwall (2), Olle Bälter (2)

(1) FOI, Sweden; (2) KTH, Stockholm, Sweden

This paper presents an approach to articulatory inversion using audio and video of the user's face, requiring no special markers. The video is stabilized with respect to the face, and the mouth region is cropped out. The mouth image is projected into a learned independent component subspace to obtain a low-dimensional representation of the mouth appearance. The inversion problem is treated as one of regression; a non-linear regressor using relevance vector machines is trained on a dataset of simultaneous images of a subject's face, acoustic features, and positions of magnetic coils glued to the subject's tongue. The results show the benefit of using both cues for inversion. We envisage the inversion method as part of a pronunciation training system with articulatory feedback.
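The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data are random placeholders, the ICA step uses scikit-learn's FastICA, and kernel ridge regression stands in for the relevance vector machine (both are kernel-based nonlinear regressors, but RVM additionally yields sparse, probabilistic predictions). Feature dimensions and coil counts are assumptions.

```python
# Hedged sketch of the abstract's pipeline: project mouth images into an
# ICA subspace, concatenate with acoustic features, regress to coil positions.
# NOTE: kernel ridge regression is a stand-in for the paper's RVM regressor,
# and all data below are synthetic placeholders.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Placeholder data: 200 frames of flattened 32x32 mouth-region images,
# and 2-D positions of 3 magnetic coils on the tongue (assumed numbers).
n_frames, img_dim, n_coils = 200, 32 * 32, 3
mouth_images = rng.normal(size=(n_frames, img_dim))
coil_positions = rng.normal(size=(n_frames, n_coils * 2))

# 1. Learn an independent component subspace of mouth appearance
#    to obtain a low-dimensional visual representation.
ica = FastICA(n_components=10, random_state=0)
visual_feats = ica.fit_transform(mouth_images)

# 2. Combine visual features with acoustic features
#    (13 random values per frame here, e.g. MFCC-like).
acoustic_feats = rng.normal(size=(n_frames, 13))
features = np.hstack([visual_feats, acoustic_feats])

# 3. Train a nonlinear regressor from combined features to coil positions.
reg = KernelRidge(kernel="rbf", alpha=1.0)
reg.fit(features[:150], coil_positions[:150])
pred = reg.predict(features[150:])
print(pred.shape)
```

The design point the abstract makes is that both modalities feed one regressor, so the benefit of combining audio and video can be measured by ablating either block of columns in `features`.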

Bibliographic reference.  Kjellström, Hedvig / Engwall, Olov / Bälter, Olle (2006): "Reconstructing tongue movements from audio and video", In INTERSPEECH-2006, paper 1071-Thu1A3O.4.