FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and
Auditory-Visual Speech Processing

Vienna, Austria
September 11-13, 2015

From Text-to-Speech (TTS) to Talking Head - A Machine Learning Approach to A/V Speech Modeling and Rendering.

Frank Soong, Lijuan Wang

Microsoft Research Asia

In this talk, we will present our research results in A/V speech modeling and rendering via a statistical, machine learning approach. A Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM) will be reviewed first for speech modeling, where the GMM models the stochastic nature of speech production while the HMM characterizes the Markovian nature of speech parameter trajectories. All speech parametric models are estimated via an EM-algorithm-based maximum likelihood procedure, and the resultant models are used to generate speech parameter trajectories for a given text input, say a sentence, in the maximum probability sense. The generated parameters are then used either to synthesize the corresponding speech waveforms via a vocoder or to render high-quality output speech with our "trajectory tiling algorithm," in which appropriate segments of the training speech database are used to "tile" the generated trajectory optimally. Similarly, the lip movements of a talking head, along with jointly moving articulators such as the jaw, tongue, and teeth, can be trained and rendered with the same optimization procedure. The visual parameters of a talking head can be collected from 2D or 3D video (using stereo or multi-camera recording equipment, or consumer-grade capture devices such as Microsoft Kinect), and the corresponding visual trajectories of intensity, color, and spatial coordinates are modeled and synthesized in the same way. Recently, feedforward Deep Neural Net (DNN) and Recurrent Neural Net (RNN) machine learning algorithms have been applied to speech modeling for both recognition and synthesis applications, and we have deployed both forms of neural nets successfully in TTS training. The RNN in particular, with its longer memory, can better model speech prosody over longer contexts, say a whole sentence. We will also cover cross-lingual TTS and talking head modeling, where audio and visual data collected in one source language are used to train a TTS system or talking head in a different target language. The mouth shapes of a monolingual speaker have also been found adequate for rendering lip-synced movements of talking heads in different languages. Various TTS and talking head demos will be shown to illustrate our research findings.
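As a hedged illustration of the maximum-probability trajectory generation described above (a minimal NumPy sketch under our own assumptions, not the authors' implementation), the following solves the standard maximum-likelihood parameter generation normal equations for a single one-dimensional parameter track; the per-frame static and delta means and variances are assumed to have already been read off a selected HMM state sequence, and all function and variable names are illustrative.

    import numpy as np

    def mlpg(mean_static, var_static, mean_delta, var_delta):
        # Solve c* = argmax_c N([c; Dc] | mu, Sigma) for one feature dimension,
        # i.e. the normal equations (W' Sigma^-1 W) c = W' Sigma^-1 mu.
        T = len(mean_static)
        I = np.eye(T)
        D = np.zeros((T, T))            # simple first-difference delta window
        for t in range(1, T):
            D[t, t], D[t, t - 1] = 1.0, -1.0
        W = np.vstack([I, D])           # stacks static and delta constraints
        mu = np.concatenate([mean_static, mean_delta])
        P = np.diag(np.concatenate([1.0 / var_static, 1.0 / var_delta]))
        return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)

    # Toy usage: a rising static mean with near-zero delta means yields a smooth track.
    track = mlpg(np.linspace(0.0, 1.0, 10), np.full(10, 0.1),
                 np.zeros(10), np.full(10, 0.05))
    print(np.round(track, 3))

In a full system, each frame's statistics would come from context-dependent HMM states, and the same machinery extends to multi-dimensional audio and visual feature vectors.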

Frank K. Soong is a Principal Researcher and Research Manager of the Speech Group at Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans more than 30 years, first with Bell Labs, US, then with ATR, Japan, before he joined MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoder algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm that was turned into voice-activated mobile phone products rated by Mobile Office Magazine (Apr. 1993) as "outstandingly the best". He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) software package. He has served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including as Associate Editor of the IEEE Transactions on Speech and Audio Processing and as chair of IEEE workshops. He has published extensively, with more than 200 papers, and co-edited a widely used reference book, Automatic Speech and Speaker Recognition - Advanced Topics (Kluwer, 1996). He is a visiting professor at the Chinese University of Hong Kong (CUHK) and a few other top-rated universities in China, and is co-Director of the National MSRA-CUHK Joint Research Lab. He received his BS, MS, and PhD degrees, all in Electrical Engineering, from National Taiwan University, the University of Rhode Island, and Stanford University, respectively. He is an IEEE Fellow "for contributions to digital processing of speech".

Lijuan Wang received her B.E. from Huazhong University of Science and Technology in 2001 and her Ph.D. from Tsinghua University, China, in 2006. She joined the Speech Group of Microsoft Research Asia in 2006, where she is currently a Lead Researcher. Her research areas include audio-visual speech synthesis, deep learning (feedforward and recurrent neural networks), and speech synthesis (TTS) and recognition. She has published more than 25 papers in top conferences and journals and is the inventor or co-inventor of more than 10 granted or pending U.S. patents. She is a Senior Member of the IEEE and a member of ISCA.

Bibliographic reference.  Soong, Frank / Wang, Lijuan (2015): "From text-to-speech (TTS) to talking head - a machine learning approach to a/v speech modeling and rendering.", In FAAVSP-2015 (abstract).