13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Inventory-Based Audio-Visual Speech Enhancement

Dorothea Kolossa (1), Robert Nickel (2), Steffen Zeiler (1), Rainer Martin (1)

(1) Institut für Kommunikationsakustik, ID 2/231, Ruhr-Universität Bochum, Germany
(2) Department of Electrical Engineering, Bucknell University, Lewisburg, PA, USA

In this paper, we propose combining audio-visual speech recognition with inventory-based speech synthesis for speech enhancement. Unlike traditional filtering-based speech enhancement, inventory-based speech synthesis avoids the usual trade-off between noise reduction and the resulting speech distortion. To this end, the processed speech signal is composed from a given speech inventory that contains snippets of speech from a targeted speaker. However, the combination of speech recognition and synthesis is susceptible to noise, as recognition errors can lead to a suboptimal selection of speech segments. The search for fitting clean speech segments can be significantly improved when audio-visual information is utilized by means of a coupled HMM recognizer and an uncertainty decoding framework. First results using this novel system are reported in terms of several instrumental measures for three types of noise.
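
The idea of composing the output from clean inventory segments can be illustrated with a minimal sketch. It is not the system described in the paper: the coupled HMM recognizer and the uncertainty decoding are replaced here by a plain Viterbi search over frame-level units, and the function name `select_units`, the feature layout, and the `join_weight` parameter are hypothetical choices made only for illustration. The target cost is the squared distance between a noisy input frame and an inventory unit, and the join cost penalizes spectral jumps between consecutively selected units.

```python
import numpy as np


def select_units(noisy_feats, inventory_feats, join_weight=1.0):
    """Pick one inventory unit per input frame with a Viterbi search.

    noisy_feats:     (T, D) features of the noisy input (assumed layout)
    inventory_feats: (N, D) features of the clean inventory units
    join_weight:     hypothetical weight of the join cost
    Returns a list of T inventory indices.
    """
    noisy = np.asarray(noisy_feats, dtype=float)
    inv = np.asarray(inventory_feats, dtype=float)

    # Target cost: squared distance between every frame and every unit, (T, N).
    target = ((noisy[:, None, :] - inv[None, :, :]) ** 2).sum(axis=2)
    # Join cost: squared distance between any two inventory units, (N, N).
    join = join_weight * ((inv[:, None, :] - inv[None, :, :]) ** 2).sum(axis=2)

    n_frames, n_units = target.shape
    cost = target[0].copy()
    back = np.zeros((n_frames, n_units), dtype=int)
    for t in range(1, n_frames):
        # total[i, j]: best cost so far ending in unit i, then moving to unit j.
        total = cost[:, None] + join
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0) + target[t]

    # Trace the cheapest path back from the last frame.
    path = [int(cost.argmin())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# Toy data: three clean inventory units and three noisy observations.
inventory = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
noisy = [[0.1, -0.1], [0.9, 1.2], [4.8, 5.1]]
nearest = select_units(noisy, inventory, join_weight=0.0)
smooth = select_units(noisy, inventory, join_weight=1000.0)
```

With `join_weight=0.0` the search reduces to a per-frame nearest-neighbor lookup, while a very large weight forces the path to stay on a single unit. A practical system would balance the two costs and, as the abstract indicates, would rely on the recognizer output rather than on raw feature distances alone.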

Index Terms: audio-visual speech enhancement, speech synthesis, unit selection, missing data techniques


Bibliographic reference. Kolossa, Dorothea / Nickel, Robert / Zeiler, Steffen / Martin, Rainer (2012): "Inventory-based audio-visual speech enhancement", in INTERSPEECH-2012, 587-590.