This work proposes a method of single-channel speaker separation that uses visual speech information to extract a target speaker's speech from a mixture of speakers. The method requires a single audio input and visual features extracted from the mouth region of each speaker in the mixture. The visual information from the speakers is used to create a visually-derived Wiener filter. The Wiener filter gains are then non-linearly adjusted by a perceptual gain transform to improve the quality and intelligibility of the target speech. Experimental results evaluate the quality and intelligibility of the extracted target speech, and different perceptual gain transforms are compared. These show that significant gains are achieved by the application of the perceptual gain function.
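The filtering pipeline described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the authors' implementation: the visually-derived estimate of the target's power spectral density is replaced by placeholder arrays, and the perceptual gain transform shown is a generic power-law compression with a gain floor, standing in for whichever transforms the paper compares.

```python
import numpy as np

def wiener_gain(target_psd, mixture_psd, eps=1e-10):
    # Classical Wiener gain: ratio of target PSD to mixture PSD, per
    # frequency bin. In the paper, target_psd would be estimated from
    # the visual speech features; here it is a placeholder.
    return target_psd / (mixture_psd + eps)

def perceptual_transform(gain, alpha=0.5, floor=0.05):
    # Hypothetical non-linear adjustment: power-law compression of the
    # gains plus a spectral floor to limit musical-noise artifacts.
    return np.maximum(gain ** alpha, floor)

# Toy per-bin PSDs for one time frame (stand-ins for real estimates).
rng = np.random.default_rng(0)
target_psd = rng.random(8)
interferer_psd = rng.random(8)
mixture_psd = target_psd + interferer_psd

g = wiener_gain(target_psd, mixture_psd)        # raw Wiener gains in [0, 1]
g_perc = perceptual_transform(g)                # perceptually adjusted gains
# g_perc would then multiply the mixture magnitude spectrum bin-by-bin.
```

Because the gains lie in [0, 1], a power-law exponent below one raises them toward unity, which is one simple way a perceptual transform can trade interferer suppression for reduced target distortion.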
Bibliographic reference. Khan, Faheem / Milner, Ben (2013): "Speaker separation using visual speech features and single-channel audio", In INTERSPEECH-2013, 3264-3268.