Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR

Sebastian Gergen, Steffen Zeiler, Ahmed Hussen Abdelaziz, Robert Nickel, Dorothea Kolossa

Automatic speech recognition (ASR) enables very intuitive human-machine interaction. However, signal degradations due to reverberation or noise reduce the accuracy of audio-based recognition. The introduction of a second signal stream that is not affected by degradations in the audio domain (e.g., a video stream) increases the robustness of ASR against degradations in the original domain. Here, depending on the signal quality of audio and video at each point in time, a dynamic weighting of both streams can optimize the recognition performance. In this work, we introduce a strategy for estimating optimal weights for the audio and video streams in turbo-decoding-based ASR using a discriminative cost function. The results show that turbo decoding with this maximally discriminative dynamic weighting of information yields higher recognition accuracy than turbo-decoding-based recognition with fixed stream weights or optimally dynamically weighted audiovisual decoding using coupled hidden Markov models.
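As a rough illustration of what dynamic stream weighting means, the sketch below fuses per-frame audio and video log-likelihoods with a time-varying weight. It assumes the log-linear combination commonly used in audiovisual ASR, lam * audio + (1 - lam) * video, which the abstract does not confirm as the exact form used in this work. The function name `combine_log_likelihoods`, the example scores, and the weights are all hypothetical. The discriminative weight estimation and the turbo decoder itself are not shown.

```python
def combine_log_likelihoods(audio_ll, video_ll, weights):
    """Fuse per-frame, per-state log-likelihoods of two streams.

    Assumed (not taken from the paper) log-linear rule for each frame t:
        fused[t][s] = lam[t] * audio_ll[t][s] + (1 - lam[t]) * video_ll[t][s]
    """
    fused = []
    for a_frame, v_frame, lam in zip(audio_ll, video_ll, weights):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("stream weight must lie in [0, 1]")
        fused.append([lam * a + (1.0 - lam) * v
                      for a, v in zip(a_frame, v_frame)])
    return fused


# Hypothetical log-likelihoods for two frames and two states.
audio_ll = [[-1.0, -3.0],   # frame 0: clean audio, clearly favors state 0
            [-2.5, -2.4]]   # frame 1: noisy audio, nearly uninformative
video_ll = [[-2.0, -2.2],   # frame 0: video is nearly uninformative
            [-4.0, -1.0]]   # frame 1: video clearly favors state 1

# Dynamic weights: trust audio in frame 0 and video in frame 1.
weights = [0.8, 0.2]

fused = combine_log_likelihoods(audio_ll, video_ll, weights)

# Index of the most likely state in each frame after fusion.
best_states = [max(range(len(frame)), key=frame.__getitem__) for frame in fused]
```

With a fixed weight, one stream would dominate in every frame. Letting the weight vary per frame allows the more reliable stream to decide at each point in time, which is the motivation given in the abstract.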

DOI: 10.21437/Interspeech.2016-166

Cite as

Gergen, S., Zeiler, S., Abdelaziz, A.H., Nickel, R., Kolossa, D. (2016) Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR. Proc. Interspeech 2016, 2135-2139.

author={Sebastian Gergen and Steffen Zeiler and Ahmed Hussen Abdelaziz and Robert Nickel and Dorothea Kolossa},
title={Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR},
booktitle={Interspeech 2016},