A Framework for Practical Multistream ASR

Sri Harish Mallidi, Hynek Hermansky

Robustness of automatic speech recognition (ASR) to acoustic mismatches can be improved by using a multistream architecture. Past multistream approaches involve training a large number of neural networks, one for each possible stream combination. During testing, each utterance is forward-passed through all of these networks to estimate the best stream combination. In this work, we propose a new framework to reduce the complexity of the multistream architecture. We show that the multiple neural networks used in past approaches can be replaced by a single neural network. This results in a significant decrease in the number of parameters used in the system. The test-time complexity is also reduced by organizing the stream combinations in a tree structure, where each node in the tree represents a stream combination. Instead of traversing all the nodes, we traverse only the paths that result in an increase in the performance monitor score. Compared to a state-of-the-art baseline system, the proposed approach yields a 13.5% relative improvement in word error rate (WER) on the Aurora4 speech recognition task. We also obtain an average 0.7% absolute decrease in WER on 5 IARPA-BABEL Year 4 languages.
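The pruned traversal described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how the tree is built or how the performance monitor is computed, so everything below is an assumption made for illustration: the root is taken to be the combination of all streams, each child drops one stream from its parent, and `toy_monitor` is a hypothetical stand-in for the real performance monitor. The names `search_stream_combinations`, `toy_monitor`, and `NOISY_STREAMS` are likewise invented here and do not come from the paper.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Combo = FrozenSet[int]  # a stream combination, as a set of stream indices


def search_stream_combinations(
    num_streams: int, monitor: Callable[[Combo], float]
) -> Tuple[Combo, float, int]:
    """Follow only the branches whose monitor score improves on the parent.

    Assumed layout (not from the paper): the root uses every stream and
    each child removes exactly one stream from its parent.
    Returns (best combination, its score, number of combinations scored).
    """
    root: Combo = frozenset(range(num_streams))
    scores: Dict[Combo, float] = {root: monitor(root)}
    best = root
    frontier: List[Combo] = [root]

    while frontier:
        parent = frontier.pop()
        if len(parent) == 1:
            continue  # a single stream has no children
        for stream in sorted(parent):
            child = parent - {stream}
            if child in scores:
                continue  # already scored via another parent
            scores[child] = monitor(child)
            # Expand the child only if it improves on its parent.
            if scores[child] > scores[parent]:
                frontier.append(child)
                if scores[child] > scores[best]:
                    best = child

    return best, scores[best], len(scores)


# Hypothetical monitor: rewards clean streams, penalizes corrupted ones.
NOISY_STREAMS = frozenset({1, 3})


def toy_monitor(combo: Combo) -> float:
    clean = len(combo - NOISY_STREAMS)
    noisy = len(combo & NOISY_STREAMS)
    return clean - 2 * noisy
```

With five streams and the toy monitor above, calling `search_stream_combinations(5, toy_monitor)` selects the three clean streams while scoring fewer than the 31 non-empty combinations that an exhaustive search would visit. A real system would replace `toy_monitor` with the performance monitor score computed from the single network's output for the given stream combination.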

DOI: 10.21437/Interspeech.2016-619

Cite as

Mallidi, S.H., Hermansky, H. (2016) A Framework for Practical Multistream ASR. Proc. Interspeech 2016, 3474-3478.

author={Sri Harish Mallidi and Hynek Hermansky},
title={A Framework for Practical Multistream ASR},
booktitle={Interspeech 2016},