Combining the outputs of several speech recognizers is a well-known way of improving speech recognition performance, and the ROVER approach handles such combinations efficiently. In this paper we show that the best performance is not achieved by combining the outputs of the best-performing set of recognizers, but rather by combining the outputs of recognizers that rely on different processing components, and in particular on a different order (forward vs. backward) for processing speech frames. Indeed, much better speech recognition results were obtained by combining outputs of Sphinx-based recognizers with outputs of Julius-based recognizers than by combining the same number of outputs from Sphinx-based recognizers only, even though the individual Sphinx-based systems led to better results than the individual Julius-based recognizers. Further experiments were conducted using Sphinx-based tools to process speech frames in reverse order (i.e. backward in time). The results clearly show that combining forward-based and backward-based decoders provides a significant improvement over combining forward-only or backward-only decoders. Experiments were conducted on the ESTER2 and ETAPE speech corpora. Overall, combining Sphinx-based and Julius-based systems yielded a word error rate of 18.6% on the ESTER2 test data and 24.5% on the ETAPE test data.
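As an illustration of why complementary errors matter when outputs are combined, the following minimal sketch applies a per-slot majority vote to three hypotheses and scores each with the word error rate. It is not the authors' implementation and not the full ROVER procedure: ROVER first aligns the hypotheses into a word transition network, whereas this sketch assumes the hypotheses are already aligned slot by slot. The function names `vote` and `word_error_rate`, the null marker `@`, and the toy sentences are all assumptions made for the example and do not come from the ESTER2 or ETAPE data.

```python
from collections import Counter


def vote(aligned_hyps, null="@"):
    """Majority vote per slot over hypotheses that are ALREADY aligned.

    Assumption: every hypothesis has the same length, with `null` marking
    an empty slot. Real ROVER builds this alignment itself.
    """
    result = []
    for slot in zip(*aligned_hyps):
        word, _count = Counter(slot).most_common(1)[0]
        if word != null:  # a winning null slot emits no word
            result.append(word)
    return result


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


# Toy data (hypothetical): each system makes one error, in a different slot.
reference = "the cat sat on the mat".split()
system_a = "the cat sat on a mat".split()
system_b = "the cat sad on the mat".split()
system_c = "the hat sat on the mat".split()

combined = vote([system_a, system_b, system_c])

for name, hyp in [("system_a", system_a), ("system_b", system_b),
                  ("system_c", system_c), ("combined", combined)]:
    print(f"{name}: WER = {word_error_rate(reference, hyp):.3f}")
```

In this toy case each system makes a single error in a different position, so the vote recovers the reference. Had all three systems made the same error, the vote could not have corrected it, which is the intuition behind combining decoders that differ in their processing components.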
Bibliographic reference. Jouvet, Denis / Fohr, Dominique (2013): "Combining forward-based and backward-based decoders for improved speech recognition performance", In INTERSPEECH-2013, 652-656.