BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example

Timo Lohrenz, Tim Fingscheidt


Optimal fusion of streams for automatic speech recognition (ASR) is a nontrivial problem. Recently, so-called posterior-in-posterior-out (PIPO-)BLSTMs have been proposed that serve as state sequence enhancers and have highly attractive training properties. In this work, we adopt PIPO-BLSTMs and employ them in the context of stream fusion for ASR. Our contributions are the following: First, we show the positive effect of a PIPO-BLSTM as a state sequence enhancer for various stream fusion approaches. Second, we confirm the advantageous context-free (CF) training property of the PIPO-BLSTM for all investigated fusion approaches. Third, using a fusion example of two streams that stem from different short-time Fourier transform window lengths, we show that all investigated fusion approaches benefit. Finally, the turbo fusion approach turns out to be best, employing a CF-type PIPO-BLSTM with a novel iterative augmentation in training.


DOI: 10.21437/Interspeech.2020-2560

Cite as: Lohrenz, T., Fingscheidt, T. (2020) BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example. Proc. Interspeech 2020, 26-30, DOI: 10.21437/Interspeech.2020-2560.


@inproceedings{Lohrenz2020,
  author={Timo Lohrenz and Tim Fingscheidt},
  title={{BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={26--30},
  doi={10.21437/Interspeech.2020-2560},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2560}
}