In this paper, we investigate the implications of real-time processing for the design of a speech activity detection (SAD) system, with a focus on the unique constraints posed by online automatic speech recognition. Our investigation is built on a real-life application of speech technology, the BBN Broadcast Monitoring System (BMS), which encapsulates a real-time automatic rich transcription system. We propose a segmentation method capable of variable-scale speech boundary detection in an online SAD system, and we evaluate how different granularities of boundary detection affect the performance of speech-to-text (STT) and speaker diarization. In addition, we evaluate the interactions between STT and speaker diarization and study mechanisms for trading off the performance of these two system components. In our experiment, the segmentation mechanism in the proposed SAD system reduces the error rates of STT and speaker diarization by 2.4% and 9.5% relative, respectively, compared to the baseline system.
Bibliographic reference. Gao, Chao / Saikumar, Guruprasad / Khanwalkar, Saurabh / Herscovici, Avi / Kumar, Anoop / Srivastava, Amit / Natarajan, Premkumar (2011): "Online speech activity detection in broadcast news", In INTERSPEECH-2011, 2637-2640.