Extending an Acoustic Data-Driven Phone Set for Spontaneous Speech Recognition

Jeong-Uk Bang, Mu-Yeol Choi, Sang-Hun Kim, Oh-Wook Kwon

In this paper, we propose a method to extend a phone set by using a large amount of Korean broadcast data to improve the performance of spontaneous speech recognition. The proposed method first extracts variable-length phoneme-level segments from broadcast data and then converts them into fixed-length latent vectors with an LSTM architecture. Next, the k-means algorithm clusters acoustically similar latent vectors, and a new phone set is built from the resulting clusters. To update the lexicon of the speech recognizer, we choose the pronunciation sequence of each word with the highest conditional probability. To verify the effectiveness of the proposed units, we visualize the spectral patterns and segment durations of the new phone set. In both spontaneous and read speech recognition tasks, the proposed units yield better performance than phoneme-based and grapheme-based units.
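The clustering step of the pipeline above can be sketched as follows. This is a minimal, hedged illustration, not the paper's implementation: it assumes the fixed-length latent vectors have already been produced by the LSTM encoder, and it implements plain k-means in NumPy (the paper does not specify a particular k-means variant or toolkit). The names `latents` and `num_units` are illustrative.

```python
import numpy as np

def cluster_latent_vectors(latents, num_units, num_iters=50, seed=0):
    """Cluster fixed-length latent vectors into a new phone set with k-means.

    latents:   (N, D) array of phoneme-segment embeddings
               (assumed to come from an LSTM encoder, as in the paper).
    num_units: size of the new, data-driven phone set.
    Returns (centroids, labels); each label indexes one new phone unit.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from distinct random samples.
    centroids = latents[rng.choice(len(latents), size=num_units, replace=False)]
    for _ in range(num_iters):
        # Assign each vector to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(latents[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster is empty.
        new_centroids = np.array([
            latents[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(num_units)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Each cluster index then serves as a candidate phone unit; acoustically similar segments end up sharing a unit regardless of their original phoneme label.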

DOI: 10.21437/Interspeech.2019-1979

Cite as: Bang, J., Choi, M., Kim, S., Kwon, O. (2019) Extending an Acoustic Data-Driven Phone Set for Spontaneous Speech Recognition. Proc. Interspeech 2019, 4405-4409, DOI: 10.21437/Interspeech.2019-1979.

@inproceedings{bang19_interspeech,
  author={Jeong-Uk Bang and Mu-Yeol Choi and Sang-Hun Kim and Oh-Wook Kwon},
  title={{Extending an Acoustic Data-Driven Phone Set for Spontaneous Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4405--4409},
  doi={10.21437/Interspeech.2019-1979}
}