End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge

Naoki Kimura, Zixiong Su, Takaaki Saeki


This work is the first attempt to apply an end-to-end, deep neural network-based automatic speech recognition (ASR) pipeline to the Silent Speech Challenge (SSC) dataset, which contains synchronized ultrasound and lip images captured while a single speaker read the TIMIT corpus without uttering audible sounds. Prior silent speech research on the SSC dataset has adapted established ASR methods, modifying them for visual speech recognition. In this work, we tested a state-of-the-art ASR method on the SSC dataset using the End-to-End Speech Processing Toolkit, ESPnet. The experimental results show that this end-to-end method achieved a character error rate (CER) of 10.1% and a word error rate (WER) of 20.5% by incorporating SpecAugment, demonstrating the potential to further improve performance with additional data collection.
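As a rough illustration of the SpecAugment-style augmentation the abstract mentions, the sketch below applies random frequency and time masking to a (time, frequency) feature matrix. The function name and mask-width defaults here are assumptions for this example, not the paper's or ESPnet's actual configuration.

```python
import numpy as np

def spec_augment(feats, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """SpecAugment-style masking sketch (hypothetical defaults).

    feats: 2-D array of shape (time, freq), e.g. filterbank features.
    Returns a masked copy; the input array is left unchanged.
    """
    rng = rng or np.random.default_rng()
    out = feats.copy()
    T, F = out.shape
    # Zero out a few random frequency bands.
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, min(freq_mask_width, F) + 1))
        f0 = int(rng.integers(0, F - w + 1))
        out[:, f0:f0 + w] = 0.0
    # Zero out a few random time spans.
    for _ in range(num_time_masks):
        w = int(rng.integers(0, min(time_mask_width, T) + 1))
        t0 = int(rng.integers(0, T - w + 1))
        out[t0:t0 + w, :] = 0.0
    return out
```

In practice such masking is applied on the fly during training, so the model sees a differently corrupted version of each utterance every epoch, which acts as a regularizer on small datasets like SSC.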


Cite as: Kimura, N., Su, Z., Saeki, T. (2020) End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge. Proc. Interspeech 2020, 1025-1026.


@inproceedings{Kimura2020,
  author={Naoki Kimura and Zixiong Su and Takaaki Saeki},
  title={{End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1025--1026}
}