Speech Quality Evaluation of Synthesized Japanese Speech Using EEG

Ivan Halim Parmonangan, Hiroki Tanaka, Sakriani Sakti, Shinnosuke Takamichi, Satoshi Nakamura

As synthesized speech technology becomes more widely used, the quality of synthesized speech must be assessed to ensure that it is acceptable. Subjective evaluation metrics, such as the mean opinion score (MOS), provide only an overall impression without further detailed information about the speech. Therefore, this study proposes predicting speech quality using electroencephalography (EEG), which is more objective and has high temporal resolution. In this paper, we use one natural speech and four types of synthesized speech lasting two to six seconds. First, to obtain ground-truth MOS, we gathered ten subjects to give an opinion score on a scale of one to five for each recording. Second, another nine subjects were asked to rate how close to natural speech each synthesized speech sounded. The subjects’ EEGs were recorded while they listened to and evaluated the speech. The best classification accuracy achieved was 96.61% using a support vector machine, 80.36% using linear discriminant analysis, and 59.9% using logistic regression. For regression, we achieved a root mean squared error as low as 1.133 using support vector regression (SVR) and 1.353 using linear regression. This study demonstrates that EEG could be used to evaluate perceived speech quality objectively.
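The two quantities the abstract reports, the MOS used as ground truth and the root mean squared error used to score the regressors, can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `mean_opinion_score` and `rmse` and all numbers in the example are hypothetical, chosen only to show the arithmetic.

```python
import math


def mean_opinion_score(ratings):
    """Average the listeners' opinion scores (scale of one to five) for one recording."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("opinion scores must lie between 1 and 5")
    return sum(ratings) / len(ratings)


def rmse(predicted, target):
    """Root mean squared error between predicted and reference scores."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical ratings from four listeners for two recordings (not data from the paper).
    mos = [mean_opinion_score([4, 5, 3, 4]), mean_opinion_score([2, 1, 2, 3])]
    # Hypothetical scores predicted by some regressor for the same two recordings.
    predictions = [3.5, 2.5]
    print("MOS per recording:", mos)
    print("RMSE of predictions:", rmse(predictions, mos))
```

On a one-to-five scale, the RMSE can be read as the typical distance, in opinion-score points, between a predicted score and the listeners' average.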

 DOI: 10.21437/Interspeech.2019-2059

Cite as: Parmonangan, I.H., Tanaka, H., Sakti, S., Takamichi, S., Nakamura, S. (2019) Speech Quality Evaluation of Synthesized Japanese Speech Using EEG. Proc. Interspeech 2019, 1228-1232, DOI: 10.21437/Interspeech.2019-2059.

  author={Ivan Halim Parmonangan and Hiroki Tanaka and Sakriani Sakti and Shinnosuke Takamichi and Satoshi Nakamura},
  title={{Speech Quality Evaluation of Synthesized Japanese Speech Using EEG}},
  booktitle={Proc. Interspeech 2019},