Automatic Estimation of Pathological Voice Quality Based on Recurrent Neural Network Using Amplitude and Phase Spectrogram

Shunsuke Hidaka, Yogaku Lee, Kohei Wakamiya, Takashi Nakagawa, Tokihiko Kaburagi


Perceptual evaluation of voice quality is widely used in laryngological practice, but its reproducibility is limited by inter- and intra-rater variability. This problem can be solved by automatic estimation of voice quality using machine learning. In previous studies, conventional acoustic features, such as jitter, have often been employed as inputs. However, many of them are vulnerable to severe hoarseness because they assume quasi-periodicity of the voice. This paper investigated non-parametric features derived from amplitude and phase spectrograms. We applied the instantaneous phase correction proposed by Yatabe et al. (2018) to extract features that could be interpreted as indicators of non-sinusoidality. Specifically, we compared log amplitude, temporal phase variation, temporal complex value variation, and their mel-scale versions. A deep neural network with a bidirectional GRU was constructed for each item of the GRBAS scale, a hoarseness evaluation method. The dataset was composed of 2545 samples of the sustained vowel /a/, with GRBAS scores labeled by an otolaryngologist. The results showed that the Hz-mel conversion improved performance in almost all cases. The best scores were obtained when using temporal phase variation along the mel scale for Grade, Rough, Breathy, and Strained, and when using log mel amplitude for Asthenic.
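As a rough illustration of the kind of spectrogram features compared above, the following minimal sketch computes a log amplitude spectrogram and a simple frame-to-frame phase variation with NumPy. It is not the authors' implementation and does not reproduce the instantaneous phase correction of Yatabe et al. (2018); as a stand-in, it removes only the phase rotation expected for a stationary sinusoid at each bin's center frequency. The function names and the parameters `n_fft` and `hop` are assumptions chosen for this example.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def log_amplitude(spec, eps=1e-8):
    """Log of the magnitude spectrogram; eps avoids log(0)."""
    return np.log(np.abs(spec) + eps)


def temporal_phase_variation(spec, n_fft=512, hop=128):
    """Wrapped phase difference between consecutive frames.

    Assumption for this sketch: the expected phase advance of a stationary
    sinusoid at each bin's center frequency (2*pi*k*hop/n_fft) is subtracted,
    so values near zero indicate sinusoid-like behavior in that bin.
    """
    phase = np.angle(spec)
    k = np.arange(spec.shape[1])
    expected = 2.0 * np.pi * k * hop / n_fft
    diff = phase[1:] - phase[:-1] - expected
    # Wrap the deviation into (-pi, pi].
    return np.angle(np.exp(1j * diff))
```

Under these assumptions, a pure tone at a bin center yields a phase variation close to zero in that bin, whereas noisy or aperiodic signals yield larger deviations, which is the intuition behind reading such features as indicators of non-sinusoidality. A mel-scale version would additionally pool the frequency axis with a mel filterbank, which is omitted here.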


DOI: 10.21437/Interspeech.2020-3228

Cite as: Hidaka, S., Lee, Y., Wakamiya, K., Nakagawa, T., Kaburagi, T. (2020) Automatic Estimation of Pathological Voice Quality Based on Recurrent Neural Network Using Amplitude and Phase Spectrogram. Proc. Interspeech 2020, 3880-3884, DOI: 10.21437/Interspeech.2020-3228.


@inproceedings{Hidaka2020,
  author={Shunsuke Hidaka and Yogaku Lee and Kohei Wakamiya and Takashi Nakagawa and Tokihiko Kaburagi},
  title={{Automatic Estimation of Pathological Voice Quality Based on Recurrent Neural Network Using Amplitude and Phase Spectrogram}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3880--3884},
  doi={10.21437/Interspeech.2020-3228},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3228}
}