Recognition of Creaky Voice from Emergency Calls

Lauri Tavi, Tanel Alumäe, Stefan Werner

Although creaky voice, or vocal fry, is a widely studied phonation mode, open questions remain about its acoustic characterization and automatic recognition, largely because creak varies considerably with conversational context. In this study, we introduce an exploratory creak recognizer based on a convolutional neural network (CNN), developed specifically for emergency calls. We focus on recognizing creaky voice in authentic emergency calls because creak detection could potentially provide information about the caller’s emotional state or about attempts at voice disguise. We trained the CNN recognition system on emergency call recordings together with other, out-of-domain speech recordings and compared it with an existing, widely used creaky voice detection system. With poor-quality emergency call recordings as test data, the existing system achieved an F1 score of 0.41, whereas our CNN system achieved an F1 score of 0.64. The results show that the CNN system performs moderately well on challenging test data despite a limited amount of training data, and that it has the potential to achieve higher F1 scores when more emergency calls are used for training.
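The comparison above is expressed as F1 scores, the harmonic mean of precision and recall. The abstract does not state whether creak is scored per frame or per segment, so the sketch below assumes frame-level binary labels (1 = creaky, 0 = not creaky). The function name `f1_score` and the toy label sequences are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def f1_score(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 for binary creak labels, assuming one label per frame (1 = creaky)."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    # Count true positives, false positives and false negatives for the creak class.
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    if tp == 0:
        # No correctly detected creak: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy example: 8 frames of manual vs. automatic creak labels.
reference = [0, 1, 1, 1, 0, 0, 1, 0]
predicted = [0, 1, 1, 0, 0, 1, 1, 0]
print(f1_score(reference, predicted))  # 3 TP, 1 FP, 1 FN -> 0.75
```

Because F1 ignores true negatives, it suits creak detection, where creaky frames are typically a small minority and a detector that never fires would still score high accuracy.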

DOI: 10.21437/Interspeech.2019-1253

Cite as: Tavi, L., Alumäe, T., Werner, S. (2019) Recognition of Creaky Voice from Emergency Calls. Proc. Interspeech 2019, 1990-1994, DOI: 10.21437/Interspeech.2019-1253.

  author={Lauri Tavi and Tanel Alumäe and Stefan Werner},
  title={{Recognition of Creaky Voice from Emergency Calls}},
  booktitle={Proc. Interspeech 2019},