Detecting Depression with Word-Level Multimodal Fusion

Morteza Rohanian, Julian Hough, Matthew Purver

Semi-structured clinical interviews are frequently used diagnostic tools for identifying depression during an assessment phase. In addition to the lexical content of a patient’s responses, multimodal cues concurrent with the responses, including those derivable from voice quality and gestural behaviour, are indicators of the patient’s motor and cognitive state. In this paper, we combine information from different modalities to train a classifier capable of detecting both the binary state of a subject (clinically depressed or not) and the level of their depression. We propose a model that performs modality fusion incrementally after each word in an utterance, using a time-dependent recurrent approach in a deep learning set-up. To mitigate noisy modalities, we utilize fusion gates that control the degree to which the audio or visual modality contributes to the final prediction. Our results show the effectiveness of word-level multimodal fusion, achieving state-of-the-art results in depression detection and outperforming early feature-level and late fusion techniques.
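The fusion-gate idea described above can be illustrated in a minimal NumPy sketch: at each word, a sigmoid gate conditioned on the word and a non-verbal modality scales how much that modality contributes to the fused representation. All parameter names, dimensions, and the random initialization below are illustrative assumptions, not the paper's actual architecture (which uses trained recurrent networks over the interview).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Toy feature sizes (hypothetical; chosen only for illustration).
d_w, d_a, d_v, d_h = 8, 6, 6, 8  # word, audio, visual, fused dims

# Hypothetical "learned" parameters: gate weights and projections.
W_a, b_a = rng.normal(size=(d_h, d_w + d_a)), np.zeros(d_h)
W_v, b_v = rng.normal(size=(d_h, d_w + d_v)), np.zeros(d_h)
P_w = rng.normal(size=(d_h, d_w))
P_a = rng.normal(size=(d_h, d_a))
P_v = rng.normal(size=(d_h, d_v))

def fuse(word, audio, visual):
    """Fuse one word's features with its concurrent audio/visual cues.

    Each non-verbal modality is scaled by a sigmoid gate computed from
    the word and that modality, so a noisy modality can be down-weighted.
    """
    g_a = sigmoid(W_a @ np.concatenate([word, audio]) + b_a)   # audio gate
    g_v = sigmoid(W_v @ np.concatenate([word, visual]) + b_v)  # visual gate
    return P_w @ word + g_a * (P_a @ audio) + g_v * (P_v @ visual)

# One fused vector per word; a recurrent layer would consume these in turn.
h = fuse(rng.normal(size=d_w), rng.normal(size=d_a), rng.normal(size=d_v))
print(h.shape)  # (8,)
```

Because the gates are elementwise and lie in (0, 1), each dimension of a noisy audio or visual feature can be attenuated independently rather than the whole modality being accepted or rejected.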

DOI: 10.21437/Interspeech.2019-2283

Cite as: Rohanian, M., Hough, J., Purver, M. (2019) Detecting Depression with Word-Level Multimodal Fusion. Proc. Interspeech 2019, 1443-1447, DOI: 10.21437/Interspeech.2019-2283.

@inproceedings{rohanian2019detecting,
  author={Morteza Rohanian and Julian Hough and Matthew Purver},
  title={{Detecting Depression with Word-Level Multimodal Fusion}},
  booktitle={Proc. Interspeech 2019},
  year={2019},
  pages={1443--1447},
  doi={10.21437/Interspeech.2019-2283}
}