Multi-Modal Sentiment Analysis Using Deep Canonical Correlation Analysis

Zhongkai Sun, Prathusha K. Sarma, William Sethares, Erik P. Bucy

This paper learns multi-modal embeddings from the text, audio, and video views (modes) of the data in order to improve downstream sentiment classification. The experimental framework also allows investigation of the relative contribution of each individual view to the final multi-modal embedding. Features derived from the three views are combined into a multi-modal embedding using Deep Canonical Correlation Analysis (DCCA) in two ways: (i) One-Step DCCA and (ii) Two-Step DCCA. Text embeddings are learned using BERT, the current state of the art in text encoders. We posit that this highly optimized algorithm dominates the contributions of the other views, though each view does contribute to the final result. Classification tasks are carried out on two benchmark data sets and on a new Debate Emotion data set; together, these demonstrate that One-Step DCCA outperforms the current state of the art in learning multi-modal embeddings.
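For readers unfamiliar with the quantity that canonical correlation methods optimize, the following is a minimal NumPy sketch of the total canonical correlation between two views. It is not the authors' implementation: it covers only the linear case, whereas in DCCA the inputs `x` and `y` would be the outputs of two neural networks trained to maximize this value. The function names, the `reg` ridge term, and its default value are assumptions made for illustration.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def total_canonical_correlation(x, y, reg=1e-4):
    """Sum of canonical correlations between two views.

    x, y: arrays of shape (n_samples, dim_x) and (n_samples, dim_y).
    reg:  small ridge term (an assumed value) that keeps the
          covariance estimates invertible.
    """
    n = x.shape[0]
    xc = x - x.mean(axis=0)  # center each view
    yc = y - y.mean(axis=0)

    # Regularized within-view covariances and the cross-covariance.
    s11 = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    s22 = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    s12 = xc.T @ yc / (n - 1)

    # Whitened cross-covariance; its singular values are the
    # canonical correlations.
    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    return np.linalg.svd(t, compute_uv=False).sum()
```

Two views that share an underlying signal yield a value close to the number of shared dimensions, while unrelated views yield a value near zero.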

DOI: 10.21437/Interspeech.2019-2482

Cite as: Sun, Z., Sarma, P.K., Sethares, W., Bucy, E.P. (2019) Multi-Modal Sentiment Analysis Using Deep Canonical Correlation Analysis. Proc. Interspeech 2019, 1323-1327, DOI: 10.21437/Interspeech.2019-2482.

  author={Zhongkai Sun and Prathusha K. Sarma and William Sethares and Erik P. Bucy},
  title={{Multi-Modal Sentiment Analysis Using Deep Canonical Correlation Analysis}},
  booktitle={Proc. Interspeech 2019},