Multimodal Deception Detection Using Automatically Extracted Acoustic, Visual, and Lexical Features

Jiaxuan Zhang, Sarah Ita Levitan, Julia Hirschberg


Deception detection in conversational dialogue has attracted much attention in recent years. Yet existing methods rely heavily on human-labeled annotations, which are costly and potentially inaccurate. In this work, we present an automated system that uses multimodal features for conversational deception detection without any human annotations. We study the predictive power of the individual modalities and combine them for better performance. We use openSMILE to extract acoustic features after applying noise reduction techniques to the original audio. Facial landmark features are extracted from the visual modality. We experiment with training facial expression detectors and with applying Fisher Vectors to encode variable-length sequences of facial landmarks. Linguistic features are extracted from automatic transcriptions of the data. We evaluate these methods on the Box of Lies dataset of deception game videos, achieving 73% accuracy using features from all modalities. This result is significantly better than previous results on this corpus, which relied on manual annotations, and also better than human performance.
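The Fisher Vector step mentioned in the abstract maps a variable-length sequence of per-frame features (here, facial landmarks) to a fixed-length vector by taking gradients with respect to the parameters of a fitted GMM. The sketch below is an illustrative implementation of the standard (Perronnin-style) Fisher Vector with a diagonal-covariance GMM, not the authors' exact pipeline; the toy landmark data and dimensions are invented for demonstration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(seq, gmm):
    """Encode a variable-length (T, D) frame sequence as a fixed-length
    Fisher Vector of size 2*K*D using a fitted diagonal-covariance GMM."""
    T, _ = seq.shape
    gamma = gmm.predict_proba(seq)            # (T, K) component posteriors
    mu = gmm.means_                           # (K, D)
    sigma = np.sqrt(gmm.covariances_)         # (K, D), diagonal std devs
    w = gmm.weights_                          # (K,)
    diff = (seq[:, None, :] - mu) / sigma     # (T, K, D) normalized residuals
    # Gradients w.r.t. means and variances, averaged over frames
    g_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sig = (gamma[:, :, None] * (diff**2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)  # L2 normalization

# Toy example: "landmark" frames with D=4 features, GMM with K=3 components
rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 4))
gmm = GaussianMixture(n_components=3, covariance_type="diag",
                      random_state=0).fit(frames)
fv_short = fisher_vector(rng.normal(size=(20, 4)), gmm)
fv_long = fisher_vector(rng.normal(size=(200, 4)), gmm)
print(fv_short.shape, fv_long.shape)  # both (24,) = 2 * K * D
```

Note that sequences of 20 and 200 frames both yield a 24-dimensional encoding, which is what makes the representation usable with fixed-input classifiers.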


DOI: 10.21437/Interspeech.2020-2320

Cite as: Zhang, J., Levitan, S.I., Hirschberg, J. (2020) Multimodal Deception Detection Using Automatically Extracted Acoustic, Visual, and Lexical Features. Proc. Interspeech 2020, 359–363, DOI: 10.21437/Interspeech.2020-2320.


@inproceedings{Zhang2020,
  author={Jiaxuan Zhang and Sarah Ita Levitan and Julia Hirschberg},
  title={{Multimodal Deception Detection Using Automatically Extracted Acoustic, Visual, and Lexical Features}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={359--363},
  doi={10.21437/Interspeech.2020-2320},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2320}
}