A Neural Turn-Taking Model without RNN

Chaoran Liu, Carlos Ishi, Hiroshi Ishiguro

Sequential data such as speech and dialogs are usually modeled with Recurrent Neural Networks (RNNs) and their derivatives, since such architectures allow information to travel through time. However, RNNs have disadvantages, including limited network depth and a GPU-unfriendly training process.

Estimating the timing of turn-taking is a critical feature of dialog systems. This task requires knowledge of past dialog context and has been modeled using RNNs in several studies. In this paper, we propose a non-RNN model for the timing estimation of turn-taking in dialogs. The proposed model takes lexical and acoustic features as its input to predict a turn’s end. We conducted experiments on four types of Japanese conversation datasets and showed that, with proper neural network design, long-term information in a dialog can propagate without a recurrent structure. The proposed model outperformed canonical RNN-based architectures on a turn-taking estimation task.

DOI: 10.21437/Interspeech.2019-2270

Cite as: Liu, C., Ishi, C., Ishiguro, H. (2019) A Neural Turn-Taking Model without RNN. Proc. Interspeech 2019, 4150-4154, DOI: 10.21437/Interspeech.2019-2270.

  author={Chaoran Liu and Carlos Ishi and Hiroshi Ishiguro},
  title={{A Neural Turn-Taking Model without RNN}},
  booktitle={Proc. Interspeech 2019},