Modeling ASR Ambiguity for Neural Dialogue State Tracking

Vaishali Pal, Fabien Guillot, Manish Shrivastava, Jean-Michel Renders, Laurent Besacier

Spoken dialogue systems typically use one or several (top-N) ASR sequence(s) for inferring the semantic meaning and tracking the state of the dialogue. However, ASR graphs, such as confusion networks (confnets), provide a compact representation of a richer hypothesis space than a top-N ASR list. In this paper, we study the benefits of using confusion networks with a neural dialogue state tracker (DST). We encode the 2-dimensional confnet into a 1-dimensional sequence of embeddings using a confusion network encoder, which can be used with any DST system. Our confnet encoder is plugged into the ‘Global-locally Self-Attentive Dialogue State Tracker’ (GLAD) model for DST and obtains significant improvements in both accuracy and inference time compared to using top-N ASR hypotheses.
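
To illustrate what collapsing a 2-dimensional confnet into a 1-dimensional sequence of embeddings can look like, here is a minimal sketch. It is not the encoder from the paper, whose details the abstract does not give. It assumes a toy confnet format (a list of bins, each holding word/posterior pairs) and simply takes the posterior-weighted mean of the word embeddings in each bin. The names `Confnet`, `encode_confnet`, `toy_embeddings` and `toy_confnet` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed toy format: a confusion network is a sequence of bins, and each bin
# holds alternative words with their posterior probabilities.
Confnet = List[List[Tuple[str, float]]]


def encode_confnet(confnet: Confnet,
                   embeddings: Dict[str, List[float]]) -> List[List[float]]:
    """Collapse each bin to one vector: the posterior-weighted mean of its
    word embeddings. Assumes every bin is non-empty with positive total mass."""
    dim = len(next(iter(embeddings.values())))
    sequence = []
    for bin_ in confnet:
        total = sum(p for _, p in bin_)
        vec = [0.0] * dim
        for word, p in bin_:
            weight = p / total  # renormalise in case the bin was pruned
            for i, x in enumerate(embeddings[word]):
                vec[i] += weight * x
        sequence.append(vec)
    return sequence


# Illustrative data only; real systems would use learned embeddings.
toy_embeddings = {"cheap": [1.0, 0.0], "chief": [0.0, 1.0], "food": [1.0, 1.0]}
toy_confnet = [[("cheap", 0.75), ("chief", 0.25)], [("food", 1.0)]]
encoded = encode_confnet(toy_confnet, toy_embeddings)
```

The result has one vector per confnet bin, so a downstream tracker that expects a single token sequence could consume it unchanged. A learned encoder would typically replace the fixed posterior weights with trainable attention over the alternatives in each bin.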

DOI: 10.21437/Interspeech.2020-1783

Cite as: Pal, V., Guillot, F., Shrivastava, M., Renders, J., Besacier, L. (2020) Modeling ASR Ambiguity for Neural Dialogue State Tracking. Proc. Interspeech 2020, 1545-1549, DOI: 10.21437/Interspeech.2020-1783.

  author={Vaishali Pal and Fabien Guillot and Manish Shrivastava and Jean-Michel Renders and Laurent Besacier},
  title={{Modeling ASR Ambiguity for Neural Dialogue State Tracking}},
  booktitle={Proc. Interspeech 2020},