Unsupervised Methods for Evaluating Speech Representations

Michael Gump, Wei-Ning Hsu, James Glass

Disentanglement is a desired property in representation learning, and a significant body of research has tried to show that it is a useful representational prior. Evaluating disentanglement is challenging, particularly for real-world data like speech, where ground-truth generative factors are typically not available. Previous work on disentangled representation learning in speech has used categorical supervision, such as phoneme or speaker identity, to disentangle grouped feature spaces. However, that approach differs from the dimension-wise view of disentanglement typical in other domains. This paper proposes using low-level acoustic features to provide the structure required to evaluate dimension-wise disentanglement. Choosing well-studied acoustic features makes grounded and descriptive evaluation possible for unsupervised representation learning. This work produces a toolkit for evaluating disentanglement in unsupervised representations of speech and evaluates its efficacy on previous research.
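As a rough illustration of what a dimension-wise check against acoustic features could look like, the sketch below correlates each latent dimension with each acoustic feature and reports, per dimension, the gap between its strongest and second-strongest correlation. This is not the toolkit described in the abstract: the function names, the correlation-gap score, and the synthetic data standing in for features such as pitch or energy are all assumptions made for illustration.

```python
import numpy as np

def correlation_matrix(latents, features):
    """Absolute Pearson correlation of every latent dim with every feature.

    latents:  array of shape (n_frames, n_dims)
    features: array of shape (n_frames, n_feats)
    Returns an array of shape (n_dims, n_feats).
    """
    z = (latents - latents.mean(axis=0)) / (latents.std(axis=0) + 1e-8)
    f = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
    return np.abs(z.T @ f) / len(z)

def dimension_gaps(corr):
    """Per latent dim: strongest minus second-strongest feature correlation.

    A large gap suggests the dimension tracks a single feature;
    a small gap suggests it mixes several (hypothetical score).
    """
    top2 = np.sort(corr, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

# Synthetic stand-ins for two acoustic features (assumed, not real speech data).
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 2))

# One latent dim per feature (plus noise) versus a linear mixture of both.
disentangled = features + 0.1 * rng.normal(size=(2000, 2))
mixing = np.array([[1.0, 1.0], [1.0, -1.0]])
entangled = features @ mixing + 0.1 * rng.normal(size=(2000, 2))

gaps_disentangled = dimension_gaps(correlation_matrix(disentangled, features))
gaps_entangled = dimension_gaps(correlation_matrix(entangled, features))
print("disentangled gaps:", gaps_disentangled)
print("entangled gaps:   ", gaps_entangled)
```

On real speech, the `features` matrix would instead hold frame-level acoustic measurements aligned with the representation's frames; which features to use and how to aggregate the scores are choices the sketch leaves open.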

DOI: 10.21437/Interspeech.2020-2990

Cite as: Gump, M., Hsu, W., Glass, J. (2020) Unsupervised Methods for Evaluating Speech Representations. Proc. Interspeech 2020, 170-174, DOI: 10.21437/Interspeech.2020-2990.

  author={Michael Gump and Wei-Ning Hsu and James Glass},
  title={{Unsupervised Methods for Evaluating Speech Representations}},
  booktitle={Proc. Interspeech 2020},