Multiview Shared Subspace Learning Across Speakers and Speech Commands

Krishna Somandepalli, Naveen Kumar, Arindam Jati, Panayiotis Georgiou, Shrikanth Narayanan

In many speech processing applications, the objective is to model different modes of variability to obtain robust speech features. In this paper, we learn speech representations in a multiview paradigm by constraining the views to known modes of variability such as speakers or spoken words. We use deep multiset canonical correlation analysis (dMCCA) because it can model more than two views in parallel to learn a subspace shared across them. In order to scale to thousands of views (e.g., speakers), we demonstrate that stochastically sampling a small number of views generalizes dMCCA to the larger set of views. To evaluate our approach, we study two different aspects of the Speech Commands Dataset: variability among the speakers and variability among the speech commands. We show that, by treating observations from one mode of variability as multiple parallel views, we can learn representations that are discriminative with respect to the other mode. We first treat different speakers as views of the same word and learn their shared subspace to represent an utterance. We then constrain the different words spoken by the same person as multiple views to learn speaker representations. Using classification and unsupervised clustering, we evaluate the efficacy of the multiview representations for identifying speech commands and speakers.
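To make the multiview idea concrete, the following is a minimal sketch of the classical linear multiset CCA (MAXVAR formulation) together with the stochastic view-sampling strategy the abstract describes: from a large pool of views, a small random subset is drawn and a shared subspace is fit across just those views. This is an illustrative linear stand-in, not the authors' deep dMCCA model; the function and variable names (`mcca`, `all_views`, etc.) are ours, and the synthetic data is for demonstration only.

```python
import numpy as np
from scipy.linalg import eigh

def mcca(views, dim=2, reg=1e-6):
    """Linear multiset CCA (MAXVAR): solve C v = lam * D v, where C is the
    covariance of the concatenated views and D its block-diagonal
    (within-view) part. Returns one projection matrix per view."""
    centered = [v - v.mean(axis=0) for v in views]
    X = np.hstack(centered)                 # samples x (sum of view dims)
    C = X.T @ X / (X.shape[0] - 1)          # full covariance of stacked views
    D = np.zeros_like(C)
    start = 0
    for v in centered:                      # copy within-view blocks into D
        d = v.shape[1]
        D[start:start + d, start:start + d] = C[start:start + d, start:start + d]
        start += d
    D += reg * np.eye(D.shape[0])           # regularize for numerical stability
    _, vecs = eigh(C, D)                    # generalized symmetric eigenproblem
    W = vecs[:, ::-1][:, :dim]              # top-`dim` shared directions
    # split the stacked projection back into per-view maps
    maps, start = [], 0
    for v in views:
        d = v.shape[1]
        maps.append(W[start:start + d])
        start += d
    return maps

# Stochastic view sampling: with many views (e.g., one per speaker),
# fit the shared subspace on a small random subset of them.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))               # shared latent signal
all_views = [z @ rng.normal(size=(2, 5)) + 0.5 * rng.normal(size=(200, 5))
             for _ in range(50)]            # 50 synthetic "speaker" views
sampled = rng.choice(50, size=4, replace=False)
maps = mcca([all_views[i] for i in sampled], dim=2)
```

In the paper's setting, each "view" would instead be a neural-network branch over utterances of one speaker (or one word), with the correlation objective optimized end-to-end over freshly sampled view subsets at each step.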

DOI: 10.21437/Interspeech.2019-3130

Cite as: Somandepalli, K., Kumar, N., Jati, A., Georgiou, P., Narayanan, S. (2019) Multiview Shared Subspace Learning Across Speakers and Speech Commands. Proc. Interspeech 2019, 2320-2324, DOI: 10.21437/Interspeech.2019-3130.

@inproceedings{somandepalli2019multiview,
  author={Krishna Somandepalli and Naveen Kumar and Arindam Jati and Panayiotis Georgiou and Shrikanth Narayanan},
  title={{Multiview Shared Subspace Learning Across Speakers and Speech Commands}},
  booktitle={Proc. Interspeech 2019},
  year={2019},
  pages={2320--2324},
  doi={10.21437/Interspeech.2019-3130}
}