ASe: Acoustic Scene Embedding Using Deep Archetypal Analysis and GMM

Pulkit Sharma, Vinayak Abrol, Anshul Thakur

In this paper, we propose a deep learning framework that combines the generalizability of Gaussian mixture models (GMMs) with the discriminative power of deep matrix factorization to learn an acoustic scene embedding (ASe) for the acoustic scene classification task. The proposed approach first builds a Gaussian mixture model-universal background model (GMM-UBM) using frame-wise spectral representations. This UBM is adapted to each waveform, and the likelihoods of its spectral frames are stored as a feature matrix. This matrix is fed to a deep matrix factorization pipeline (with recording-level max-pooling) to compute a sparse, convex discriminative representation. The proposed deep factorization model is based on archetypal analysis, a form of convex NMF that has been shown to be well suited for audio analysis. Finally, the obtained representation is mapped to a class label using a dictionary-based auto-encoder consisting of a linear, symmetric encoder and decoder with an efficient learning algorithm. The encoder projects the ASe of a waveform to the label space, while the decoder ensures that the embedding can be reconstructed, resulting in better generalization on the test data.
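The first stage of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it fits a GMM-UBM on pooled background spectral frames with scikit-learn, then evaluates one recording's frames against the mixture components to form the per-frame feature matrix described above. The component count, feature dimension, and the use of posterior responsibilities (via `predict_proba`) as a stand-in for per-component likelihoods are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Pooled frame-wise spectral features from many recordings (background data).
# Shapes are illustrative: 2000 frames of 40-dimensional spectra.
background_frames = rng.normal(size=(2000, 40))

# Universal background model: one GMM trained over all background frames.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(background_frames)

# For a single recording, evaluate every frame against each mixture
# component; the resulting (frames x components) matrix is the feature
# representation fed to the deep factorization stage.
recording_frames = rng.normal(size=(300, 40))  # 300 frames of one waveform
feature_matrix = ubm.predict_proba(recording_frames)
print(feature_matrix.shape)  # (300, 8)
```

Each row of `feature_matrix` describes how strongly one frame activates each UBM component, so recording-level max-pooling over the rows yields a fixed-size representation regardless of waveform length.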

DOI: 10.21437/Interspeech.2018-1481

Cite as: Sharma, P., Abrol, V., Thakur, A. (2018) ASe: Acoustic Scene Embedding Using Deep Archetypal Analysis and GMM. Proc. Interspeech 2018, 3299-3303, DOI: 10.21437/Interspeech.2018-1481.

@inproceedings{sharma18_interspeech,
  author={Pulkit Sharma and Vinayak Abrol and Anshul Thakur},
  title={ASe: Acoustic Scene Embedding Using Deep Archetypal Analysis and GMM},
  booktitle={Proc. Interspeech 2018},
  year={2018},
  pages={3299--3303},
  doi={10.21437/Interspeech.2018-1481}
}