Vector Quantized Temporally-Aware Correspondence Sparse Autoencoders for Zero-Resource Acoustic Unit Discovery

Batuhan Gundogdu, Bolaji Yusuf, Mansur Yesilbursa, Murat Saraclar


A recent task posed by the ZeroSpeech challenge is the unsupervised learning of the basic acoustic units present in an unknown language. Previously, we introduced recurrent sparse autoencoders fine-tuned on pairs of corresponding speech segments obtained by unsupervised term discovery, where clustering was performed at an intermediate layer whose nodes represent the acoustic unit assignments. In this paper, we extend that system by incorporating vector quantization and an adaptation of winner-take-all networks, which enforces symbol continuity through excitatory and inhibitory weights along the temporal axis. Furthermore, we exploit speaker information via adversarial training of the encoder. The ABX discriminability and low-bitrate results of the proposed approach on the ZeroSpeech 2020 challenge demonstrate the improved continuity of the encoding brought by the temporal awareness and sparsity techniques introduced in this work.
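To make the vector-quantization step concrete, the NumPy sketch below maps each frame embedding to the index of its nearest codebook entry, yielding a discrete acoustic-unit sequence. This is a minimal illustration, not the authors' implementation: the codebook size, embedding dimensionality, and the `vector_quantize` helper are all assumptions for demonstration.

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Map each frame embedding (T, D) to the index of its nearest code (K, D)."""
    # Squared Euclidean distance between every frame and every codebook entry.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (T,) discrete unit assignments

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # 8 hypothetical units, 4-dim embeddings
# Six frames lying near codes 2, 2, and 5 (two frames each), plus small noise.
frames = np.repeat(codebook[[2, 2, 5]], 2, axis=0) + 0.01 * rng.normal(size=(6, 4))
units = vector_quantize(frames, codebook)
```

In a trained model the codebook entries are learned jointly with the encoder, and the temporally-aware sparsity described above would further discourage spurious switches between units on adjacent frames.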


DOI: 10.21437/Interspeech.2020-2765

Cite as: Gundogdu, B., Yusuf, B., Yesilbursa, M., Saraclar, M. (2020) Vector Quantized Temporally-Aware Correspondence Sparse Autoencoders for Zero-Resource Acoustic Unit Discovery. Proc. Interspeech 2020, 4846-4850, DOI: 10.21437/Interspeech.2020-2765.


@inproceedings{Gundogdu2020,
  author={Batuhan Gundogdu and Bolaji Yusuf and Mansur Yesilbursa and Murat Saraclar},
  title={{Vector Quantized Temporally-Aware Correspondence Sparse Autoencoders for Zero-Resource Acoustic Unit Discovery}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4846--4850},
  doi={10.21437/Interspeech.2020-2765},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2765}
}