Competing Speaker Count Estimation on the Fusion of the Spectral and Spatial Embedding Space

Chao Peng, Xihong Wu, Tianshu Qu


This paper presents a method for estimating the number of competing speakers by fusing deep spectral and spatial embeddings. The basic idea is that mixed speech can be projected by a neural network into an embedding space in which embedding vectors are orthogonal for different speakers and parallel for the same speaker. The speaker count can therefore be estimated by computing the rank of the mean covariance matrix of the embedding vectors. The method also serves as a feature combination scheme that operates in the speaker embedding space instead of simply combining features at the input layer of the network. Experimental results show that the embedding-based method outperforms a classification-based method, in which the network directly predicts the speaker count, and that spatial features help speaker count estimation. In addition, features combined in the embedding space yield more accurate speaker counting than features combined at the input layer when tested on anechoic and reverberant datasets.
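The rank-based counting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the embeddings are already available as an array of shape (number of time-frequency points, embedding dimension), that the "mean covariance matrix" can be approximated by the uncentered average of outer products, and that the rank is taken as the number of eigenvalues above a relative threshold. The names `estimate_speaker_count`, `synthetic_embeddings`, and `rel_threshold` are hypothetical, and the synthetic data stands in for the output of a trained network.

```python
import numpy as np


def estimate_speaker_count(embeddings, rel_threshold=0.1):
    """Estimate the speaker count as the numerical rank of the mean
    (uncentered) covariance matrix of the embedding vectors.

    rel_threshold is an assumed tuning parameter: eigenvalues below
    rel_threshold * (largest eigenvalue) are treated as noise.
    """
    v = np.asarray(embeddings, dtype=float)   # shape (N, D)
    mean_cov = v.T @ v / v.shape[0]           # average of outer products, (D, D)
    eigvals = np.linalg.eigvalsh(mean_cov)    # real eigenvalues, ascending order
    return int(np.sum(eigvals > rel_threshold * eigvals.max()))


def synthetic_embeddings(num_speakers, dim=20, num_points=600,
                         noise=0.01, seed=0):
    """Toy stand-in for network output: each point lies along one of
    num_speakers mutually orthogonal unit directions, plus small noise."""
    rng = np.random.default_rng(seed)
    # Orthonormal columns, one direction per speaker.
    basis, _ = np.linalg.qr(rng.standard_normal((dim, num_speakers)))
    labels = rng.integers(0, num_speakers, size=num_points)
    clean = basis[:, labels].T                # shape (num_points, dim)
    return clean + noise * rng.standard_normal((num_points, dim))


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(k, estimate_speaker_count(synthetic_embeddings(k)))
```

With ideal embeddings (orthogonal across speakers, parallel within a speaker), the matrix has one dominant eigenvalue per speaker, so counting eigenvalues above the threshold recovers the count. On real network output the threshold would need tuning, since the embeddings are only approximately orthogonal.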


DOI: 10.21437/Interspeech.2020-1781

Cite as: Peng, C., Wu, X., Qu, T. (2020) Competing Speaker Count Estimation on the Fusion of the Spectral and Spatial Embedding Space. Proc. Interspeech 2020, 3077-3081, DOI: 10.21437/Interspeech.2020-1781.


@inproceedings{Peng2020,
  author={Chao Peng and Xihong Wu and Tianshu Qu},
  title={{Competing Speaker Count Estimation on the Fusion of the Spectral and Spatial Embedding Space}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3077--3081},
  doi={10.21437/Interspeech.2020-1781},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1781}
}