Phonotactic Language Identification for Singing

Anna M. Kruspe

Over the past decades, many successful approaches for language identification have been published. However, almost none of these approaches were developed with singing in mind. Singing differs from speech in many respects, including a wider variance of fundamental frequencies and phoneme durations, vibrato, pronunciation differences, and different semantic content.

We present a new phonotactic language identification system for singing based on phoneme posteriorgrams. These posteriorgrams were extracted using acoustic models trained on English speech (TIMIT) and on an unannotated English-language a-capella singing dataset (DAMP). SVM models were then trained on phoneme statistics.
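As a rough illustration of how a posteriorgram might be reduced to fixed-length phoneme statistics before SVM training, consider the minimal sketch below. The abstract does not specify which statistics are used, so the choice here (mean posteriors plus relative bigram frequencies of the frame-wise most likely phonemes) is an assumption, and the function name `phoneme_statistics` and the toy data are hypothetical.

```python
from collections import Counter


def phoneme_statistics(posteriorgram, num_phonemes):
    """Reduce a posteriorgram (frames x phonemes) to one feature vector.

    Assumed statistics, not taken from the paper: the mean posterior
    per phoneme, followed by relative bigram frequencies.
    """
    if not posteriorgram:
        raise ValueError("posteriorgram must contain at least one frame")
    num_frames = len(posteriorgram)

    # Mean posterior per phoneme: a soft unigram distribution.
    means = [
        sum(frame[p] for frame in posteriorgram) / num_frames
        for p in range(num_phonemes)
    ]

    # Most likely phoneme per frame, with consecutive repeats collapsed.
    decoded = []
    for frame in posteriorgram:
        best = max(range(num_phonemes), key=lambda p: frame[p])
        if not decoded or decoded[-1] != best:
            decoded.append(best)

    # Relative bigram frequencies over the collapsed phoneme sequence.
    bigrams = Counter(zip(decoded, decoded[1:]))
    total = sum(bigrams.values()) or 1
    bigram_features = [
        bigrams[(a, b)] / total
        for a in range(num_phonemes)
        for b in range(num_phonemes)
    ]
    return means + bigram_features


# Toy posteriorgram: 4 frames over a hypothetical 3-phoneme inventory.
toy = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
features = phoneme_statistics(toy, num_phonemes=3)
```

One such vector per recording, paired with its language label, could then be passed to any standard SVM implementation.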

The models are evaluated on a set of amateur singing recordings from YouTube, and, for comparison, on the OGI Multilanguage corpus.

While the results on a-capella singing are somewhat worse than those previously obtained using i-vector extraction, this approach is easier to implement. Phoneme posteriorgrams already need to be extracted for many applications, and with this approach they can easily be reused for language identification. The results on singing improve significantly when the utilized acoustic models have also been trained on singing. Interestingly, the best results on the OGI speech corpus are also obtained when acoustic models trained on singing are used.

DOI: 10.21437/Interspeech.2016-131

Cite as

Kruspe, A.M. (2016) Phonotactic Language Identification for Singing. Proc. Interspeech 2016, 3319-3323.

author={Anna M. Kruspe},
title={Phonotactic Language Identification for Singing},
booktitle={Interspeech 2016},