What Does an End-to-End Dialect Identification Model Learn About Non-Dialectal Information?

Shammur A. Chowdhury, Ahmed Ali, Suwon Shon, James Glass


An end-to-end dialect identification system generates the likelihood of each dialect given a speech utterance. Its performance relies on its ability to discriminate between the acoustic properties of different dialects, even though the input signal also contains non-dialectal information such as speaker and channel. In this work, we study how non-dialectal information is encoded inside the end-to-end dialect identification model. We design several proxy tasks to understand the model’s ability to represent speech input for differentiating non-dialectal information, such as (a) gender and voice identity of speakers, (b) languages, and (c) channel (recording and transmission) quality, and compare with dialectal information (i.e., predicting the geographic region of the dialects). By analyzing non-dialectal representations from layers of an end-to-end Arabic dialect identification (ADI) model, we observe that the model retains gender and channel information throughout the network while learning a speaker-invariant representation. Our findings also suggest that the CNN layers of the end-to-end model mirror feature extractors capturing voice-specific information, while the fully-connected layers encode more dialectal information.
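
The proxy-task analysis described in the abstract can be illustrated with a minimal probing sketch. This is not the authors' implementation: it assumes that a simple linear softmax classifier is trained on frozen per-layer embeddings, and it uses synthetic data in place of real ADI model activations. The names `train_softmax_probe`, `probe_accuracy`, `make_layer`, and the two layer labels are hypothetical. The idea is that a probe scoring well above chance on held-out data suggests the layer encodes the property, while chance-level accuracy suggests it does not.

```python
import numpy as np

def train_softmax_probe(x, y, num_classes, epochs=300, lr=0.1):
    """Fit a linear softmax classifier (the probe) on frozen features."""
    n, d = x.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(epochs):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b

def probe_accuracy(x, y, w, b):
    """Fraction of examples whose predicted class matches the label."""
    return float(np.mean(np.argmax(x @ w + b, axis=1) == y))

def make_layer(rng, labels, dim, signal):
    """Hypothetical stand-in for one layer's utterance-level embeddings.

    Gaussian noise plus a label-dependent shift along one direction;
    signal=0.0 means the layer carries no information about the label.
    """
    noise = rng.normal(size=(labels.size, dim))
    direction = np.ones(dim) / np.sqrt(dim)
    return noise + signal * np.outer(2 * labels - 1, direction)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=600)  # synthetic binary property (e.g. gender)
layers = {
    "layer_with_property": make_layer(rng, labels, 16, signal=3.0),
    "layer_without_property": make_layer(rng, labels, 16, signal=0.0),
}

train, test = slice(0, 400), slice(400, 600)
results = {}
for name, feats in layers.items():
    w, b = train_softmax_probe(feats[train], labels[train], num_classes=2)
    results[name] = probe_accuracy(feats[test], labels[test], w, b)

print(results)
```

With real data, the synthetic `layers` dictionary would be replaced by embeddings extracted from each layer of the trained model, and the probe would be retrained once per layer and per proxy task.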


DOI: 10.21437/Interspeech.2020-2235

Cite as: Chowdhury, S.A., Ali, A., Shon, S., Glass, J. (2020) What Does an End-to-End Dialect Identification Model Learn About Non-Dialectal Information? Proc. Interspeech 2020, 462-466, DOI: 10.21437/Interspeech.2020-2235.


@inproceedings{Chowdhury2020,
  author={Shammur A. Chowdhury and Ahmed Ali and Suwon Shon and James Glass},
  title={{What Does an End-to-End Dialect Identification Model Learn About Non-Dialectal Information?}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={462--466},
  doi={10.21437/Interspeech.2020-2235},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2235}
}