Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?

Jialu Li, Mark Hasegawa-Johnson


Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; tones, the suprasegmental units of the IPA, are used in perhaps 70% of them. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored multilingual and cross-lingual transfer of the synchronization between phones and tones. In this paper, we test four Connectionist Temporal Classification (CTC)-based acoustic models that differ in the degree of synchrony they impose between phones and tones. The models are trained and tested multilingually on three languages, then adapted and tested cross-lingually on a fourth. Both synchronous and asynchronous models are effective in both multilingual and cross-lingual settings: synchronous models achieve a lower error rate on the joint phone+tone tier, whereas asynchronous training yields a lower tone error rate.
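The key modeling distinction in the abstract can be made concrete with a small sketch. This is an illustrative example only, not the paper's implementation: a synchronous model uses a single CTC output tier over joint phone+tone labels (forcing both to share one alignment), while an asynchronous model uses two independent CTC tiers over a shared encoder. The phone and tone inventories below are hypothetical placeholders.

```python
from itertools import product

# Hypothetical example inventories (not from the paper's languages).
phones = ["a", "i", "u"]       # segmental units
tones = ["1", "2", "3", "4"]   # suprasegmental units

# Synchronous model: one CTC softmax over joint phone+tone labels,
# so phones and tones must align to the same acoustic frames.
joint_labels = [p + t for p, t in product(phones, tones)]

# Asynchronous model: two separate CTC output tiers (one per label set),
# so phone and tone alignments may drift apart in time.
phone_labels = list(phones)
tone_labels = list(tones)

# The joint inventory grows multiplicatively, the separate tiers additively.
print(len(joint_labels))                      # 3 x 4 = 12 output units
print(len(phone_labels) + len(tone_labels))   # 3 + 4 = 7 output units
```

One practical consequence of this difference: the synchronous model's output layer scales with the product of the phone and tone inventory sizes, while the asynchronous model's two output layers scale with their sum.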


DOI: 10.21437/Interspeech.2020-1834

Cite as: Li, J., Hasegawa-Johnson, M. (2020) Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous? Proc. Interspeech 2020, 1027-1031, DOI: 10.21437/Interspeech.2020-1834.


@inproceedings{Li2020,
  author={Jialu Li and Mark Hasegawa-Johnson},
  title={{Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1027--1031},
  doi={10.21437/Interspeech.2020-1834},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1834}
}