Language Modeling for Speech Analytics in Under-Resourced Languages

Simone Wills, Pieter Uys, Charl van Heerden, Etienne Barnard


Different language modeling approaches are evaluated on two under-resourced, agglutinative South African languages: Sesotho and isiZulu. The two languages present different challenges to language modeling because of their respective orthographies: isiZulu is written conjunctively, whereas Sesotho is written disjunctively. Two subword modeling approaches are evaluated and shown to be useful for reducing the OOV rate for isiZulu. For Sesotho, a multi-word approach is evaluated for improving ASR accuracy, with limited success. RNNs are also evaluated and shown to slightly improve ASR accuracy, despite the relatively small text corpora.
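
To illustrate why subword units can lower the OOV rate for a conjunctively written language, the following is a minimal sketch rather than the authors' method. The function names `oov_rate` and `split_fixed` are hypothetical, the tokens are synthetic placeholders rather than real isiZulu words, and fixed-size chunking is only a stand-in for the two subword approaches evaluated in the paper.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


def split_fixed(token, size=3):
    """Assumed stand-in segmenter: cut a token into fixed-size chunks."""
    return [token[i:i + size] for i in range(0, len(token), size)]


# Synthetic placeholder tokens (not real words): long forms built from
# a small set of recurring pieces, as in conjunctive orthographies.
train_words = ["abcxyz", "defxyz"]
test_words = ["abcdef", "defxyz"]

word_level = oov_rate(train_words, test_words)

train_subwords = [piece for w in train_words for piece in split_fixed(w)]
test_subwords = [piece for w in test_words for piece in split_fixed(w)]
subword_level = oov_rate(train_subwords, test_subwords)

print(f"word-level OOV rate:    {word_level:.2f}")
print(f"subword-level OOV rate: {subword_level:.2f}")
```

In this toy setup the unseen word form is composed entirely of pieces that appear in training, so segmenting it removes the OOV token; real corpora would show a smaller, data-dependent effect.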


DOI: 10.21437/Interspeech.2020-1586

Cite as: Wills, S., Uys, P., van Heerden, C., Barnard, E. (2020) Language Modeling for Speech Analytics in Under-Resourced Languages. Proc. Interspeech 2020, 4941-4945, DOI: 10.21437/Interspeech.2020-1586.


@inproceedings{Wills2020,
  author={Simone Wills and Pieter Uys and Charl van Heerden and Etienne Barnard},
  title={{Language Modeling for Speech Analytics in Under-Resourced Languages}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4941--4945},
  doi={10.21437/Interspeech.2020-1586},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1586}
}