Improving Code-Switched Language Modeling Performance Using Cognate Features

Victor Soto, Julia Hirschberg

We have found that cognate words, defined as sets of words used in multiple languages that share a common etymology, can elicit code-switching, or language mixing, between those languages. This paper focuses on how information about cognate words can improve language modeling performance on code-switched English-Spanish (EN-ES) language. We have found that the degree of semantic, phonetic or lexical overlap between a pair of cognate words is a useful feature for identifying code-switching. We derive a set of spelling, phonetic and semantic features from a list of EN-ES cognates and run experiments on a corpus of conversational code-switched EN-ES. First, we show that there is a strong statistical relationship between these cognate-based features and code-switching in the corpus. Second, we demonstrate that language models using these features obtain performance improvements similar to those obtained with manually tagged features, including language and part-of-speech tags. We conclude that cognate features are a useful set of automatically derived features that can be easily obtained for any pair of languages.
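As an illustration of what a spelling-overlap feature for a cognate pair could look like, the following is a minimal sketch that scores a pair by normalized edit distance. It is an assumption made for illustration only: the abstract does not specify how the spelling, phonetic or semantic features are computed, and the function names `levenshtein` and `spelling_similarity` and the example pairs are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def spelling_similarity(en: str, es: str) -> float:
    """Hypothetical spelling feature in [0, 1]; 1.0 means identical spelling."""
    en, es = en.lower(), es.lower()
    longest = max(len(en), len(es))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(en, es) / longest


if __name__ == "__main__":
    # Illustrative EN-ES pairs, not taken from the paper's cognate list.
    for en, es in [("animal", "animal"), ("nation", "nación"), ("family", "familia")]:
        print(en, es, round(spelling_similarity(en, es), 3))
```

A score like this could be attached to each token as an extra input to a language model, alongside analogous phonetic and semantic scores; whether the paper uses this exact formulation is not stated here.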

DOI: 10.21437/Interspeech.2019-2681

Cite as: Soto, V., Hirschberg, J. (2019) Improving Code-Switched Language Modeling Performance Using Cognate Features. Proc. Interspeech 2019, 3725-3729, DOI: 10.21437/Interspeech.2019-2681.

  author={Victor Soto and Julia Hirschberg},
  title={{Improving Code-Switched Language Modeling Performance Using Cognate Features}},
  booktitle={Proc. Interspeech 2019},