13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Normalization of Text Messages Using Character- and Phone-based Machine Translation Approaches

Chen Li, Yang Liu

Computer Science Department, The University of Texas at Dallas, Richardson, TX, USA

SMS and Twitter messages contain many abbreviations and non-standard words, which are problematic for text-to-speech (TTS) or language processing techniques applied to these data. A character-based machine translation (MT) approach was previously used to normalize non-standard words. In this paper, we propose a two-step translation method that leverages phonetic information: non-standard words are first translated to possible pronunciations, which are then translated to standard words. We further combine it with the single-step character-based translation module. Our experiments show that the proposed method significantly outperforms previous results in both n-best coverage and 1-best accuracy.
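The two-step idea described in the abstract can be illustrated with a small sketch. This is not the authors' system: the lookup tables below are hypothetical toy stand-ins for trained translation models, the probabilities are invented for illustration, and the linear interpolation used to combine the phone-based and character-based candidates is an assumption, since the abstract does not say how the two modules are combined.

```python
from collections import defaultdict

# Hypothetical toy tables standing in for trained MT models (not from the paper).
# Step 1: non-standard word -> candidate pronunciations with probabilities.
WORD_TO_PHONES = {
    "2moro": [("t ah m aa r ow", 0.7), ("t uw m ao r ow", 0.3)],
}
# Step 2: pronunciation -> candidate standard words with probabilities.
PHONES_TO_WORD = {
    "t ah m aa r ow": [("tomorrow", 0.9), ("tomorrows", 0.1)],
    "t uw m ao r ow": [("tomorrow", 0.6), ("two more", 0.4)],
}
# Single-step character-based model: non-standard word -> standard words.
CHAR_MODEL = {
    "2moro": [("tomorrow", 0.5), ("tomato", 0.5)],
}


def two_step_candidates(token):
    """Score standard words by summing over intermediate pronunciations."""
    scores = defaultdict(float)
    for phones, p_phones in WORD_TO_PHONES.get(token, []):
        for word, p_word in PHONES_TO_WORD.get(phones, []):
            scores[word] += p_phones * p_word
    return dict(scores)


def combined_nbest(token, weight=0.5, n=3):
    """Interpolate phone-based and character-based scores (assumed scheme)."""
    phone_scores = two_step_candidates(token)
    char_scores = dict(CHAR_MODEL.get(token, []))
    words = set(phone_scores) | set(char_scores)
    merged = {
        w: weight * phone_scores.get(w, 0.0)
        + (1 - weight) * char_scores.get(w, 0.0)
        for w in words
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    for word, score in combined_nbest("2moro"):
        print(f"{word}\t{score:.3f}")
```

In this toy setting, the n-best list returned by `combined_nbest` corresponds to what n-best coverage would be measured on, and its first entry to the 1-best hypothesis.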

Index Terms: text normalization, text-to-speech, abbreviation


Bibliographic reference. Li, Chen / Liu, Yang (2012): "Normalization of text messages using character- and phone-based machine translation approaches", In INTERSPEECH-2012, 2330-2333.