SMS and Twitter messages contain many abbreviations and non-standard words, which are problematic for text-to-speech (TTS) and other language processing techniques applied to these data. A character-based machine translation (MT) approach was previously used to normalize non-standard words. In this paper, we propose a two-step translation method that leverages phonetic information: non-standard words are first translated to possible pronunciations, which are then translated to standard words. We further combine this method with the single-step character-based translation module. Our experiments show that the proposed method significantly outperforms previous results in both n-best coverage and 1-best accuracy.
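The flow of the two-step method and its combination with the single-step module can be illustrated with a minimal sketch. Small lookup tables stand in for the trained translation models. All table entries, scores, function names, and the linear interpolation used to merge the two candidate lists are assumptions made for illustration. They are not taken from the paper, which does not specify these details in the abstract.

```python
from collections import defaultdict

# Toy stand-ins for trained MT models. Every entry and score is invented.
# Step 1 model: non-standard word -> candidate pronunciations (ARPAbet-style).
WORD_TO_PHONES = {
    "2moro": [("T AH M AA R OW", 0.7), ("T UW M AO R OW", 0.3)],
    "b4": [("B IH F AO R", 1.0)],
}
# Step 2 model: pronunciation -> candidate standard words.
PHONES_TO_WORD = {
    "T AH M AA R OW": [("tomorrow", 0.9), ("tomato", 0.1)],
    "T UW M AO R OW": [("tomorrow", 1.0)],
    "B IH F AO R": [("before", 1.0)],
}
# Single-step character-based model: non-standard word -> standard words.
CHAR_MODEL = {
    "2moro": [("tomorrow", 0.6), ("tumor", 0.4)],
    "b4": [("before", 0.8), ("b4", 0.2)],
}


def two_step(token):
    """Score standard words by summing over intermediate pronunciations."""
    scores = defaultdict(float)
    for phones, p_phones in WORD_TO_PHONES.get(token, []):
        for word, p_word in PHONES_TO_WORD.get(phones, []):
            scores[word] += p_phones * p_word
    return dict(scores)


def single_step(token):
    """Return the character-based candidates and their scores."""
    return dict(CHAR_MODEL.get(token, []))


def normalize(token, weight=0.5, n=3):
    """Merge both candidate lists (assumed linear interpolation), return n-best."""
    phone_based = two_step(token)
    char_based = single_step(token)
    combined = {
        word: weight * phone_based.get(word, 0.0)
        + (1.0 - weight) * char_based.get(word, 0.0)
        for word in set(phone_based) | set(char_based)
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


if __name__ == "__main__":
    for token in ("2moro", "b4"):
        print(token, normalize(token))
```

In this sketch the first element of the returned list corresponds to the 1-best hypothesis, and the full list corresponds to the n-best candidates whose coverage is evaluated.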
Index Terms: text normalization, text-to-speech, abbreviation
Bibliographic reference. Li, Chen / Liu, Yang (2012): "Normalization of text messages using character- and phone-based machine translation approaches", In INTERSPEECH-2012, 2330-2333.