5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

Segmentation Using a Maximum Entropy Approach

Kishore Papineni, Satya Dharanipragada

IBM TJ Watson Research Center, USA

Consider generating phonetic baseforms from orthographic spellings. A segmentation (grouping) of the characters, when available, can be exploited to achieve better phonetic translation. We are interested in building segmentation models without using explicit segmentation or alignment information during training. The heart of our segmentation algorithm is a conditional probabilistic model that predicts whether a word has fewer, equal, or more phones than characters. We use just this contraction-expansion information on whole words for training the model. The model has three components: a prior model, a set of features, and weights for the features. The features are selected and the weights assigned in a maximum entropy framework. Although the model is trained on whole words, we effectively localize it on substrings to induce a segmentation of the word in question. Segmentation is further aided by considering substrings in both forward and backward directions.
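The core component described above is a three-way conditional model over whole words. The following is a minimal, hypothetical sketch of that idea, not the authors' implementation: a maximum-entropy (softmax) classifier over invented binary character features that predicts whether a word's phone string is shorter than, equal to, or longer than its spelling. The feature set, training data, and gradient-ascent trainer are all illustrative assumptions; real maxent training would use feature selection with GIS/IIS or similar.

```python
# Hypothetical sketch of a maxent contraction/expansion classifier.
# Features, data, and trainer are invented for illustration only.
import math

LABELS = ["fewer", "equal", "more"]  # phones vs. characters

def features(word):
    # Toy binary features: digraphs that often contract ("ph" -> one
    # phone) and letters that often expand ("x" -> two phones).
    return {
        "has_ph": "ph" in word,
        "has_ck": "ck" in word,
        "has_x": "x" in word,
        "ends_e": word.endswith("e"),
    }

def score(weights, word):
    # Conditional maxent distribution: softmax over linear feature scores.
    f = features(word)
    logits = [sum(weights[lab].get(k, 0.0) for k, v in f.items() if v)
              for lab in LABELS]
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train(data, epochs=200, lr=0.5):
    # Simple gradient ascent on the conditional log-likelihood.
    weights = {lab: {} for lab in LABELS}
    for _ in range(epochs):
        for word, lab in data:
            probs = score(weights, word)
            f = features(word)
            for i, l in enumerate(LABELS):
                grad = (1.0 if l == lab else 0.0) - probs[i]
                for k, v in f.items():
                    if v:
                        weights[l][k] = weights[l].get(k, 0.0) + lr * grad
    return weights

# Toy training pairs: word -> whether its phone count is fewer/equal/more
# than its character count (labels assigned by hand for this example).
data = [("phone", "fewer"), ("back", "fewer"),
        ("cat", "equal"), ("dog", "equal"),
        ("box", "more"), ("fix", "more")]
w = train(data)
probs = score(w, "graph")
print(LABELS[probs.index(max(probs))])  # "ph" suggests contraction
```

In the paper's setting this whole-word classifier would then be applied to substrings (in both directions) to localize the contraction-expansion evidence and induce a segmentation; that step is omitted here.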


Bibliographic reference: Papineni, Kishore / Dharanipragada, Satya (1998): "Segmentation using a maximum entropy approach", in ICSLP-1998, paper 0559.