5th European Conference on Speech Communication and Technology

Rhodes, Greece
September 22-25, 1997

A Maximum Likelihood Model for Topic Classification of Broadcast News

Richard Schwartz (1), Toru Imai (2), Francis Kubala (1), Long Nguyen (1), John Makhoul (1)

(1) BBN Systems and Technologies, Cambridge, MA, USA; (2) NHK (Japan Broadcasting Corp.) Sci. & Tech. Res. Labs., Tokyo, Japan

We describe a new algorithm for topic classification that allows discrimination among thousands of topics. A mixture of topics explicitly models the fact that each story has multiple topics, that different words are related to different topics, and that most of the words are not related to any topic. The resulting model, trained by the expectation-maximization (EM) algorithm, has sharper word distributions that result in more accurate topic classification. We tested the algorithm on transcribed broadcast news texts. When the model was trained on one year of stories containing over 5,000 different topics and tested on new (later) stories, the first-choice topic was among the manually annotated topics 76% of the time.
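
The mixture idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each word in a story is generated either by one of the story's annotated topics or by a topic-independent "general language" component, with equal mixture weights, and it re-estimates the word distributions by EM. The names `train`, `classify`, and `GENERAL`, the smoothing constant, and the toy stories are all hypothetical.

```python
import math
from collections import defaultdict

GENERAL = "<general>"  # hypothetical label for the topic-independent component


def train(stories, iterations=20, smoothing=1e-3):
    """Estimate p(word | topic) by EM from (words, topics) pairs."""
    vocab = sorted({w for words, _ in stories for w in words})
    topics = sorted({t for _, ts in stories for t in ts}) + [GENERAL]
    # Start every topic from a uniform word distribution.
    p = {t: {w: 1.0 / len(vocab) for w in vocab} for t in topics}
    for _ in range(iterations):
        counts = {t: defaultdict(float) for t in topics}
        for words, story_topics in stories:
            candidates = list(story_topics) + [GENERAL]
            for w in words:
                # E-step: share each word among the story's topics and the
                # general component (equal mixture weights are an assumption).
                total = sum(p[t][w] for t in candidates)
                for t in candidates:
                    counts[t][w] += p[t][w] / total
        # M-step: renormalize the fractional counts, with light smoothing.
        for t in topics:
            norm = sum(counts[t].values()) + smoothing * len(vocab)
            p[t] = {w: (counts[t][w] + smoothing) / norm for w in vocab}
    return p


def classify(p, words):
    """Return the topic that best explains the words alongside GENERAL."""
    def score(t):
        return sum(math.log(0.5 * p[t][w] + 0.5 * p[GENERAL][w])
                   for w in words if w in p[GENERAL])
    return max((t for t in p if t != GENERAL), key=score)


# Toy data (invented for illustration): "the" occurs in every story.
stories = [
    (["the", "election", "vote", "the"], ["politics"]),
    (["the", "game", "score", "the"], ["sports"]),
    (["the", "vote", "election"], ["politics"]),
    (["the", "score", "game"], ["sports"]),
]
model = train(stories)
print(classify(model, ["the", "vote"]))
```

In this toy setting the general component absorbs most of the probability mass of "the", so the per-topic distributions concentrate on topic-specific words, which is the "sharper distributions" effect the abstract describes.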

Bibliographic reference. Schwartz, Richard / Imai, Toru / Kubala, Francis / Nguyen, Long / Makhoul, John (1997): "A maximum likelihood model for topic classification of broadcast news", in EUROSPEECH-1997, 1455-1458.