We evaluate the limitations of the bag-of-words assumption for topic identification of conversational discourse by examining whether topic-dependent word occurrence statistics are also position-independent. We demonstrate where the assumption is violated in conversational speech corpora and show how the relevance of words to the classification task decreases over the length of the document. We seek to improve topic identification by modeling this topic drift phenomenon, weighting word counts according to a decay function over the length of the document. By applying a global decay rate for all words, we observe a 23.47% relative reduction in error rates on conversational corpora. Furthermore, we apply a minimum classification error (MCE) training procedure to learn per-word decay rates, reducing error rates by up to an additional 27%.
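A minimal sketch of the count-weighting idea, assuming an exponential decay form (the abstract does not specify the decay function; the rate value and function names here are illustrative, not from the paper):

```python
import math
from collections import Counter

def decay_weighted_counts(tokens, rate=0.1):
    """Weight each word occurrence by exp(-rate * position), so
    occurrences later in the document contribute less to the
    bag-of-words vector (exponential form is an assumption)."""
    counts = Counter()
    for pos, word in enumerate(tokens):
        counts[word] += math.exp(-rate * pos)
    return counts

# With rate=0, this reduces to ordinary unweighted word counts;
# a per-word rate (as learned by MCE training) would replace the
# single global `rate` with a dictionary keyed by word.
doc = "the topic drifts as the conversation continues".split()
weights = decay_weighted_counts(doc, rate=0.1)
```

Setting `rate=0` recovers the standard position-independent bag-of-words counts, which makes the decay rate a natural knob to tune (globally, or per word) against classification error.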
Bibliographic reference. Wintrode, Jonathan (2013): "Leveraging locality for topic identification of conversational speech", In INTERSPEECH-2013, 1579-1583.