This paper describes a phone duration model applied to speech recognition. The model is based on a decision tree that finds clusters of phones in various contexts that tend to have similar durations. Wide contexts with rich linguistic and phonetic features are used. To better model varying and non-stationary speaking rates, the contextual features also include the observed duration values of previous phones. For each resulting phone cluster, a log-normal distribution of duration is estimated. The resulting decision tree and the log-normal distributions are used to calculate likelihoods of phone durations in N-best lists. Experiments on two Estonian recognition tasks show a small but significant improvement in speech recognition accuracy.
Bibliographic reference. Alumäe, Tanel / Nemoto, Rena (2013): "Phone duration modeling using clustering of rich contexts", In INTERSPEECH-2013, 1801-1805.