This paper proposes an approach, named phonetic context embedding, to model phonetic context effects for deep neural network-hidden Markov model (DNN-HMM) phone recognition. Phonetic context embeddings can be regarded as continuous, distributed vector representations of context-dependent phonetic units (e.g., triphones). In this work they are computed using neural networks. First, all phone labels are mapped into vectors of binary distinctive features (DFs, e.g., nasal/not-nasal). Then, for each speech frame, the corresponding DF vector is concatenated with the DF vectors of previous and next frames and fed into a neural network that is trained to estimate the acoustic coefficients (e.g., MFCCs) of that frame. The activations of the first hidden layer constitute the embedding of the input DF vectors. Finally, the resulting embeddings are used as secondary-task targets in a multi-task learning (MTL) setting when training the DNN that computes phone state posteriors. The approach makes it easy to encode a much larger context than alternative MTL-based approaches. Results on TIMIT with a fully connected DNN show phone error rate (PER) reductions from 22.4% to 21.0% on the core test set and from 21.3% to 19.8% on the validation set, as well as a lower PER than a strong alternative MTL approach.
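The embedding computation described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the feature counts, context width, hidden size, network depth, and training details are all illustrative assumptions, and the data is synthetic.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's configuration):
# 20 binary distinctive features per phone, context of +/-2 frames,
# 13 MFCCs per frame, 64-dimensional embedding.
N_DF, CONTEXT, N_MFCC, N_EMBED = 20, 2, 13, 64
rng = np.random.default_rng(0)

def df_context_vector(df_frames, t, context=CONTEXT):
    """Concatenate the DF vector of frame t with those of its `context`
    previous and next frames (edge frames are clamped/repeated)."""
    T = len(df_frames)
    idx = [min(max(t + k, 0), T - 1) for k in range(-context, context + 1)]
    return np.concatenate([df_frames[i] for i in idx])

# Toy "utterance": 50 frames of binary DF vectors and their MFCC targets.
T = 50
df_frames = rng.integers(0, 2, size=(T, N_DF)).astype(float)
mfcc = rng.standard_normal((T, N_MFCC))

# Regression network: DF context window -> MFCCs of the centre frame.
# The first hidden layer's activations are taken as the embedding.
in_dim = N_DF * (2 * CONTEXT + 1)
W1 = rng.standard_normal((in_dim, N_EMBED)) * 0.1
b1 = np.zeros(N_EMBED)
W2 = rng.standard_normal((N_EMBED, N_MFCC)) * 0.1
b2 = np.zeros(N_MFCC)

X = np.stack([df_context_vector(df_frames, t) for t in range(T)])

def forward(X):
    h = np.tanh(X @ W1 + b1)   # hidden activations = context embeddings
    y = h @ W2 + b2            # predicted acoustic coefficients (MFCCs)
    return h, y

# A few steps of plain gradient descent on the MSE reconstruction loss.
lr = 0.05
for _ in range(200):
    h, y = forward(X)
    err = (y - mfcc) / T                 # dLoss/dy for mean squared error
    gW2, gb2 = h.T @ err, err.sum(0)
    dh = (err @ W2.T) * (1.0 - h ** 2)   # backprop through tanh
    gW1, gb1 = X.T @ dh, dh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

embeddings, _ = forward(X)  # (T, N_EMBED): one embedding per frame
```

In the paper's setup, these per-frame embeddings would then serve as the secondary regression targets when training the phone-state classification DNN in the MTL configuration.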

DOI: `10.21437/Interspeech.2016-1036`

Cite as

Badino, L. (2016) Phonetic Context Embeddings for DNN-HMM Phone Recognition. Proc. Interspeech 2016, 405-409.

Bibtex

@inproceedings{Badino2016,
  author={Leonardo Badino},
  title={Phonetic Context Embeddings for {DNN-HMM} Phone Recognition},
  year=2016,
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-1036},
  url={http://dx.doi.org/10.21437/Interspeech.2016-1036},
  pages={405--409}
}